Analysis of Functional Genomics and Proteomics Data

We have made considerable progress in predicting gene and protein expression levels from proteomic features. In particular, we published papers relating protein abundance to gene expression levels (Greenbaum et al., 2001, 2003), and integrated this with a variety of protein features, examining the degree to which protein abundance and gene expression levels differ with respect to those features. Specifically, we proposed a methodology that creates reference data sets that remove the biases of individual data sets (Greenbaum et al., 2002). In addition, instead of comparing individual genes, we compared broad categories of genomic information and found significant trends in the underlying data. These included an overall weak correlation between mRNA and protein levels, with many outlying genes. We have encapsulated all of this in a web-based tool called PARE (proteomics.gersteinlab.org) (Yu et al., submitted).
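
As a rough illustration of the two kinds of comparison described above (gene by gene versus broad categories), the following minimal Python sketch computes a Pearson correlation between log-transformed mRNA and protein levels and averages both quantities within categories. The function names, the record layout, and the use of Pearson correlation on log10 values are assumptions made for illustration; they are not taken from the cited papers or from PARE.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def gene_level_correlation(records):
    """Correlate log10 mRNA with log10 protein across individual genes.

    records: iterable of (category, mrna_level, protein_level) tuples
    (a hypothetical layout; all levels must be positive).
    """
    records = list(records)
    log_mrna = [math.log10(m) for _, m, _ in records]
    log_prot = [math.log10(p) for _, _, p in records]
    return pearson(log_mrna, log_prot)


def category_means(records):
    """Mean log10 mRNA and mean log10 protein for each category."""
    grouped = defaultdict(list)
    for category, mrna, protein in records:
        grouped[category].append((math.log10(mrna), math.log10(protein)))
    return {
        category: (
            sum(m for m, _ in values) / len(values),
            sum(p for _, p in values) / len(values),
        )
        for category, values in grouped.items()
    }
```

Averaging within categories smooths out the scatter of individual outlying genes, which is one reason category-level trends can be clearer than gene-level ones.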

We then developed a model to predict levels of gene expression and protein abundance from various features of the protein sequence (Jansen et al., 2003), in particular by calculating the CAI (Codon Adaptation Index) from various compositional biases. We employ a statistical approach in which we fit a number of simple models to the observed gene expression and protein abundance data, with the aim of updating the classic CAI method, which was developed ~15 years ago.
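
For readers unfamiliar with the classic CAI, the sketch below shows its standard definition: each codon receives a relative adaptiveness weight (its count in a reference set of genes divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of the weights of its codons. This is the classic index only, not the updated models described above. The three-amino-acid codon table, the 0.5 pseudocount for unseen codons, and all function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Toy synonymous-codon table covering three amino acids only (assumption:
# a real analysis would use the full genetic code).
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "E": ["GAA", "GAG"],
}


def codons(seq):
    """Split a coding sequence into consecutive triplets, dropping any remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon: its count over the count of the top synonymous codon."""
    counts = Counter(c for seq in reference_seqs for c in codons(seq))
    weights = {}
    for family in SYNONYMOUS.values():
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # A pseudocount of 0.5 for unseen codons avoids log(0) (assumption).
            weights[c] = (counts[c] or 0.5) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons with a defined weight")
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the most used codons of the reference set scores 1.0, and the score falls as rarer synonymous codons are used. Codons outside the table are skipped, so in this sketch they do not affect the score.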