Greenbaum et al
the comparison of individual genes. Previous analyses have shown that differences between
mRNA expression and protein abundance levels can be quite dramatic for individual genes. This
may be due either to noise in the data or to fundamental biological processes. However, our
analyses show that the variation between the transcriptome and the translatome is much smaller for global
properties that are computed by averaging over the properties of many individual genes.
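The averaging argument can be illustrated with a minimal sketch. It is not the procedure used in this paper: the gene identifiers, category labels, and numerical values below are hypothetical toy data, deliberately contrived so that per-gene disagreements cancel within each category, and the helper names `category_means` and `relative_difference` are introduced only for this illustration.

```python
from collections import defaultdict
from statistics import mean


def category_means(values, categories):
    """Average a per-gene quantity over the genes in each category.

    values:     dict mapping gene id -> measured level (mRNA or protein)
    categories: dict mapping gene id -> category label
    Genes without a category label are skipped.
    """
    grouped = defaultdict(list)
    for gene, level in values.items():
        if gene in categories:
            grouped[categories[gene]].append(level)
    return {cat: mean(levels) for cat, levels in grouped.items()}


def relative_difference(a, b):
    """Symmetric relative difference between two positive numbers."""
    return abs(a - b) / ((a + b) / 2)


# Hypothetical toy values (each set normalized to sum to 1); not real measurements.
mrna = {"g1": 0.10, "g2": 0.30, "g3": 0.40, "g4": 0.20}
protein = {"g1": 0.30, "g2": 0.10, "g3": 0.20, "g4": 0.40}
category = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}

# Gene-by-gene comparison: every gene disagrees substantially.
per_gene = {g: relative_difference(mrna[g], protein[g]) for g in mrna}

# Category-level comparison: average first, then compare.
mrna_cat = category_means(mrna, category)
protein_cat = category_means(protein, category)
per_category = {
    c: relative_difference(mrna_cat[c], protein_cat[c]) for c in mrna_cat
}

print("per gene:    ", per_gene)
print("per category:", per_category)
```

In this contrived case the category averages agree even though no individual gene does. With real data the cancellation would be only partial, so the sketch shows the direction of the effect rather than its size.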
METHODS
Data Sources Used
For our analysis we collected many diverse data sets, representing protein abundance and mRNA
expression experiments and also other sources of genome annotation. These are all summarized in
Table 1. Briefly, they included two protein abundance sets, measured via 2-dimensional gel
electrophoresis and mass spectrometry. We termed these 2-DE #1 (Gygi et al. 1999) and 2-DE #2
(Futcher et al. 1999). These sets, while admittedly small in comparison to the size of expression
data sets, represent the largest amount of information on protein abundance that is currently
publicly available. We also applied our methodology, with limited success, to the semi-quantitative transposon
insertion data set, which measures the LacZ expression of fusion proteins (Ross-Macdonald et al.
1999). Although this set contains many more genes than either of the gel electrophoresis sets, and
thus is an appealing source of protein abundance information, the more qualitative nature of the
data makes comparisons with other data sets difficult.
Our mRNA expression data came from multiple laboratories that used either Gene Chip or SAGE
technology. The Gene Chip sets included the Young Expression Set (Holstege et al. 1998), the
Church Expression Set (Roth et al. 1998) and the Samson Expression Set (Jelinsky & Samson
1999). We used data representing the vegetative state of yeast from all of the above experiments.
We also compiled two reference sets to be used in our comparisons, one for protein abundance and
another for mRNA expression (summarized below). Finally, we used many different types of
genome annotation in our analysis, which are summarized in Table 1. In particular, the Munich
Information Center for Protein Sequences (MIPS), a site containing a large number of databases
(Mewes et al. 2000), proved to be an invaluable source of data, particularly with regard to functional
categories.
Biases in the Data
There is a caveat to the use of data from high-throughput experimentation (i.e. microarrays and
two-dimensional gel electrophoresis). In all high-throughput expression studies it is
difficult to maintain consistent biological and processing conditions across the assay.
Moreover, the databases that annotate the specific genes may not always be accurate (Ishii et al.
2000). Gene Chip experiments suffer from cross-hybridization and from the saturation of probes
for the highly expressed genes. SAGE data are not always reliable for assessing ORFs with low
expression levels. With regard to 2D gels, although the technology has undergone many
improvements since its introduction over a quarter century ago (Klose 1975; O'Farrell 1975), there
remain many aspects of the procedure that introduce biases into the data. These include the
inability to resolve membrane proteins (approximately 30% of the genome) and basic proteins
(Gerstein 1998; Krogh et al. 2001). Moreover, there exist some biases in the data that, as in any
compilation, reflect the tendencies of the investigator. These include the lack of low abundance