the comparison of individual genes. Previous analyses have shown that differences between mRNA expression and protein abundance levels can be quite dramatic for individual genes. This may be due either to noise in the data or to fundamental biological processes. However, our analyses show that the variation between transcriptome and translatome is much smaller for global properties that are computed by averaging over the properties of many individual genes.

METHODS

Data Sources Used

For our analysis we culled many diverse data sets, representing protein abundance and mRNA expression experiments as well as other sources of genome annotation. These are all summarized in Table 1. Briefly, they included two protein abundance sets, measured via 2-dimensional gel electrophoresis and mass spectrometry. We termed these 2-DE #1 (Gygi et al. 1999) and 2-DE #2 (Futcher et al. 1999). These sets, while admittedly small in comparison to the expression data sets, represent the largest amount of information on protein abundance currently publicly available. We also applied our methodology, with limited success, to the semi-quantitative transposon insertion data set that measures the LacZ expression of fusion proteins (Ross-Macdonald et al. 1999). Although this set contains many more genes than either of the gel electrophoresis sets, and is thus an appealing source of protein abundance information, the more qualitative nature of the data makes comparisons with other data sets difficult.

Our mRNA expression data came from multiple laboratories that used either Gene Chip or SAGE technology. The Gene Chip sets included the Young Expression Set (Holstege et al. 1998), the Church Expression Set (Roth et al. 1998) and the Samson Expression Set (Jelinsky & Samson 1999). We used data representing the vegetative state of yeast from all of the above experiments.
We also compiled two reference sets to be used in our comparisons, one for protein abundance and another for mRNA expression (summarized below). Finally, we used many different types of genome annotation in our analysis, which are summarized in Table 1. In particular, the Munich Information Center for Protein Sequences (MIPS), a site containing a large number of databases (Mewes et al. 2000), proved to be an invaluable source of data, specifically with regard to functional categories.

Biases in the Data

There is a caveat to the use of data from high-throughput experimentation (i.e., microarrays and two-dimensional gel electrophoresis). With all high-throughput expression studies there is always the difficulty of maintaining consistent biological and processing conditions across the assay. Moreover, the databases that annotate the specific genes may not always be accurate (Ishii et al. 2000). Gene chip experiments suffer from cross-hybridization and from the saturation of probes for highly expressed genes. SAGE data are not always reliable for assessing ORFs with low expression levels. With regard to 2D gels, although the technology has undergone many improvements since its introduction over a quarter century ago (Klose 1975; O'Farrell 1975), many aspects of the procedure still introduce biases into the data. These include the inability to resolve membrane proteins (approximately 30% of the genome) and basic proteins (Gerstein 1998; Krogh et al. 2001). Moreover, some biases in the data, as in any compilation, reflect the tendencies of the investigator. These include the lack of low abundance