Greenbaum et al
distributions of random enrichments that can then be compared against the observed enrichments.
In the plot the gray bars represent the observed enrichments already shown in figure 3a. On top of
the gray bars we show standard boxplots of enrichment distributions based on 1000 random
permutations. (The middle line represents the distribution median. The upper and lower sides of
the box coincide with the upper and lower quartiles. Outliers are shown as dots and defined as data
points that are outside the range of the whiskers, the length of which is 1.5 times the
interquartile distance.) Based on the random distributions, we can compute one-sided P-values for
the observed enrichments. Amino acids for which the P-values are less than 10^-3 are shown in bold
font.
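The permutation procedure described in this caption can be sketched as follows. This is a minimal illustration rather than the calculation actually used for the figure: it assumes that the enrichment of an amino acid is the relative change of its expression-weighted fraction against its plain genomic fraction, and that a "random permutation" reshuffles the expression levels among the genes. The function names, the input layout, and the (extreme + 1) / (n + 1) form of the P-value are all assumptions made for the example.

```python
import random
import statistics


def enrichment(counts, lengths, weights):
    """Relative change of an amino acid's fraction when genes are weighted.

    counts[i]  : occurrences of the amino acid in gene i (hypothetical input)
    lengths[i] : length of gene i in residues
    weights[i] : expression level of gene i (assumed weighting scheme)
    """
    genome = sum(counts) / sum(lengths)
    weighted = (sum(w * c for w, c in zip(weights, counts))
                / sum(w * n for w, n in zip(weights, lengths)))
    return weighted / genome - 1.0


def permutation_test(counts, lengths, weights, n_perm=1000, seed=0):
    """Return the observed enrichment, the null distribution, and a one-sided P-value."""
    rng = random.Random(seed)
    observed = enrichment(counts, lengths, weights)
    shuffled = list(weights)
    null = []
    for _ in range(n_perm):
        # Assumption: the null model reassigns expression levels to genes at random.
        rng.shuffle(shuffled)
        null.append(enrichment(counts, lengths, shuffled))
    # One-sided test in the direction of the observed effect.
    if observed >= 0:
        extreme = sum(e >= observed for e in null)
    else:
        extreme = sum(e <= observed for e in null)
    # The +1 terms are a common convention (not taken from the text) that avoids P = 0.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, null, p_value


def boxplot_summary(values):
    """Median, quartiles, and outliers beyond 1.5 times the interquartile distance."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median, "q1": q1, "q3": q3, "outliers": outliers}
```

With 1000 permutations, the smallest P-value this convention can return is 1/1001, which is just below the 10^-3 threshold used for the bold font.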
Figure 4, Breakdown of the Transcriptome and Translatome in terms of Broad
Categories relating to Structure, Localization, and Function
All of the subfigures are analogous to the schematic illustration in figure 1.
Part A represents the composition of secondary structure in the different populations. In general,
the secondary structure compositions appear to be relatively stable across the different populations.
The most notable change from genome to translatome is perhaps the depletion of coils -- that is,
relatively unordered structures compared to the more structured helices and sheets -- by about 4%.
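As a rough illustration of how such a composition comparison could be computed, the sketch below derives secondary-structure fractions for an unweighted population (the genome) and for a population in which each protein is weighted by an abundance value. The weighting scheme, the function name, and the input layout are assumptions made for the example and are not taken from the text.

```python
def composition(residue_counts, weights=None):
    """Fraction of residues in each structural class across a population.

    residue_counts : list of dicts, one per protein, mapping a class name
                     (e.g. "helix", "sheet", "coil") to a residue count
    weights        : optional per-protein abundance values (assumed scheme);
                     None means every protein counts once, as in the genome
    """
    if weights is None:
        weights = [1.0] * len(residue_counts)
    totals = {}
    for counts, w in zip(residue_counts, weights):
        for cls, n in counts.items():
            totals[cls] = totals.get(cls, 0.0) + w * n
    grand_total = sum(totals.values())
    return {cls: value / grand_total for cls, value in totals.items()}


# Toy data: two hypothetical proteins, the first one far more abundant.
proteins = [{"helix": 60, "coil": 40}, {"helix": 20, "coil": 80}]
genome = composition(proteins)
weighted = composition(proteins, weights=[9.0, 1.0])
# Change in the coil fraction, in percentage points.
coil_change = 100 * (weighted["coil"] - genome["coil"])
```

In this toy case the coil fraction drops when the abundant, helix-rich protein dominates the weighted population, which is the kind of shift the panel reports.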
Part B represents the distribution of subcellular localizations associated with proteins in the
various populations. We used standardized localizations developed earlier (Drawid & Gerstein
2000), which, in turn, were derived from the MIPS, YPD, and Swiss-Prot databases (Bairoch &
Apweiler 2000; Costanzo et al. 2000; Mewes et al. 2000). The subcellular localization has been
experimentally determined for less than half of the yeast proteins, so our analysis applies only to
this subset. The most notable difference between genome, transcriptome and translatome is the
strong enrichment of cytoplasmic proteins. This enrichment agrees with our previous observations
(Drawid et al. 2000) and partly explains the observations for the functional classes
in part C. For example, the functional group "energy" is mostly dominated by the highly expressed
glycolytic proteins found in the cytoplasm. The depletion of the functional group "transcription"
is consistent with the strong depletion of nuclear proteins. We have argued before
(Drawid et al. 2000) that the number of proteins in a particular subcellular compartment may be
roughly related to the size of the compartment. For instance, membrane proteins occupy the
relatively small "two-dimensional" space in lipid bilayers. We also performed a separate,
independent calculation for a more comprehensive list of transmembrane segments, which were
predicted computationally (see caption of Table 1). This calculation largely confirms the result
(data not shown).
Part C shows the division of ORFs into different functional categories (according to the MIPS
classification) in the various populations. Only the largest functional categories of the top level of
the MIPS classification are shown. The group "Other" combines the smaller top-level categories.
The "Other" group is distinct from the group "Unclassified," which contains
genes without any functional description. One complication is that many genes have multiple
functional classifications, so they may be counted in more than one category (this explains
why the group "Unclassified" accounts for only 28% of the genome population although the