Greenbaum et al
Figure 1, Schematic overview of the analysis
On the left side we outline the terms we use to describe the process of gene expression. The coding
section of the genome is transcribed into a population of mRNA transcripts called the
"transcriptome". The transcripts in turn are translated to a population of proteins; we use the term
"translatome" for this protein population rather than the alternative "proteome" because the latter
term may be confounded with the protein complement of the genome (which is not necessarily
associated with a quantitative abundance level).
The matrix in the middle schematically shows an analysis of the three stages of expression. In
general, we define a protein "population" as a set of genes associated with a corresponding number
of expression or abundance levels ("weights"). In the matrix each row represents a weight and each
column a gene set. In particular, we differentiate between the mRNA reference expression set
(G_mRNA = G_Gen), which essentially covers the complete genome, and the reference protein
abundance set (G_Prot), which contains the proteins in data sets 2-DE #1 and 2-DE #2 (see Table 1).
We make this distinction because the protein abundance set is a significantly smaller subset of the
genome. By definition, this subset contains only proteins that can be identified by 2-D gel
electrophoresis and is therefore biased in this sense. The enrichment figures throughout this paper
compare the right and left sides of this figure and thereby show how the experimental biases of 2-D
gels affect the data set.
Each pie chart represents a composition of a particular protein feature F (for instance, an amino
acid composition) in a population (represented by the symbol μ). We can further look at the
"enrichment" of this feature in one population relative to another (represented by the symbol Δ; see
section "Methods" for an explanation of the formalism).
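As a rough illustration of these two quantities, the sketch below computes a weighted average of a per-gene feature over a population and the enrichment of that feature in one population relative to another. It assumes that the population average is a weight-normalized mean and that enrichment is the relative difference between two such means; the authoritative definitions are those in section "Methods". The function names, gene names, and numbers are hypothetical and chosen only for illustration.

```python
from typing import Mapping


def weighted_mean(feature: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of a per-gene feature over a population.

    A population is modeled as a mapping gene -> weight (expression or
    abundance level). Genes without a feature value are skipped.
    """
    genes = [g for g in weights if g in feature]
    total = sum(weights[g] for g in genes)
    if total == 0:
        raise ValueError("population has zero total weight")
    return sum(weights[g] * feature[g] for g in genes) / total


def enrichment(
    feature: Mapping[str, float],
    weights_a: Mapping[str, float],
    weights_b: Mapping[str, float],
) -> float:
    """Relative difference of the feature average in population A vs. B.

    ASSUMPTION: enrichment = (mean_A - mean_B) / mean_B; see "Methods"
    in the paper for the formalism actually used.
    """
    mean_a = weighted_mean(feature, weights_a)
    mean_b = weighted_mean(feature, weights_b)
    return (mean_a - mean_b) / mean_b


# Hypothetical toy data: fraction of one amino acid in two gene products.
feature = {"geneA": 0.10, "geneB": 0.20}
genome = {"geneA": 1.0, "geneB": 1.0}          # every gene counted once
transcriptome = {"geneA": 1.0, "geneB": 3.0}   # weighted by mRNA copies/cell

delta = enrichment(feature, transcriptome, genome)
print(f"enrichment in transcriptome relative to genome: {delta:+.1%}")
```

In this toy case the feature is more common in the highly expressed gene, so weighting by expression raises the population average and the enrichment is positive.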
For simplification, we neglect the effects of post-transcriptional and post-translational
modifications that might alter the features of proteins (they affect the expression levels but this is
largely accounted for by the measurements). In this study we analyze protein features as they are
represented in the genome.
Figure 2, mRNA expression levels vs. protein abundance levels
Part A of this figure shows the reference protein abundance levels plotted against the mRNA
reference expression levels on a log-log scale; this plot is similar to the one reported earlier by
Futcher et al. (1999). The trend line is described by the equation y = 5.20 x^0.61, where y represents
the protein abundance level (in units of 10^3 copies/cell) and x the mRNA expression level (in units of
copies/cell). The dashed lines indicate a distance of 1.85 standard deviations (in the log scale) from
the trend line. The outliers beyond the dashed lines are listed in Part B. For each of these outlier
ORFs we show a description of their function and their respective MIPS categories (the numbers
are defined in Figure 4C). With one exception, all outliers are associated with cellular organization
(MIPS category 30). Those outliers that have a high level of protein abundance relative to the
expected amount of mRNA expression are dominated by the alcohol and G3P dehydrogenases.
Translation-related proteins are prominent in the group of those proteins with low protein