Greenbaum et al
INTRODUCTION
With the recent popularity of high-throughput experimentation, biologists have begun to create a
large inventory of scientific data (Claverie 1999; Einarson & Golemis 2000; Epstein & Butow
2000; Shapiro & Harris 2000). Much of this has come from expression experiments, partially
fueled by the advent and continuous evolution of the microarray and GeneChip systems. These
experiments allow for large scale, comprehensive scans of gene expression within the cell (Schena
et al. 1995; Eisen & Brown 1999; Ferea & Brown 1999; Lipshutz 1999). Expression data sets are
currently the single richest source of information in genomics, and for yeast, expression
information now dwarfs that in the sequence alone. However, "theory" has not kept up with
experimentation in this area, and how to best interpret the vast amount of data generated by these
experiments is still a very open question (Bassett et al. 1996; Wittes & Friedman 1999; Zhang
1999; Gerstein & Jansen 2000; Searls 2000; Sherlock 2000).
Genome-wide experimentation has also been used to directly measure the cellular population of
proteins (protein abundance) (Anderson & Seilhamer 1997; Futcher et al. 1999; Gygi et al. 1999;
Ross-Macdonald et al. 1999). Understanding how protein abundance is related to mRNA transcript
levels is essential for interpreting gene expression and also, more generally, for understanding the
interactions, structures and functions in a cellular system (Hatzimanikatis et al. 1999). Moreover,
as protein concentration, rather than transcript population, is the more relevant variable with respect
to enzyme activity, it is this quantity that connects genomics to the physical chemistry and
dynamics of the cell (Kidd et al. 2001). Finally, protein abundance levels may become invaluable
for diagnostic methods as well as for determining new drug targets (Corthals 2000). High-
throughput two-dimensional gel electrophoresis (2-DE), in conjunction with mass spectrometry,
has been used to identify proteins that can then be quantified to determine protein abundance
(Futcher et al. 1999; Gygi et al. 1999; Harry et al. 2000). Other technologies include using random
integration of reporter transposons in yeast (Ross-Macdonald et al. 1999), and modifying the
microarray concept for use with proteins (Lopez 2000; MacBeath & Schreiber 2000; Nelson et al.
2000; Zhu et al. 2000).
Gene expression is indirectly related to cellular protein abundance through the process of
translation. The cell connects mRNA expression and protein abundance through translational
control, which is primarily regulated at the initiation of translation (Lindahl & Hinnebusch 1992;
Jackson & Wickens 1997; Day & Tuite 1998; McCarthy 1998). Much of this control is the result of
multiple cis-acting elements in the mRNA (Jacobs Anderson & Parker 2000). There are large non-
coding regions in each mRNA species devoted to regulation of that mRNA as well as its stability
and degradation properties, including 5' and 3' UTRs, uORFs and uAUGs (Vilela et al. 1998;
Vilela et al. 1999; Morris & Geballe 2000).
Previously, we surveyed the population of protein features -- such as folds, amino acid
composition, and functions -- in yeast and a number of other recently sequenced genomes
(Gerstein 1997; Gerstein 1998; Gerstein 1998; Gerstein & Hegyi 1998; Hegyi & Gerstein 1999;
Das & Gerstein 2000; Lin & Gerstein 2000). Others have also done related work (Frishman &
Mewes 1997; Tatusov et al. 1997; Jones 1998; Wallin & von Heijne 1998; Frishman & Mewes
1999; Wolf et al. 1999). Recently, we extended this concept to compare the population of features