YEAR 3 Report

A. Overview of Work Completed

As I mapped out in my proposal, most of my work in the third year continued from that of the previous year -- applying my library of structural and functional templates to a moderate number of genomes, developing new methods of predicting protein attributes (e.g., function) based on expression data, and constructing an appropriate database infrastructure to handle all of this. Highlights of the year were: (i) extending the survey of pseudogenes in the worm to those in many eukaryotic genomes, including the human; (ii) participating in more collaborations with experimental scientists in proteomics, such as A. Edwards and C. Arrowsmith in Toronto and S. Weissmann and M. Snyder at Yale; and (iii) publishing a paper identifying the characteristics of unique folds in pathogen genomes. This last result is a first step in identifying families and folds in microbial pathogens that are not in the human genome, and it holds great promise for suggesting drug targets. I have graphically summarized some aspects of my current work in Exhibit 1. (I will refer to subparts of this figure below.)

We have so far focused our work on a standard dataset of 20 genomes: 18 prokaryotes plus yeast and worm. We are now working on fitting the human genome into this framework (see future plans). Fully integrating the human genome into comparisons with microbial genomes is essential for defining what is unique in humans with respect to microbes.

B. Specific Highlights from the Past Year

I summarize below in some detail the specific highlights from the past year, connecting each to the relevant publication (listed in section 5 below).

i. Analysis of Expression Data in Relation to Protein Interactions

One of the aims of the grant is to develop new approaches to predicting protein properties, such as structure, function, interactions, and localization, on a genome-wide scale. We have pursued this aim using gene expression data. A new approach for getting at protein function is clustering gene-expression data from microarrays -- genes that cluster together may be functionally related [Lian et al. 2001; Greenbaum et al., in press]. We have developed a new method of clustering expression data that finds many time-shifted and inverted relationships in addition to the simultaneous relationships found in other studies, and we have developed a way of quantifying how well a given expression clustering predicts protein functional role or protein-protein interactions [77, 78, 91, 95, Exhibit 1-9]. Overall, we find that while expression clustering identifies many new and suggestive functional relationships, it is not strongly predictive in a global sense.
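To make the notions of "time-shifted" and "inverted" relationships concrete, here is a minimal sketch. It is a simplified stand-in and not the published clustering method itself: it scans a few integer time shifts and reports the Pearson correlation with the largest magnitude. The function name `best_shifted_correlation`, the parameter `max_shift`, and the toy profiles are all hypothetical, chosen only for illustration.

```python
import numpy as np

def best_shifted_correlation(x, y, max_shift=2):
    """Return (r, shift) with the largest |r| over integer time shifts.

    A positive shift means y lags x by that many time points.
    A negative r flags an inverted relationship.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_r, best_shift = 0.0, 0
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            # Pair x[t] with y[t + shift].
            a, b = x[:len(x) - shift], y[shift:]
        else:
            # Pair x[t - shift] with y[t] (x lags y).
            a, b = x[-shift:], y[:len(y) + shift]
        r = np.corrcoef(a, b)[0, 1]
        if abs(r) > abs(best_r):
            best_r, best_shift = r, shift
    return best_r, best_shift

# Made-up expression profiles (not real measurements).
gene_a = np.array([0.1, 0.9, 0.4, 1.6, 0.3, 1.1, 2.0, 0.2, 0.7, 1.5])
gene_b = np.array([0.5, 0.1, 0.9, 0.4, 1.6, 0.3, 1.1, 2.0, 0.2, 0.7])  # gene_a delayed by one point
gene_c = -gene_a                                                       # gene_a inverted

r_shifted, lag_shifted = best_shifted_correlation(gene_a, gene_b)
r_inverted, lag_inverted = best_shifted_correlation(gene_a, gene_c)
```

A purely simultaneous comparison (shift 0 only, positive r only) would miss both of these toy relationships, which is the motivation for searching over shifts and signs.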

ii. Genome-wide Characterization of Protein Function in Microbes

Following on our analysis of the relation between expression and protein-protein interactions, we explored a number of approaches to large-scale characterization of protein function, clearly one of the major goals of genome analysis. In addition to microarrays, many other types of functional genomics experiments have recently appeared. Integrating many experiments together with "traditional" sequence information (e.g., motifs, composition, and database matches) clearly should (and indeed does) give better functional predictions, and we believe one of the most important uses for proteins and protein families is as scaffolds for achieving large-scale integration [79, 82, 89, 93]. In particular, we are collaborating with Professor Michael Snyder on a large-scale functional analysis of the yeast genome. Professor Snyder has developed a system for experimentally assessing the functions of all yeast genes with protein chips. We have helped interpret these experiments computationally, developing a database for the results and clustering them [86].

iii. Identification of Unique Folds and Families in Microbial Genomes

As more genomes are sequenced and structures determined, it has become increasingly possible to characterize a substantial fraction of the folds used in a given organism -- statistically, in the sense of a population census. This allows us to see whether particular folds are more common in certain organisms than in others. In the late 1990s, we were the first laboratory to address questions of this sort, performing comparisons of genomes in terms of folds; this was part of our original Keck proposal. This year, on a data set of 20 genomes, we have found that a number of folds, such as TIM-barrels, occur in every (analyzed) genome, while other folds are missing from certain genomes [102]. Furthermore, we have identified characteristics of the unique folds that tend to occur in pathogen genomes -- they tend to be non-symmetrical.

While we found that the specific folds that are most common often differed between genomes, in all cases the occurrence of folds (and many other aspects of genomic biology) tends to follow power-law statistics, with a few common folds and many rare ones. We have proposed a simple evolutionary model that naturally gives rise to these statistics [92, Exhibit 1-5].
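As an illustration of how a simple growth process can produce "a few common ones and many rare ones", the sketch below simulates a toy genome that grows by random gene duplication with occasional acquisition of a new fold. This is an assumed, generic duplication-plus-innovation process and should not be read as the exact model of [92]; the function names, the parameter `p_new`, and the default values are hypothetical.

```python
import random
from collections import Counter

def simulate_fold_occurrence(steps=20000, p_new=0.05, seed=0):
    """Grow a toy genome by duplication and occasional innovation.

    At each step, with probability p_new a gene with a brand-new fold is
    added; otherwise a randomly chosen existing gene is duplicated, so folds
    that are already common are more likely to gain another copy.
    Returns a list in which entry i is the number of genes with fold i.
    """
    rng = random.Random(seed)
    genes = [0]      # fold label of every gene; start with a single gene
    n_folds = 1
    for _ in range(steps):
        if rng.random() < p_new:
            genes.append(n_folds)             # innovation: a new fold appears
            n_folds += 1
        else:
            genes.append(rng.choice(genes))   # duplication of an existing gene
    counts = Counter(genes)
    return [counts[f] for f in range(n_folds)]

def occurrence_histogram(fold_counts):
    """Map occurrence F -> number of folds that occur exactly F times."""
    return dict(sorted(Counter(fold_counts).items()))

fold_counts = simulate_fold_occurrence()
histogram = occurrence_histogram(fold_counts)
```

Plotting `histogram` on log-log axes would be the natural way to check for an approximately straight line, the usual signature of power-law behavior; the parameter values here are arbitrary and were not fitted to any genome.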

iv. Pseudogenes

In addition to analyzing the occurrence of folds and families within the "living" proteome, we can also use them to survey the "dead" pseudogenes and pseudogenic fragments in intergenic regions. We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, which we described last year for the worm. This year, we have done subsequent surveys of the yeast and human genomes [97, 98, Exhibit 1-7]. Collectively, these allow us to determine the common "pseudofolds" and "pseudofamilies" in various genomes and to address important evolutionary questions about the types of proteins that were present in the past history of an organism. In particular, we have found that duplicated pseudogenes tend to have a very different distribution than one would expect if they were randomly derived from the population of genes in the genome. They tend to lie near the ends of chromosomes and have a composition intermediate between that of genes and intergenic DNA. Most importantly, pseudogenes tend to have environmental-response functions. This may be related to their being resurrectable protein parts, and we propose a potential mechanism for achieving this in yeast [99]. Processed pseudogenes, which are common in the human genome, have a very different character. They appear to be randomly inserted from the mRNA pool and, hence, show an obvious relationship to mRNA level and intergenic region size.

v. Collaborations on Experimental Proteomics

Our pseudogene work forms a valuable backdrop to experimental work aimed at accurately identifying genes and annotating genomes as well as probing the sequence characteristics of intergenic regions [94, 101]. Our pseudogene work, coupled with these collaborations, has led to our inclusion in another proteomics center, an NIH Center of Excellence in Genomic Science (CEGS), which is focused on constructing large microarrays.

Furthermore, as was the case last year, we have done a variety of computational analyses designed to interface with experimental structural genomics. This work is the direct experimental complement of the computational work proposed in the grant. Our efforts are part of the Northeast Structural Genomics Consortium (NESG). Continuing efforts from last year, we have designed approaches to pick targets prospectively for subsequent structural analysis and then to do retrospective datamining of the results. In particular, we have collaborated with the Ontario Proteomics group led by Dr. C. Arrowsmith and Dr. A. Edwards and with Dr. G. Montelione at Rutgers on analyzing large-scale structural genomics data [76]. Our analysis consisted of building decision trees to help predict which proteins would perform well in high-throughput protein purification.
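To illustrate the basic step in building such a decision tree, the sketch below finds the single best split (a depth-one tree, or "stump") by minimizing weighted Gini impurity. It is an assumption-laden illustration and not the analysis of [76]: the feature names `length` and `hydrophobic_fraction`, the numeric values, and the purification labels are all made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity.

    rows is a list of dicts mapping feature name -> numeric value.
    Samples with value <= threshold go left, the rest go right.
    Returns (feature, threshold, impurity).
    """
    best = (None, None, gini(labels))
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if impurity < best[2]:
                best = (feature, threshold, impurity)
    return best

# Hypothetical toy data: True means the protein purified well.
proteins = [
    {"length": 120, "hydrophobic_fraction": 0.30},
    {"length": 450, "hydrophobic_fraction": 0.28},
    {"length": 200, "hydrophobic_fraction": 0.33},
    {"length": 380, "hydrophobic_fraction": 0.52},
    {"length": 150, "hydrophobic_fraction": 0.55},
    {"length": 310, "hydrophobic_fraction": 0.60},
]
purified_well = [True, True, True, False, False, False]

feature, threshold, impurity = best_split(proteins, purified_well)
```

A full decision tree would apply `best_split` recursively to the left and right subsets until a stopping criterion is met; the appeal for target selection is that each resulting rule can be read directly as a threshold on a sequence-derived property.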

C. Publications

All reference numbers refer to my sorted list of papers, available via http://bioinfo.mbb.yale.edu/papers (in particular, the sublink http://bioinfo.mbb.yale.edu/papers/papers-research.shtml ).