Year 4 Report

A. Overview of Work Completed

As I mapped out in my proposal, most of my work in the fourth year continued from that of the previous year: applying a library of protein family templates to a number of genomes, developing new methods of predicting protein attributes (e.g., function) based on expression data, constructing an appropriate database infrastructure to handle all of this, and scaling everything up to the human genome. Specific highlights of the year were: (i) detailed analysis of pseudogenes in the intergenic regions of the human genome in terms of protein families; (ii) comparison of the characteristics of human pseudogenes with those in other organisms in terms of protein families; and (iii) participation in more collaborations with experimental scientists in proteomics, such as K. Williams and M. Snyder at Yale.

B. Specific Highlights from the Past Year

I summarize below in some detail the specific highlights from the past year, connecting each to the relevant publication (listed in section 5 below).

i. Human Genome Annotation, focusing on Pseudogenes

In addition to analyzing the occurrence of folds and families within the "living" proteome, we can also use them to survey the "dead" pseudogenes and pseudogenic fragments in intergenic regions. During the past year, we have shifted our analysis almost entirely to the human genome. There we are focusing more on the analysis of pseudogenes than on the analysis of gene families, since only a small fraction of the human genome actually codes for proteins (in contrast to the situation in other organisms). We have surveyed the mitochondrial ribosomal protein pseudogenes in the human genome (Zhang & Gerstein, 2003). These are pseudogenes derived from the proteins that make up the mitochondrial ribosome, and we found many new ones. We have written a prominent overall survey on the relations of genes and genomes in human and yeast (Snyder & Gerstein, 2003). We have also mapped the SNPs onto human chromosomes 21 and 22 and examined how their occurrence relates to the occurrence of genes and pseudogenes (Balasubramanian et al., 2002).

Next, we compared the human genome with a number of other eukaryotic genomes. Part of this required us to find the pseudogenes in the fly (Harrison et al., 2003). (In previous years we had looked at the pseudogenes in the worm and yeast.) Based on all our pseudogene assignments, we did a comprehensive comparison of the amino acid and nucleotide composition of genes and pseudogenes in the human genome versus a number of other eukaryotic genomes (Echols et al., 2002). We also looked at the differential occurrence of small protein motifs in intergenic regions (Zhang et al., 2002). These motifs are ancient relics of proteins, derived from pseudogenes that have decayed beyond the point of being recognizable.
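As a rough illustration of what a composition comparison between genes and pseudogenes involves, the following minimal Python sketch computes per-nucleotide frequencies for two sets of sequences and their differences. It is not the procedure used in Echols et al. (2002); the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def composition(sequences, alphabet):
    """Return the fraction of each symbol in `alphabet` across all sequences.

    Symbols outside `alphabet` (e.g. ambiguity codes) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in alphabet}
    return {ch: counts[ch] / total for ch in alphabet}


def composition_difference(set_a, set_b, alphabet="ACGT"):
    """Per-symbol difference in composition (set_a minus set_b)."""
    comp_a = composition(set_a, alphabet)
    comp_b = composition(set_b, alphabet)
    return {ch: comp_a[ch] - comp_b[ch] for ch in alphabet}


# Toy sequences only (hypothetical, not real genes or pseudogenes).
genes = ["ATGGCC", "GGCCGC"]
pseudogenes = ["ATATAA", "TTAAGC"]
diff = composition_difference(genes, pseudogenes)
```

The same functions apply to amino acid composition by passing the 20-letter amino acid alphabet instead of "ACGT".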

ii. Analysis of Functional Genomics Data

During the past year, we have continued our functional genomics studies and done a number of analyses relating the amount of mRNA expression to protein abundance in the yeast genome (Greenbaum et al., 2002). Based on this work, we developed a mathematical formula relating these two quantities; this formula is most important for helping to interpret mass-spec data. We have also used the Keck funds to participate in some other functional genomics work related to yeast (Giaever et al., 2002).
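The report does not state the form of the formula relating mRNA expression to protein abundance. Purely as an illustration of how such a relationship might be fitted, the sketch below assumes a power-law form, log(protein) = a + b * log(mRNA), and estimates a and b by ordinary least squares. The power-law form, the function names, and the toy data are all assumptions and are not taken from Greenbaum et al. (2002).

```python
import math


def fit_power_law(mrna, protein):
    """Least-squares fit of log(protein) = a + b * log(mrna).

    Returns (a, b, r), where r is the Pearson correlation of the logs.
    All input values must be positive.
    """
    xs = [math.log(m) for m in mrna]
    ys = [math.log(p) for p in protein]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


def predict_protein(mrna_level, a, b):
    """Predicted protein abundance for a given mRNA level under the fit."""
    return math.exp(a + b * math.log(mrna_level))


# Toy data generated from protein = 10 * mrna ** 0.5 (hypothetical values).
mrna = [1.0, 2.0, 4.0, 8.0]
protein = [10.0 * m ** 0.5 for m in mrna]
a, b, r = fit_power_law(mrna, protein)
```

With real measurements the fit would be noisy, and the correlation r gives a first indication of how well mRNA levels predict protein abundance under this assumed form.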

iii. Classification of Protein Folds and Families

Finally, we have done a small amount of work developing and enlarging the structural and family classification work that we described as an initial part of the Keck grant. This work, which studies structural flexibility in a database framework, has just been published (Krebs et al., 2002).

iv. Collaborations on Experimental Proteomics

Collaborations with experimentalists are taking place through a number of centers.

CEGS. Our pseudogene work, coupled with these collaborations, has led to our inclusion in an NIH Center of Excellence in Genomic Science (CEGS), which is focused on constructing large microarrays. In particular, our pseudogene work forms a valuable backdrop to experimental work aimed at accurately identifying genes and annotating genomes, as well as probing the sequence characteristics of intergenic regions.

NESG. As was the case last year, we have done a variety of computational analyses designed to interface with experimental structural genomics. This work is the direct experimental complement of the computational work proposed in the grant. Our efforts are part of the Northeast Structural Genomics Consortium (NESG). Continuing last year's efforts, we have designed approaches to pick targets prospectively for subsequent structural analysis and then to do retrospective data mining on the results.

Mass-Spec. Finally, our work relating mRNA expression and protein abundance has been an integral part of the NHLBI/Yale Proteomics Center, which focuses on mass-spec and has just been funded and established. As part of the work associated with this center, we have been collaborating with B. Konigsberg on developing standardized vocabularies for describing proteins.