Year 5 Report

Interim Report - Distinguished Young Scholars in Medical Research

1. Narrative Description of Progress since Last Interim Report

A. Overview of Work Completed

As I mapped out in my original proposal, most of my work in the fifth year built upon and advanced research from the previous years -- applying a library of protein family templates to a number of genomes, developing new methods of predicting protein attributes (e.g., function) based on expression data, constructing an appropriate database infrastructure to handle all of this, and scaling everything up to the human genome. Specific highlights of the year were: (i) detailed analysis of pseudogenes in the intergenic regions in the human genome in terms of protein families; (ii) predicting protein interactions on a genomic scale; and (iii) participating in more collaborations with experimental scientists in proteomics, many of whom are other Keck scholars.

B. Specific Highlights from the Past Year

I summarize below in some detail the specific highlights from the past year, connecting each to the relevant publication (listed in section 5 below). (References to publications with reprints enclosed are followed by a "*".)

i. Human Genome Annotation, Focusing on Pseudogenes

I have reached the culmination of my project on annotating pseudogenes in the human and other mammalian genomes. We've published two papers that lay out most of the pseudogenes in the human genome and have compared these to the mouse genome (Zhang et al., 2003*; Zhang et al., 2004). We've also published a careful review comparing our work to that of some other groups (Zhang & Gerstein, in press). We're currently undertaking further refinement of our pseudogene calculations. The analysis of pseudogenes in the human genome should enable us to better identify genes in the genome and has very important evolutionary implications. That is, the history of our past genes is contained in the human genome and is as important as a list of current genes for piecing together the evolutionary puzzle.

We also carried out a comprehensive pseudogene assignment procedure for all the prokaryotes (Harrison et al., 2003*). This provides an ideal comparison with the human genome, particularly in relation to identifying protein features unique to prokaryotes, which may be useful in developing antimicrobials.

ii. Predicting Protein Function on a Genomic Scale

We have continued our work predicting protein function on a genomic scale.  This year we published a very important paper in Science (Jansen et al., 2003*) where we predicted many protein-protein interactions for yeast. Then, in an experimental collaboration with our colleagues in Toronto (loosely associated with the NESG consortium, see below) and also with Mike Snyder, the Lewis B. Cullman Professor of Molecular, Cellular and Developmental Biology at Yale, we verified many of these interactions experimentally.  
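The title of the Science paper cited above refers to a Bayesian networks approach. As a rough illustration of the general idea only, and not the published implementation, the sketch below combines independent binary evidence sources with a naive Bayes likelihood-ratio calculation. The feature names, the conditional probabilities, the prior odds, and the cutoff are all hypothetical values chosen for the example.

```python
from math import prod

# Hypothetical numbers, for illustration only. For each evidence type:
# (P(evidence present | pair interacts), P(evidence present | pair does not interact))
EVIDENCE_MODEL = {
    "coexpressed": (0.60, 0.20),
    "same_function": (0.50, 0.10),
    "both_essential": (0.40, 0.20),
}


def likelihood_ratio(feature, present):
    """Likelihood ratio contributed by one binary evidence source."""
    p_pos, p_neg = EVIDENCE_MODEL[feature]
    if present:
        return p_pos / p_neg
    return (1.0 - p_pos) / (1.0 - p_neg)


def posterior_odds(observations, prior_odds):
    """Naive Bayes: multiply the prior odds by each independent likelihood ratio."""
    combined = prod(likelihood_ratio(f, v) for f, v in observations.items())
    return prior_odds * combined


def predict_interaction(observations, prior_odds=0.05, cutoff=1.0):
    """Call a protein pair 'interacting' when the posterior odds exceed the cutoff."""
    return posterior_odds(observations, prior_odds) > cutoff


if __name__ == "__main__":
    pair = {"coexpressed": True, "same_function": True, "both_essential": False}
    print(posterior_odds(pair, prior_odds=0.05), predict_interaction(pair))
```

The naive Bayes form assumes the evidence sources are conditionally independent, which is why the likelihood ratios can simply be multiplied; correlated evidence would need a fuller network structure.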

We have also investigated shifts in the function of proteins through the calculation of an active site conservation ratio (Das & Gerstein, 2004).
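To make the notion of a conservation ratio concrete, here is a minimal sketch under an assumed definition: the mean per-column conservation at annotated active-site positions divided by the mean conservation over all alignment columns. This definition, the function names, and the simple "fraction of the most common residue" conservation score are assumptions made for illustration; the exact formulation in Das & Gerstein (2004) may differ.

```python
from collections import Counter


def column_conservation(alignment, col):
    """Fraction of non-gap residues in a column that match the most common residue."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    if not residues:
        return 0.0
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)


def active_site_conservation_ratio(alignment, active_site_cols):
    """Assumed definition: mean active-site conservation / mean overall conservation."""
    if not alignment or not active_site_cols:
        raise ValueError("need a non-empty alignment and at least one active-site column")
    n_cols = len(alignment[0])
    overall = sum(column_conservation(alignment, c) for c in range(n_cols)) / n_cols
    if overall == 0:
        raise ValueError("alignment contains only gaps")
    site = sum(column_conservation(alignment, c) for c in active_site_cols) / len(active_site_cols)
    return site / overall


if __name__ == "__main__":
    # Toy alignment of equal-length sequences; column 0 plays the role of the active site.
    toy = ["ACDE", "ACDF", "AGHK", "ACHW"]
    print(active_site_conservation_ratio(toy, [0]))
```

Under this assumed definition, a ratio well above 1 indicates active-site positions that are more conserved than the protein as a whole, while a family member whose ratio drops toward or below 1 would be a candidate for a functional shift.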

iii. Conceptual Work on Biological Databases

As part of our protein function and human genome annotation work, we've done a significant number of conceptual pieces on biological database development.  We have proposed a number of somewhat controversial ideas about merging the literature and protein databases and have explored various social and legal implications of these (Greenbaum et al., in press; Greenbaum & Gerstein, 2003*).  The only grant support for this novel approach comes from the W.M. Keck Foundation.

iv. Collaborations on Experimental Proteomics

NBC. This year we began a significant new collaboration, again in the consortium framework. To the groups described below we have added the NBC, the Northeast Biodefense Center, which involves Columbia, Yale, and a number of other northeastern universities. The Gerstein lab contribution is part of the informatics core. We proposed to develop a tool to help with the rapid annotation and characterization of a newly sequenced microbial pathogen. Our tool will comprise a number of modules, including large-scale rapid similarity comparisons, structure and localization prediction, rapid information retrieval, and prediction of interactions.

Furthermore, our collaborations with experimentalists are continuing through a number of other centers, as they were last year.

CEGS. Our pseudogene research is undertaken as part of an NIH Center of Excellence in Genomic Science (CEGS), which is focused on constructing large human microarrays. In particular, our pseudogene work forms a valuable backdrop to experimental work aimed at accurately identifying genes and annotating genomes as well as probing the sequence characteristics of intergenic regions. This year we were awarded a new ENCODE grant from the NIH on annotating a specific region in the human genome (the ENCODE region). This grant further bolsters our work on human arrays and adds to the overall CEGS effort.

NESG. As was the case last year, we have done a variety of computational analyses designed to interface with experimental structural genomics. This is the direct experimental complement of the computational analysis proposed in the grant. Our efforts are part of the Northeast Structural Genomics Consortium (NESG). Continuing last year’s efforts, we have designed approaches to pick targets prospectively for subsequent structural analysis, followed by retrospective data mining on the results.

Mass-Spec. Finally, our work relating mRNA expression levels to protein abundance has been an integral part of the NHLBI/Yale Proteomics Center, which focuses on proteomics and mass spectrometry. As part of the work associated with this center, we have been collaborating on this question with Ken Williams, Director of the Keck Foundation Biotechnology Resource Laboratory at Yale.

v. Collaboration with Other Keck Scholars

One of the exciting developments of the past year has been the collaborations that have sprouted with other Keck Distinguished Young Scholars from several different classes.  These collaborations have essentially developed through interactions promoted by the Keck symposium.  Specifically, these collaborations are:

Kevin White, Yale University (2003 awardee). We are working with him on identifying primers in the human genome for microarrays and we are also interacting a bit on the construction of genetic networks.

John Moran, University of Michigan (2000 awardee). We have decided to co-organize a FASEB meeting, which has recently been approved, on retro-elements and pseudogenes in eukaryotic genomes.

Judith Frydman, Stanford University (1999 awardee). Our collaboration focuses on determining whether yeast proteins that are substrates for the chaperonin TRiC have different proteomic properties from yeast proteins in general. We have found some interesting indications that certain folds, such as the WD40, tend to be preferred by the chaperonin, and we are investigating other properties such as function and contact order.

Selected Publications

Below are listed all of my papers published during the past two years. I have made a website of all relevant Keck Foundation publications and URLs: http://papers.gersteinlab.org/papers/grant/keck.

In Press
"An analysis of the present system of scientific publishing: what’s wrong and where to go from here." D Greenbaum, J Lim, M Gerstein. Interdiscip Sci Rev (in press). {keck}
"Genomic analysis of essentiality within protein networks." H Yu, D Greenbaum, H Lu, X Zhu, M Gerstein. Trends in Genetics (in press).
"Analyzing Cellular Biochemistry in Terms of Molecular Networks." Y Xia, H Yu, R Jansen, M Seringhaus, S Baxter, D Greenbaum, H Zhao, M Gerstein. Annual Review of Biochemistry (in press).
"Large-scale Analysis of Pseudogenes in the Human Genome." Z Zhang, M Gerstein. Current Opinion in Genetics & Development (in press). {keck}
"Annotation transfer for genomics: assessing the transferability of protein-protein and protein-DNA interactions between organisms." H Yu, N Luscombe, H Lu, X Zhu, Y Xia, J Han, N Bertin, S Chung, C Goh, M Vidal, M Gerstein. Genome Research (in press).
"An XML-Based Approach to Integrating Heterogeneous Yeast Genome Data." KH Cheung, D Pan, A Smith, M Seringhaus, SM Douglas, M Gerstein. 2004 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (in press).
-- 2004 --
LL Freeman-Cook, AM Dixon, JB Frank, Y Xia, L Ely, M Gerstein, DM Engelman, D DiMaio (2004). "Selection and Characterization of Small Random Transmembrane Proteins that Bind and Activate the Platelet-derived Growth Factor beta Receptor." J Mol Biol 338: 907-20.

CS Goh, D Milburn, M Gerstein (2004). "Conformational changes associated with protein-protein interactions." Curr Opin Struct Biol 14: 104-9.

G Euskirchen, TE Royce, P Bertone, R Martone, JL Rinn, FK Nelson, F Sayward, NM Luscombe, P Miller, M Gerstein, S Weissman, M Snyder (2004). "CREB binds to multiple loci on human chromosome 22." Mol Cell Biol 24: 3804-14.

R Das, M Gerstein (2004). "A method using active-site sequence conservation to find functional shifts in protein families: application to the enzymes of central metabolism, leading to the identification of an anomalous isocitrate dehydrogenase in pathogens." Proteins 55: 455-63. {keck}

M Gerstein, N Echols (2004). "Exploring the range of protein flexibility, from a structural proteomics perspective." Curr Opin Chem Biol 8: 14-9.

Y Liu, M Gerstein, DM Engelman (2004). "Transmembrane protein domains rarely use covalent domain recombination as an evolutionary mechanism." Proc Natl Acad Sci U S A 101: 3495-7.

Z Zhang, N Carriero, M Gerstein (2004). "Comparative analysis of processed pseudogenes in the mouse and human genomes." Trends Genet 20: 62-7. {keck}

CS Goh, N Lan, SM Douglas, B Wu, N Echols, A Smith, D Milburn, GT Montelione, H Zhao, M Gerstein (2004). "Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis." J Mol Biol 336: 115-30.

H Yu, X Zhu, D Greenbaum, J Karro, M Gerstein (2004). "TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics." Nucleic Acids Res 32: 328-37.

V Alexandrov, M Gerstein (2004). "Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures." BMC Bioinformatics 5: 2.

S Li, CM Armstrong, N Bertin, H Ge, S Milstein, M Boxem, PO Vidalain, JD Han, A Chesneau, T Hao, DS Goldberg, N Li, M Martinez, JF Rual, P Lamesch, L Xu, M Tewari, SL Wong, LV Zhang, GF Berriz, L Jacotot, P Vaglio, J Reboul, T Hirozane-Kishikawa, Q Li, HW Gabel, A Elewa, B Baumgartner, DJ Rose, H Yu, S Bosak, R Sequerra, A Fraser, SE Mango, WM Saxton, S Strome, S Van Den Heuvel, F Piano, J Vandenhaute, C Sardet, M Gerstein, L Doucette-Stamm, KC Gunsalus, JW Harper, ME Cusick, FP Roth, DE Hill, M Vidal (2004). "A map of the interactome network of the metazoan C. elegans." Science 303: 540-3.
-- 2003 --
WG Krebs, J Tsai, V Alexandrov, J Junker, R Jansen, M Gerstein (2003). "Tools and databases to analyze protein flexibility; approaches to mapping implied features onto sequences." Methods Enzymol 374: 544-84. {keck}

Y Kluger, H Yu, J Qian, M Gerstein (2003). "Relationship between gene co-expression and probe localization on microarray slides." BMC Genomics 4: 49.

Z Zhang, PM Harrison, Y Liu, M Gerstein (2003). "Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome." Genome Res 13: 2541-58. {keck reprint}

Y Jiao, H Yang, L Ma, N Sun, H Yu, T Liu, Y Gao, H Gu, Z Chen, M Wada, M Gerstein, H Zhao, LJ Qu, XW Deng (2003). "A genome-wide analysis of blue-light regulation of Arabidopsis transcription factor gene expression during seedling development." Plant Physiol 133: 1480-93. {keck}

Z Zhang, M Gerstein (2003). "Reconstructing genetic networks in yeast." Nat Biotechnol 21: 1295-7.

PM Harrison, N Carriero, Y Liu, M Gerstein (2003). "A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs." J Mol Biol 333: 885-92. {keck reprint}

R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder, JF Greenblatt, M Gerstein (2003). "A Bayesian networks approach for predicting protein-protein interactions from genomic data." Science 302: 449-53. {keck reprint}

J Qian, J Lin, NM Luscombe, H Yu, M Gerstein (2003). "Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data." Bioinformatics 19: 1917-26.

R Martone, G Euskirchen, P Bertone, S Hartman, TE Royce, NM Luscombe, JL Rinn, FK Nelson, P Miller, M Gerstein, S Weissman, M Snyder (2003). "Distribution of NF-kappaB-binding sites across human chromosome 22." Proc Natl Acad Sci U S A 100: 12247-52.

Z Zhang, M Gerstein (2003). "Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes." Nucleic Acids Res 31: 5338-48.

D Greenbaum, C Colangelo, K Williams, M Gerstein (2003). "Comparing protein abundance and mRNA expression levels on a genomic scale." Genome Biol 4: 117.

D Greenbaum, M Gerstein (2003). "A universal legal framework as a prerequisite for database interoperability." Nat Biotechnol 21: 979-82. {keck reprint}

Z Zhang, M Gerstein (2003). "The human genome has 49 cytochrome c pseudogenes, including a relic of a primordial gene that still functions in mouse." Gene 312: 61-72.

H Yu, NM Luscombe, J Qian, M Gerstein (2003). "Genomic analysis of gene expression relationships in transcriptional regulatory networks." Trends Genet 19: 422-7.

J Qian, Y Kluger, H Yu, M Gerstein (2003). "Identification and correction of spurious spatial correlations in microarray data." Biotechniques 35: 42-4, 46, 48.

M Gerstein, JM Thornton (2003). "Sequences and topology." Curr Opin Struct Biol 13: 341-3.

NM Luscombe, TE Royce, P Bertone, N Echols, CE Horak, JT Chang, M Snyder, M Gerstein (2003). "ExpressYourself: A modular platform for processing and visualizing microarray data." Nucleic Acids Res 31: 3477-82.

Z Zhang, M Gerstein (2003). "Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements." J Biol 2: 11.

PM Harrison, M Gerstein (2003). "A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes." Genome Biol 4: R40.

MS Kimber, F Vallee, S Houston, A Necakov, T Skarina, E Evdokimova, S Beasley, D Christendat, A Savchenko, CH Arrowsmith, M Vedadi, M Gerstein, AM Edwards (2003). "Data mining crystallization databases: knowledge-based approaches to optimize protein crystal screens." Proteins 51: 562-8.

CS Goh, N Lan, N Echols, SM Douglas, D Milburn, P Bertone, R Xiao, LC Ma, D Zheng, Z Wunderlich, T Acton, GT Montelione, M Gerstein (2003). "SPINE 2: a system for collaborative structural proteomics within a federated database framework." Nucleic Acids Res 31: 2833-8.

Z Zhang, M Gerstein (2003). "Identification and characterization of over 100 mitochondrial ribosomal protein pseudogenes in the human genome." Genomics 81: 468-80.

M Snyder, M Gerstein (2003). "Genomics. Defining genes in the genomics era." Science 300: 258-60.

R Jansen, HJ Bussemaker, M Gerstein (2003). "Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models." Nucleic Acids Res 31: 2242-51.

Y Kluger, R Basri, JT Chang, M Gerstein (2003). "Spectral biclustering of microarray data: coclustering genes and conditions." Genome Res 13: 703-16.

M Gerstein, A Edwards, CH Arrowsmith, GT Montelione (2003). "Structural genomics: current progress." Science 299: 1663.

JL Rinn, G Euskirchen, P Bertone, R Martone, NM Luscombe, S Hartman, PM Harrison, FK Nelson, P Miller, M Gerstein, S Weissman, M Snyder (2003). "The transcriptional activity of human Chromosome 22." Genes Dev 17: 529-40.

PM Harrison, D Milburn, Z Zhang, P Bertone, M Gerstein (2003). "Identification of pseudogenes in the Drosophila melanogaster genome." Nucleic Acids Res 31: 1033-7.

A Savchenko, A Yee, A Khachatryan, T Skarina, E Evdokimova, M Pavlova, A Semesi, J Northey, S Beasley, N Lan, R Das, M Gerstein, CH Arrowsmith, AM Edwards (2003). "Strategies for structural proteomics of prokaryotes: Quantifying the advantages of studying orthologous proteins and of using both NMR and X-ray crystallography approaches." Proteins 50: 392-9.

N Lan, GT Montelione, M Gerstein (2003). "Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level." Curr Opin Chem Biol 7: 44-54.

N Echols, D Milburn, M Gerstein (2003). "MolMovDB: analysis and visualization of conformational change and structural flexibility." Nucleic Acids Res 31: 478-82.

Presentations - Invited Lectures (since last report)

06/19/03	Albany, NY		Albany 2003: 13th Conversation 	   
07/10/03	Washington, DC		NIH Data Management Workshop for Structural Genomics	   
07/16/03	Meriden, NH		Enzymes Gordon Conf.
08/01/03	Upton, NY		Bluegene2003	   
09/08/03	Denver, CO		U of Colorado Health Sciences 	   
09/22/03	Washington, DC		NIH Human Base Workshop	   
10/09/03	Boston, MA		TIGR Computational Genomics Meeting	   
10/12/03	Munich, Germany		GCB'03	   
10/21/03	Pittsburgh, PA		ACS Regional Meeting	   
10/30/03	New Brunswick, NJ	DIMACS Data Mining Workshop	   
01/12/04	Boston, MA		Northeastern U (biochemistry)	   
02/09/04	New York, NY		Hunter College (biology)	   
02/16/04	New York, NY		Columbia U. (C2B2)	   
03/05/04	Toronto, Canada		U Toronto (Banting Inst.)
03/08/04	Philadelphia, PA	SRI Protein Interactions Conference
04/17/04	Washington, DC		Experimental Biology 2004	   
04/17/04	Lawrenceville, NJ	American Mathematical Society	   
04/30/04	Boston, MA		MIT (CSBi series)	   
05/07/04	Washington, DC		NIH (protig series)	   
05/15/04	Montreal, Canada	Proteomics Initiative (CPI-2004)