1.            Narrative Description of Progress since Last Report

A.           Overview

 

The lab's research has progressed in all of the areas described last year.  Broadly, we have developed a number of methods for designing microarrays (tiling arrays) to elucidate the relevance of intergenic regions and thus refine the annotation of the human genome.  In addition, we have used this kind of analysis to further our understanding of pseudogenes, which make up a substantial part of the intergenic regions.  As an extension of microarray expression analysis and ChIP-chip experiments, we have developed new methods for integrating data from varied experiments to obtain a more network-oriented view of the molecules in the cell.  We have also critically evaluated the power of genomic feature integration for predicting protein-protein interactions, and thus protein function, on a genomic scale.

B.           Significant highlights from the past year

 

Here I am including a brief summary of the most exciting research developments in my laboratory over the last year.

 

i.               Human genome annotation, focusing on intergenic regions

 

We are part of the ENCODE Consortium, whose collective aim is to identify the functional elements in 1% of the human genome. To this end, we are developing and analyzing chip-based transcriptome experiments and carrying out detailed analyses of the ENCODE regions, which comprise about 30 Mb of human genome sequence selected according to several criteria. We are also carefully mapping both processed and unprocessed pseudogenes in the ENCODE regions in order to annotate them thoroughly. The project is described in the Consortium paper published in Science in 2004.

 

On a broader level, we have identified transcriptionally active regions across the entire human genome using tiling arrays that represent the sense and anti-sense strands of all of its non-repetitive sequence. This work was published in Science (Bertone et al., 2004). In addition to known protein-coding regions, we found about 10,500 novel transcribed sequences. These results provide a draft expression map for the entire human genome. Interestingly, a large number of these sequences lie in intergenic regions, and many of them are conserved among several mammalian species. This suggests that such regions of intergenic DNA are likely to have specific cellular roles, perhaps of regulatory or structural importance.

 

We have recently published a detailed analysis of pseudogenes on chromosome 22, including evidence that some of them are transcribed (Zheng et al., 2005). We integrated the pseudogene annotations published by various research groups into a unified list of 525 pseudogenes on chromosome 22.  We then combined this list with the global transcription analysis of the human genome described in the previous paragraph.  About one fifth of the pseudogenes on chromosome 22 appear to be transcribed, again hinting at hitherto unknown functions for pseudogenes.

 

Furthermore, we have computationally assessed processed pseudogenes throughout the genome for evidence of transcription. Based on intersection with the global tiling array data and with expressed sequence tags, we find that about 5% of processed pseudogenes are indeed transcribed. This work was published recently in Nucleic Acids Research (Harrison et al., 2005). While pseudogenes have been widely regarded as non-functional, these two latest studies suggest that, at least in some cases, this may not hold true. This is an exciting new era for the study of pseudogenes.
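
As a rough illustration of the intersection step described above, the following Python sketch flags a pseudogene as having transcription evidence when it overlaps at least one transcribed region on the same chromosome. The names, the coordinates, and the simple any-overlap criterion are all hypothetical; they are assumptions made for illustration and are not the criteria used in the published analyses.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def transcribed_pseudogenes(pseudogenes, transcribed_regions):
    """Return names of pseudogenes that overlap any transcribed region.

    pseudogenes:         list of (name, chrom, start, end)
    transcribed_regions: list of (chrom, start, end)
    """
    # Group transcribed regions by chromosome so that each pseudogene is
    # only compared with regions on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in transcribed_regions:
        by_chrom[chrom].append((start, end))

    hits = []
    for name, chrom, start, end in pseudogenes:
        if any(overlaps(start, end, s, e) for s, e in by_chrom[chrom]):
            hits.append(name)
    return hits


# Hypothetical toy data (not real annotations).
pseudogenes = [
    ("psg1", "chr22", 100, 200),
    ("psg2", "chr22", 500, 600),
    ("psg3", "chr21", 100, 200),
]
regions = [("chr22", 150, 180), ("chr22", 700, 800), ("chr21", 300, 400)]

hits = transcribed_pseudogenes(pseudogenes, regions)
fraction = len(hits) / len(pseudogenes)
print(hits, fraction)
```

A genome-scale version would presumably sort the regions or use an interval index instead of the linear scan shown here.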

 

ii.              Biological networks and network analysis

 

Molecular network analysis is at the core of current functional genomics research.  Integrating several different kinds of high-throughput data, such as mRNA expression profiles, ChIP-chip data, and protein-protein interaction data, is essential for understanding biological processes, since all of them are interconnected.  We have made significant progress in developing a methodology for network analysis, and we briefly summarize the results of two important papers below. We have also written an extensive review on biological networks and their analysis (Xia et al., 2004).

 

Transcriptional regulatory networks play a central role in directing gene expression changes in response to internal and external stimuli. Biological networks exhibit complex dynamic behavior, thereby enabling cells to react to varied conditions. We examined the dynamics of the regulatory system in yeast on a genomic scale by integrating gene expression data for five cellular conditions with known transcriptional regulatory relationships (Luscombe et al., 2004). This work was published in Nature. By integrating the condition-independent regulatory relationships with condition-specific data (e.g. microarray data from different conditions), we developed a trace-back algorithm to uncover the subnetworks that are active under specific conditions. To compare these condition-specific subnetworks rigorously, we developed SANDY (Statistical Analysis of Network Dynamics). We show that different sets of transcription factors become key regulatory hubs at different times, portraying a network that shifts its weight between different foci to bring about distinct cellular states. Because removing a highly connected transcription factor from the system tends to be lethal, the transient nature of hubs has implications for the condition-dependent lethality of these transcription factors.
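
The idea that hubs shift between conditions can be sketched with a toy example. The Python code below is a deliberately simplified stand-in: it is not the published trace-back algorithm and not SANDY. It assumes that a regulatory edge is "active" in a condition when both the transcription factor and its target appear in that condition's set of expressed genes, and it ranks transcription factors by the number of active targets. All gene names and data are hypothetical.

```python
from collections import Counter


def active_subnetwork(edges, active_genes):
    """Keep edges whose regulator and target are both active (assumed rule)."""
    return [(tf, target) for tf, target in edges
            if tf in active_genes and target in active_genes]


def hubs(edges, top_n=1):
    """Return the top_n regulators ranked by number of outgoing edges."""
    out_degree = Counter(tf for tf, _ in edges)
    # Sort by descending degree, then by name so that ties are deterministic.
    ranked = sorted(out_degree.items(), key=lambda item: (-item[1], item[0]))
    return [tf for tf, _ in ranked[:top_n]]


# Hypothetical static network of (transcription factor, target) pairs.
edges = [
    ("TF_A", "g1"), ("TF_A", "g2"), ("TF_A", "g3"),
    ("TF_B", "g3"), ("TF_B", "g4"), ("TF_B", "g5"),
]

# Hypothetical sets of genes expressed in two conditions.
conditions = {
    "cell_cycle": {"TF_A", "TF_B", "g1", "g2", "g3"},
    "stress": {"TF_A", "TF_B", "g3", "g4", "g5"},
}

hubs_by_condition = {
    name: hubs(active_subnetwork(edges, genes))
    for name, genes in conditions.items()
}
print(hubs_by_condition)
```

In this toy network the top-ranked regulator differs between the two conditions even though the underlying static network is the same.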

 

We have quantitatively evaluated the importance of non-essential genes based on protein-protein interaction networks and the topological characteristics of those networks (Yu et al., 2004). Non-essential genes make small but significant contributions to the fitness of the cell, although the effects may not be large enough to be detected by conventional methods. We have introduced "marginal essentiality", which we define as a quantitative measure of a non-essential gene's importance to a cell. We calculated marginal essentiality values for yeast genes using the results from a diverse set of four large-scale functional genomics experiments, each examining a different aspect of a protein's impact on cell fitness. Marginally essential genes tend to occupy network hubs, just as essential genes do, and tend to have more interaction partners than other non-essential genes. Network analyses of this kind make it possible to identify backup genes -- i.e. genes that are not by themselves essential for the survival of an organism, but whose loss is lethal in the absence of another gene.  If extended to pathogens, such studies could be useful in identifying drug targets in these microbes. A drug intended for a specific target protein may fail because of the presence of a backup gene; targeting the backup gene could result in a better antibiotic against the invading pathogen.
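
One way to picture a combined score of this kind is sketched below in Python. The scheme is an assumption made purely for illustration (each dataset is scaled by its own maximum and the scaled values are averaged per gene); it is not presented as the formula used in the paper, and two hypothetical datasets stand in for the four experiments. Gene names, scores, and interactions are invented.

```python
from collections import Counter


def marginal_essentiality(scores_by_dataset):
    """Average each gene's score after scaling every dataset by its maximum.

    scores_by_dataset: dict mapping dataset name -> {gene: non-negative score}
    Assumes every dataset has a positive maximum score.
    """
    totals, counts = {}, {}
    for scores in scores_by_dataset.values():
        max_score = max(scores.values())
        for gene, value in scores.items():
            totals[gene] = totals.get(gene, 0.0) + value / max_score
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: totals[gene] / counts[gene] for gene in totals}


def degrees(interactions):
    """Count interaction partners per gene from undirected pairs."""
    degree = Counter()
    for a, b in interactions:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical fitness-related scores from two made-up assays.
scores_by_dataset = {
    "assay_1": {"geneA": 4.0, "geneB": 1.0},
    "assay_2": {"geneA": 2.0, "geneB": 2.0},
}
# Hypothetical protein-protein interactions.
interactions = [("geneA", "geneC"), ("geneA", "geneD"), ("geneB", "geneC")]

me = marginal_essentiality(scores_by_dataset)
degree = degrees(interactions)
for gene in sorted(me):
    print(gene, me[gene], degree[gene])
```

With real data, one would then ask whether genes with higher scores tend to have higher degree in the interaction network.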

 

iii.            Collaborations on Experimental Proteomics

 

Our collaborations with experimentalists are continuing through a number of centers, as they were last year. 

 

NBC. We participate in the Northeast Biodefense Center (NBC), which involves Columbia, Yale, and a number of other northeastern universities. The Gerstein lab contributes to the informatics core. 

 

CEGS.  Our pseudogene research is undertaken as part of an NIH Center of Excellence in Genomic Sciences (CEGS), which is focused on constructing large human microarrays.  In particular, our pseudogene work forms a valuable backdrop to experimental work aimed at accurately identifying genes and annotating genomes, as well as probing the sequence characteristics of intergenic regions.

 

NESG.  As was the case last year, we have performed a variety of computational analyses designed to interface with experimental structural genomics.  This is the direct experimental complement of the computational analysis proposed in the grant.  Our efforts are part of the Northeast Structural Genomics Consortium (NESG). Continuing last year's efforts, we have designed approaches to pick targets prospectively for subsequent structural analysis, followed by retrospective data mining on the results.

 

Mass-Spec.  Finally, our work relating mRNA abundance and levels of gene expression has become an integral part of the NHLBI/Yale Proteomics Center, which focuses on proteomics and mass spectrometry.  As part of the work associated with this center, we have been collaborating with Ken Williams, Director of the Keck Foundation Biotechnology Resource Laboratory at Yale, on relating mRNA expression and protein abundance.  

 

iv.            Collaboration with Other Keck scholars.  

 

One of the exciting developments resulting from my Keck funding has been the collaborations that have sprouted with other Keck Distinguished Young Scholars from several different classes.  These collaborations have essentially developed through interactions promoted by the Keck symposium. Specifically, these collaborations are:

 

Kevin White, Yale University (2003 awardee).   We are working with him on identifying primers in the human genome for microarrays, and we are also interacting on the construction of genetic networks. We recently published a collaborative paper on microarray analysis (Gilad et al., 2005).

 

John Moran, University of Michigan (2000 awardee).  We have decided to co-organize a FASEB meeting on retro-elements and pseudogenes in eukaryotic genomes. I will chair a session at this meeting in June 2005.

 

Judith Frydman, Stanford University (1999 awardee). Our collaboration focuses on determining whether yeast proteins that are substrates of the chaperonin TRiC have proteomic properties that distinguish them from yeast proteins in general.  We have found some interesting indications that certain folds, such as the WD40, tend to be preferred by the chaperonin, and we are investigating other properties such as function and contact order.  We currently have one paper on this topic about to be submitted.

 

2.            Related Activities (over the past year)

A.           Selected Publications

 

Below are listed all of my papers published during the past two years. Reprints are enclosed for those highlighted "{keck reprint}".  I have made a website of all relevant Keck Foundation publications and URLs: http://bioinfo.mbb.yale.edu/papers/grant/keck.

 

-- 2005 --

 

D Zheng, Z Zhang, PM Harrison, J Karro, N Carriero, M Gerstein (2005). "Integrated pseudogene annotation for human chromosome 22: evidence for transcription." J Mol Biol 349: 27-45. {keck reprint}

P Bertone, M Gerstein, M Snyder (2005). "Applications of DNA tiling arrays to experimental genome annotation and regulatory pathway discovery." Chromosome Res 13: 259-74.

Y Gilad, SA Rifkin, P Bertone, M Gerstein, KP White (2005). "Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles." Genome Res 15: 674-80. {keck reprint}

PM Harrison, D Zheng, Z Zhang, N Carriero, M Gerstein (2005). "Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability." Nucleic Acids Res 33: 2374-83.

D Huber, D Boyd, Y Xia, MH Olma, M Gerstein, J Beckwith (2005). "Use of Thioredoxin as a Reporter To Identify a Subset of Escherichia coli Signal Sequences That Promote Signal Recognition Particle-Dependent Translocation." J Bacteriol 187: 2983-91.

TB Acton, KC Gunsalus, R Xiao, LC Ma, J Aramini, MC Baran, YW Chiang, T Climent, B Cooper, NG Denissova, SM Douglas, JK Everett, CK Ho, D Macapagal, PK Rajan, R Shastry, LY Shih, GV Swapna, M Wilson, M Wu, M Gerstein, M Inouye, JF Hunt, GT Montelione (2005). "Robotic cloning and protein production platform of the northeast structural genomics consortium." Methods Enzymol 394: 210-43.

S Balasubramanian, Y Xia, E Freinkman, M Gerstein (2005). "Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms." Nucleic Acids Res 33: 1710-21. {keck reprint}

H Grosshans, T Johnson, KL Reinert, M Gerstein, FJ Slack (2005). "The temporal patterning microRNA let-7 regulates several transcription factors at the larval to adult transition in C. elegans." Dev Cell 8: 321-30.

V Alexandrov, U Lehnert, N Echols, D Milburn, D Engelman, M Gerstein (2005). "Normal modes for predicting protein motions: a comprehensive database assessment and associated Web tool." Protein Sci 14: 633-43.

NR Voss, M Gerstein (2005). "Calculation of standard atomic volumes for RNA and comparison with proteins: RNA is packed more tightly." J Mol Biol 346: 477-92.

D Greenbaum, A Smith, M Gerstein (2005). "Impediments to database interoperation: legal issues and security concerns." Nucleic Acids Res 33: D3-4.

N Carriero, MV Osier, KH Cheung, PL Miller, M Gerstein, H Zhao, B Wu, S Rifkin, J Chang, H Zhang, K White, K Williams, M Schultz (2005). "A high productivity/low maintenance approach to high-performance computation for biomedicine: four case studies." J Am Med Inform Assoc 12: 90-8.

 

-- 2004 --

ENCODE Project Consortium (2004). "The ENCODE (ENCyclopedia Of DNA Elements) Project." Science 306: 636-40.

U Lehnert, Y Xia, TE Royce, CS Goh, Y Liu, A Senes, H Yu, Z Zhang, DM Engelman, M Gerstein (2004). "Computational analysis of membrane proteins: genomic occurrence, structure prediction and helix interactions." Quarterly Reviews of Biophysics 37: 1-26.

KH Cheung, D Pan, A Smith, M Seringhaus, SM Douglas, M Gerstein (2004). "An XML-Based Approach to Integrating Heterogeneous Yeast Genome Data." International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS): 236-42.

EJ White, O Emanuelsson, D Scalzo, T Royce, S Kosak, EJ Oakeley, S Weissman, M Gerstein, M Groudine, M Snyder, D Schübeler (2004). "DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states." Proc Natl Acad Sci U S A 101: 17771-6.

P Berman, P Bertone, B Dasgupta, M Gerstein, MY Kao, M Snyder (2004). "Fast optimal genome tiling with applications to microarray design and homology search." J Comput Biol 11: 766-85.

P Bertone, V Stolc, TE Royce, JS Rozowsky, AE Urban, X Zhu, JL Rinn, W Tongprasit, M Samanta, S Weissman, M Gerstein, M Snyder (2004). "Global identification of human transcribed sequences with genome tiling arrays." Science 306: 2242-6. {keck reprint}

N Lin, B Wu, R Jansen, M Gerstein, H Zhao (2004). "Information assessment on predicting protein-protein interactions." BMC Bioinformatics 5: 154.

DA Hall, H Zhu, X Zhu, T Royce, M Gerstein, M Snyder (2004). "Regulation of gene expression by a metabolic enzyme." Science 306: 482-4.

A Kumar, M Seringhaus, MC Biery, RJ Sarnovsky, L Umansky, S Piccirillo, M Heidtman, KH Cheung, CJ Dobry, MB Gerstein, NL Craig, M Snyder (2004). "Large-scale mutagenesis of the yeast genome using a Tn7-derived multipurpose transposon." Genome Res 14: 1975-86.

R Jansen, M Gerstein (2004). "Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction." Curr Opin Microbiol 7: 535-45.

NM Luscombe, MM Babu, H Yu, M Snyder, SA Teichmann, M Gerstein (2004). "Genomic analysis of regulatory network dynamics reveals large topological changes." Nature 431: 308-12. {keck reprint}

Y Liu, PM Harrison, V Kunin, M Gerstein (2004). "Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes." Genome Biol 5: R64.

Z Zhang, M Gerstein (2004). "Large-scale analysis of pseudogenes in the human genome." Curr Opin Genet Dev 14: 328-35.

Z Wunderlich, TB Acton, J Liu, G Kornhaber, J Everett, P Carter, N Lan, N Echols, M Gerstein, B Rost, GT Montelione (2004). "The protein target list of the Northeast Structural Genomics Consortium." Proteins 56: 181-7.

MM Babu, NM Luscombe, L Aravind, M Gerstein, SA Teichmann (2004). "Structure and evolution of transcriptional regulatory networks." Curr Opin Struct Biol 14: 283-91.

Y Xia, H Yu, R Jansen, M Seringhaus, S Baxter, D Greenbaum, H Zhao, M Gerstein (2004). "Analyzing cellular biochemistry in terms of molecular networks." Annu Rev Biochem 73: 1051-87. {keck reprint}

JL Rinn, JS Rozowsky, IJ Laurenzi, PH Petersen, K Zou, W Zhong, M Gerstein, M Snyder (2004). "Major molecular differences between mammalian sexes are involved in drug metabolism and renal function." Dev Cell 6: 791-800.

 

Supplementary Exhibits


Exhibit 1: Laboratory Personnel

(as of Jan. '05)

Research Scientists

Joel Rozowsky (Assoc.)
Suganthi Balasubramanian (Assoc.)
Valery Trifonov (partial commitment)
Nick Carriero (partial commitment)

Systems People

Mihali Felipe
Michael Wilson

Post-doctoral Associates & Fellows

Anne Counterman, NIH fellowship
Olof Emanuelsson, Wallenberg Foundation
Chern-Sing Goh, NIH fellowship
Philip Kim
Long (Jason) Lu
Alberto Paccanaro
Alexander Karpikov
Rajkumar Sasidharan
Yu (Brandon) Xia, Jane Coffin Childs
Deyou Zheng

PhD Students

Paul Bertone (MCDB)
Samuel Flores (Physics)
Tara Gianoulis (CBB)
Thomas Royce (CBB)
Michael Seringhaus (MB&B), NSERC
Andrew Smith (CS)
Haiyuan Yu (MB&B)

Medical Students

Yuen-Jong Liu

Undergrads

Brian Wayda
David Lu
Chantelle Southerland
Michael Schwartz
Scott Zhu
Stephen Kriss

Assistant

Joann DelVecchio

Exhibit 2: Teaching

 

Genomics & Bioinformatics (MB&B 452a/752a)

 
Co-taught whole-semester course, with D Söll and M Snyder, for graduates & undergraduates. My contribution is the bioinformatics half of the course (focused on sequence analysis, databases, large-scale surveys, datamining, expression analysis and macromolecular geometry) and grading. I set up an extensive web site for this course (http://bioinfo.mbb.yale.edu/mbb452a).