During the past academic year (2003 to 2004), the Gerstein lab has participated in the Yale CEGS in a number of ways:


1.  We have developed useful tools for large-scale microarray analysis.


2.  We have developed methods for annotating intergenic regions in the human genome, focusing on assigning genes and pseudogenes. 


3.  We have applied these microarray tools, together with our gene and pseudogene identifications, in collaborations with the Snyder and Weissman laboratories.



Tools Development


This year we have no specific publications on tool development. In general, our group is developing novel microarray technologies and applying them to the large-scale mapping of genetic features in humans and model organisms. We are designing maximal-coverage DNA amplicon arrays and high-density oligonucleotide tiling arrays to empirically annotate transcriptional activity in the human genome (Rinn et al. 2003), identify transcription factor binding sites on a chromosome-wide scale in humans (Horak et al. 2002; Euskirchen et al. 2004; Martone et al. 2003) and yeast (Horak et al. 2002), and measure global changes in gene expression during specific cellular developmental stages (Horak et al. 2002; Lian et al. 2001; Lian et al. 2002; Rinn et al. 2004). These approaches facilitate the construction of gene regulatory networks based on experimental (Horak et al. 2002) and computational (Qian et al. 2003; Yu et al. 2003; Zhang and Gerstein 2003; Jansen et al. 2003) inference. We use these emerging technologies to answer fundamental biological questions through an integrated design/experiment/analysis methodology. This requires the tandem development of new algorithms for DNA microarray design (Berman et al. 2002), computational methods for efficient microarray data analysis (Luscombe et al. 2003; Kluger et al. 2003; Bertone and Gerstein 2001), and database systems to coordinate the capture of experimental information (Cheung et al. 2002).


Intergenic Annotation, Focusing on Pseudogenes


Developing computational approaches for the annotation of intergenic regions goes hand-in-hand with experimental characterization of these regions. We have focused particularly on the identification and characterization of pseudogenes in intergenic regions. In addition to the evolutionary importance of pseudogenes, their characterization is crucial for accurately annotating genes and for understanding the degree of cross-hybridization in tiling microarray experiments.

During the past year, we extended our earlier analysis of pseudogenes with the following studies. We integrated and reviewed studies and data on human pseudogenes (both duplicated and processed), and compared the methods and results of different research groups (Zhang & Gerstein, in press). That review also discusses pseudogenes in animals other than human and presents ideas about how some pseudogenes could function in the genome; it should help improve the accuracy of gene annotation in the human genome. We systematically identified about 5000 pseudogenes in the mouse genome, compared pseudogenes in mouse and human, and analyzed their lineages. We found that similar types of genes (housekeeping genes and ribosomal protein genes) give rise to many processed pseudogenes in both organisms (Zhang et al. 2004). Using a pipeline we developed, we created a comprehensive catalog of about 8000 processed pseudogenes in the human genome, based on a set of defined criteria (such as intron absence, frame disruption, polyadenylation, and truncation). These pseudogenes were examined for features such as GC content and similarity to Alu sequences and L1 repeats (Zhang et al. 2003). To define and characterize pseudogenes, we analyzed the patterns of nucleotide substitution, insertion, and deletion (indels) in ribosomal protein pseudogenes, and inferred how these patterns drive and shape the human genome (Zhang & Gerstein 2003).
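
As a rough illustration of two of the sequence features mentioned above (frame disruption and GC content), the following is a minimal Python sketch. It is not the pipeline described in this report: the function names `gc_content` and `has_frame_disruption` are hypothetical, and the sketch assumes the candidate sequence has already been aligned to its parent coding sequence, starts in frame, and has had alignment gaps removed.

```python
# Hypothetical helpers for illustration only; not the lab's actual pipeline.

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous A/C/G/T bases."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def has_frame_disruption(aligned_seq: str) -> bool:
    """Flag a frameshift or a premature in-frame stop codon.

    Assumes aligned_seq begins in the parent gene's reading frame.
    A length that is not a multiple of three is treated as a frameshift;
    any stop codon before the final codon is treated as premature.
    """
    seq = aligned_seq.upper()
    if len(seq) % 3 != 0:
        return True
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])
```

For example, `has_frame_disruption("ATGTAAAAATAA")` flags the in-frame `TAA` at the second codon, whereas `"ATGAAATAA"` (a stop only at the final codon) is not flagged. A real classification would combine such checks with the other criteria listed above, such as intron absence, polyadenylation, and truncation.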


Experimental Collaborations


During the past year, we combined our microarray analysis tools with our gene and pseudogene assignments to aid specific experimental collaborations associated with the CEGS. We used the chromosome 22 amplicon microarrays that we developed to investigate transcription factor (TF) binding. Using chromatin immunoprecipitation followed by hybridization to these microarrays (so-called 'ChIP-chip'), we mapped the binding sites of NF-kappaB (Martone et al. 2003) and CREB (Euskirchen et al. 2004) on chromosome 22. We also performed a detailed analysis of differences in gene expression between the sexes in several somatic and sexual tissues of both mouse and human (Rinn et al. 2004). Using Affymetrix GeneChips, we identified a number of genes with significant expression differences between the sexes that might play a role in drug metabolism and renal function.


CEGS Publications
(supported directly by the 2003 to 2004 CEGS funds)


CREB binds to multiple loci on human chromosome 22.
G Euskirchen, TE Royce, P Bertone, R Martone, JL Rinn, FK Nelson, F Sayward, NM Luscombe, P Miller, M Gerstein, S Weissman, M Snyder (2004) Mol Cell Biol 24: 3804-14.

Distribution of NF-kappaB-binding sites across human chromosome 22.
R Martone, G Euskirchen, P Bertone, S Hartman, TE Royce, NM Luscombe, JL Rinn, FK Nelson, P Miller, M Gerstein, S Weissman, M Snyder (2003) Proc Natl Acad Sci U S A 100: 12247-52.

Major molecular differences between mammalian sexes are involved in drug metabolism and renal function.
JL Rinn, JS Rozowsky, IJ Laurenzi, PH Petersen, K Zou, W Zhong, M Gerstein, M Snyder (2004) Dev Cell 6: 791-800.

Large-scale analysis of pseudogenes in the human genome.
Z Zhang, M Gerstein. Curr Opin Genet Dev (in press).

Comparative analysis of processed pseudogenes in the mouse and human genomes.
Z Zhang, N Carriero, M Gerstein (2004) Trends Genet 20: 62-7.

Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.
Z Zhang, PM Harrison, Y Liu, M Gerstein (2003) Genome Res 13: 2541-58.

Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes.
Z Zhang, M Gerstein (2003) Nucleic Acids Res 31: 5338-48.


Other Publications
(not supported directly by the 2003 to 2004 CEGS funds or from an earlier period of CEGS funding)


The transcriptional activity of human Chromosome 22.
JL Rinn, G Euskirchen, P Bertone, R Martone, NM Luscombe, S Hartman, PM Harrison, FK Nelson, P Miller, M Gerstein, S Weissman, M Snyder (2003) Genes Dev 17: 529-40.

GATA-1 binding sites mapped in the beta-globin locus by using mammalian chIp-chip analysis.
CE Horak, MC Mahajan, NM Luscombe, M Gerstein, SM Weissman, M Snyder (2002) Proc Natl Acad Sci U S A 99: 2924-9.

Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data.
J Qian, J Lin, NM Luscombe, H Yu, M Gerstein (2003) Bioinformatics 19: 1917-26.

Genomic analysis of gene expression relationships in transcriptional regulatory networks.
H Yu, NM Luscombe, J Qian, M Gerstein (2003) Trends Genet 19: 422-7.

Reconstructing genetic networks in yeast.
Z Zhang, M Gerstein (2003) Nat Biotechnol 21: 1295-7.

A Bayesian networks approach for predicting protein-protein interactions from genomic data.
R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder, JF Greenblatt, M Gerstein (2003) Science 302: 449-53.

Fast optimal genome tiling with applications to microarray design and homology search.
P Berman, P Bertone, B DasGupta, M Gerstein, M-Y Kao, M Snyder (2002) Proceedings of the 2nd International Workshop on Algorithms in Bioinformatics. Springer-Verlag LNCS 2452: 419-33.

ExpressYourself: A modular platform for processing and visualizing microarray data.
NM Luscombe, TE Royce, P Bertone, N Echols, CE Horak, JT Chang, M Snyder, M Gerstein (2003) Nucleic Acids Res 31: 3477-82.

Spectral biclustering of microarray data: coclustering genes and conditions.
Y Kluger, R Basri, JT Chang, M Gerstein (2003) Genome Res 13: 703-16.

Integrative data mining: the new direction in bioinformatics.
P Bertone, M Gerstein (2001) IEEE Eng Med Biol Mag 20: 33-40.

YMD: a microarray database for large-scale gene expression analysis.
KH Cheung, K White, J Hager, M Gerstein, V Reinke, K Nelson, P Masiar, R Srivastava, Y Li, J Li, H Zhao, J Li, DB Allison, M Snyder, P Miller, K Williams (2002) Proc AMIA Symp 140-4.

Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae.
CE Horak, NM Luscombe, J Qian, P Bertone, S Piccirrillo, M Gerstein, M Snyder (2002) Genes Dev 16: 3017-33.

Genomic and proteomic analysis of the myeloid differentiation program.
Z Lian, L Wang, S Yamaga, W Bonds, Y Beazer-Barclay, Y Kluger, M Gerstein, PE Newburger, N Berliner, SM Weissman (2001) Blood 98: 513-24.

Genomic and proteomic analysis of the myeloid differentiation program: global analysis of gene expression during induced differentiation in the MPRO cell line.
Z Lian, Y Kluger, DS Greenbaum, D Tuck, M Gerstein, N Berliner, SM Weissman, PE Newburger (2002) Blood 100: 3209-20.