During the past academic year (2003 to 2004), the Gerstein lab has participated in the Yale CEGS in three ways:
1. We have developed useful tools for large-scale microarray analysis.
2. We have developed methods for annotating intergenic regions in the human genome, focusing on assigning genes and pseudogenes.
3. We have applied these microarray tools and our gene and pseudogene identifications in collaborations with the Snyder and Weissman laboratories.
This year we have no specific publications on tool development. In general, our group is developing novel microarray technologies and applying them to the large-scale mapping of genetic features in humans and model organisms. We are designing maximal-coverage DNA amplicon arrays and high-density oligonucleotide tiling arrays to empirically annotate transcriptional activity in the human genome (Rinn et al. 2003), identify transcription factor binding sites on a chromosome-wide scale in humans (Horak et al. 2002; Euskirchen et al. 2004; Martone et al. 2003) and yeast (Horak et al. 2002), and measure global changes in gene expression during specific cellular developmental stages (Horak et al. 2002; Lian et al. 2001; Lian et al. 2002; Rinn et al. 2004). These approaches facilitate the construction of gene regulatory networks based on experimental (Horak et al. 2002) and computational (Qian et al. 2003; Yu et al. 2003; Zhang and Gerstein 2003; Jansen et al. 2003) inference. We use these emerging technologies to answer fundamental biological questions through an integrated design/experiment/analysis methodology. This requires the tandem development of new algorithms for DNA microarray design (Berman et al. 2002), computational methods for efficient microarray data analysis (Luscombe et al. 2003; Kluger et al. 2003; Bertone and Gerstein 2001), and database systems to coordinate the capture of experimental information (Cheung et al. 2002).
Developing computational approaches for the annotation of intergenic regions goes hand-in-hand with experimental characterization of these regions. We have focused particularly on the identification and characterization of pseudogenes in intergenic regions. In addition to the evolutionary importance of pseudogenes, their characterization is crucial for accurately annotating genes and for understanding the degree of cross-hybridization in tiling microarray experiments.
Last year, building on our previous work, we extended our analysis and understanding of pseudogenes with the following studies. We integrated and reviewed studies and data on human pseudogenes (including duplicated and processed pseudogenes), and analyzed the methods and results of different research groups (Zhang & Gerstein, in press). We also discussed pseudogenes from animals other than human, and presented ideas about how some pseudogenes could function in the genome. This review should help to improve the accuracy of gene annotation in the human genome. We systematically identified about 5000 pseudogenes in the mouse genome. We compared pseudogenes in mouse and human, and analyzed their lineages. We found that similar types of genes (housekeeping genes and ribosomal protein genes) give rise to many processed pseudogenes in both organisms (Zhang et al. 2004). Using a pipeline we developed, we created a comprehensive catalog of about 8000 processed pseudogenes in the human genome, based on a set of defined criteria (such as intron absence, frame disruption, polyadenylation, and truncation). These pseudogenes were examined for features such as GC content and similarity to Alu sequences and L1 repeats (Zhang et al. 2003). To define and characterize pseudogenes, we analyzed the patterns of nucleotide substitution, insertion and deletion (indel) in ribosomal pseudogenes, and inferred how these patterns drive and shape the human genome (Zhang & Gerstein 2003).
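As an illustration of two of the features mentioned above (frame disruption and GC content), the following is a minimal Python sketch. It is not the pipeline described in the cited papers. The function names, the use of a pairwise alignment with "-" as the gap character, and the simple stop-codon rule are all assumptions made for the example.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G/C among the unambiguous A/C/G/T bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def premature_stops(seq):
    """0-based codon indices of in-frame stop codons, ignoring the last codon.

    Assumes seq starts in frame 0 (hypothetical convention for this sketch).
    """
    seq = seq.upper()
    n_codons = len(seq) // 3
    return [i for i in range(n_codons - 1)
            if seq[3 * i:3 * i + 3] in STOP_CODONS]


def has_frameshift(aligned_parent, aligned_candidate):
    """True if any gap run in either aligned sequence is not a multiple of 3.

    Both arguments are rows of the same pairwise alignment, with '-' as gap.
    """
    if len(aligned_parent) != len(aligned_candidate):
        raise ValueError("aligned sequences must have equal length")
    for row in (aligned_parent, aligned_candidate):
        if any(len(run) % 3 != 0 for run in re.findall(r"-+", row)):
            return True
    return False


def is_frame_disrupted(aligned_parent, aligned_candidate):
    """Flag a candidate with a frameshift or a premature in-frame stop."""
    ungapped = aligned_candidate.replace("-", "")
    return (has_frameshift(aligned_parent, aligned_candidate)
            or bool(premature_stops(ungapped)))
```

For example, is_frame_disrupted("ATGGCCAAATGA", "ATG-CCAAATGA") flags the candidate because of the single-base deletion, whereas a three-base deletion alone would not be flagged. A real catalog would combine such checks with the other criteria listed above (intron absence, polyadenylation, truncation), which this sketch does not attempt.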
During the past year, we coupled our microarray analysis tools with our gene and pseudogene assignments to aid specific experimental collaborations associated with the CEGS Center. We used the chromosome 22 amplicon microarrays that we developed to investigate transcription factor (TF) binding. Using chromatin immunoprecipitation followed by hybridization to these microarrays (so-called 'ChIP-chip'), we were able to map the binding sites of NF-kappaB (Martone et al. 2003) and CREB (Euskirchen et al. 2004) on chromosome 22. We also performed a detailed analysis of the differences in gene expression between the sexes in different somatic and reproductive tissues of both mouse and human (Rinn et al. 2004). Using Affymetrix GeneChips, we identified a number of genes that show significant differences between the sexes and that might play a role in drug metabolism and renal function.