During the past academic year (2004 to 2005), the Gerstein lab has participated in the Yale CEGS in a number of ways:

1.  We have developed useful tools for large-scale microarray analysis.

2.  We have developed methods for annotating intergenic regions in the human genome, focusing on assigning genes and pseudogenes. 

3.  We have applied these microarray tools and our gene and pseudogene identifications in collaborations with the Snyder and Weissman laboratories.

Tools Development

In general, our group is developing novel microarray technologies and applying them to the large-scale mapping of genetic features in humans and model organisms.


Normalization Algorithms


We published a paper on normalization approaches for tiling microarray experiments (Royce et al., 2005).


Algorithms for normalizing DNA microarray data are becoming more and more standardized.  However, new challenges arise as the scope of the array technology evolves.  The last few years have seen the advent of so-called 'tiling' microarrays, in which the arrays' probes represent sequences that span stretches of genomic DNA.  Normalization procedures are not necessarily transferable from traditional gene-based arrays.  Our work has included identifying which procedures from the microarray normalization literature are transferable to tiling arrays and which need to be modified or replaced.  One issue is that potentially only a small fraction of the probes on an array experiences significant levels of nucleic acid hybridization, whereas many normalization procedures assume hybridization to at least half of all probes on an array.  Another issue is that tiling a whole genome, for example, requires many arrays, each of a different design.  Care must be taken when normalizing these arrays to one another, because the underlying distributions of measured signals from arrays tiling different regions of the genome may be very different.  Our work mostly calls for care when designing the arrays, so that each array is expected to have a similar intensity distribution.  This can be achieved with completely randomized designs, for example.
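
As a minimal illustration of the points above, the following Python sketch deals probes onto arrays at random, so that every array samples the same underlying signal distribution, and then rescales the arrays to a common median. It also includes a simple check of what fraction of probes exceeds a background cutoff, since many procedures assume that at least half do. The function names, the round-robin assignment, and the median scaling are illustrative assumptions, not the procedures evaluated in Royce et al. (2005).

```python
import random
import statistics


def randomized_design(probe_ids, n_arrays, seed=0):
    """Shuffle probes and deal them round-robin onto n_arrays arrays.

    Hypothetical helper: a completely randomized assignment, so that no
    array is tied to one genomic region.
    """
    rng = random.Random(seed)
    shuffled = list(probe_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_arrays] for i in range(n_arrays)]


def median_scale(arrays):
    """Rescale each array so that all arrays share one median intensity.

    Assumes every array has a non-zero median.
    """
    medians = [statistics.median(a) for a in arrays]
    target = statistics.mean(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def fraction_above(intensities, background):
    """Fraction of probes whose signal exceeds a background cutoff."""
    return sum(x > background for x in intensities) / len(intensities)


if __name__ == "__main__":
    design = randomized_design(range(12), n_arrays=3, seed=42)
    print("probes per array:", [len(a) for a in design])
    print("scaled:", median_scale([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]))
```

Median scaling is only reasonable when the arrays are expected to have similar distributions, which is what the randomized design is meant to provide.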



Tiling Algorithms


We have explored numerous options for tiling genomic sequences with oligonucleotides, leading to microarray designs of various sequence resolutions and feature densities. In addition, we developed an efficient method for rapidly determining the degree of uniqueness of oligonucleotide sequences to compensate for cross-hybridization effects. We have leveraged these techniques to construct a series of oligonucleotide arrays and used them to map transcriptional activity across the human genome at high resolution (Bertone et al. 2004). In terms of amplicon (PCR-based) tiling arrays, we have devised space- and time-efficient algorithms for generating optimal tile paths to improve the coverage of non-repetitive sequences while minimizing the number of repetitive nucleotides included. In this manner, a greater number of fragments of sufficient size are recovered for amplification, and a higher percentage of non-repetitive DNA is represented on the array. Using these methods, we have constructed amplicon arrays spanning all non-repetitive DNA of human chromosome 22; these have been used to map transcriptional activity (Rinn et al. 2003), identify transcription factor binding sites (Martone et al. 2003, Euskirchen et al. 2004), and characterize the timing of chromosome replication (White et al. 2004).
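
The tile-path idea for amplicon arrays can be sketched as a small dynamic program. Given a repeat mask for a sequence, it selects non-overlapping tiles whose lengths fall within an allowed range, maximizing the number of non-repetitive bases covered minus the number of repetitive bases included. This is a simple quadratic-time illustration under assumed scoring (+1 per non-repeat base, -1 per repeat base); it is not the space- and time-efficient algorithm described in Berman et al., and the function name and parameters are hypothetical.

```python
def best_tile_path(is_repeat, min_len, max_len):
    """Choose non-overlapping tiles maximizing (non-repeat - repeat) bases.

    is_repeat: one boolean per base (True = repeat-masked).
    Returns (score, tiles); tiles are half-open (start, end) intervals.
    """
    n = len(is_repeat)

    # prefix[i] = score of the first i bases (+1 non-repeat, -1 repeat).
    prefix = [0] * (n + 1)
    for i, rep in enumerate(is_repeat):
        prefix[i + 1] = prefix[i] + (-1 if rep else 1)

    best = [0] * (n + 1)          # best[i]: best score using bases [0, i)
    choice = [None] * (n + 1)     # start of the tile ending at i, if any
    for i in range(1, n + 1):
        best[i] = best[i - 1]     # option 1: leave base i-1 uncovered
        for length in range(min_len, min(max_len, i) + 1):
            start = i - length    # option 2: end a tile [start, i) here
            cand = best[start] + prefix[i] - prefix[start]
            if cand > best[i]:
                best[i] = cand
                choice[i] = start

    # Trace back the chosen tiles from the end of the sequence.
    tiles = []
    i = n
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            tiles.append((choice[i], i))
            i = choice[i]
    tiles.reverse()
    return best[n], tiles


if __name__ == "__main__":
    # Toy mask: 6 unique bases, 4 repeat bases, 5 unique bases.
    mask = [False] * 6 + [True] * 4 + [False] * 5
    print(best_tile_path(mask, min_len=3, max_len=8))
```

The minimum tile length stands in for the requirement that fragments be of sufficient size for amplification.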


These algorithms have been implemented as a collection of web-based tools that are available to the general public at tiling.gersteinlab.org.  Users of the system may upload sequences of arbitrary size and produce optimal tile paths for either oligonucleotide or amplicon-based array designs. For the latter platform, a secondary facility for the automated design of PCR primers is provided, allowing large-scale amplification of the subsequence tiles that comprise the array. Together, these approaches support the design and construction of both oligonucleotide and amplicon arrays for the discovery of novel functional elements in eukaryotic genomes.
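
For oligonucleotide designs, the degree of uniqueness of a candidate probe can be illustrated with a simple k-mer count. The sketch below scores an oligonucleotide by the fraction of its k-mers that occur exactly once in the target sequence. This is an assumed, simplified scoring scheme for illustration only: it is not the efficient method referred to above, it ignores the reverse strand, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(genome, k):
    """Count every k-mer on the forward strand of the genome string."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def oligo_uniqueness(oligo, counts, k):
    """Fraction of the oligo's k-mers that occur exactly once in the genome."""
    kmers = [oligo[i:i + k] for i in range(len(oligo) - k + 1)]
    if not kmers:
        return 0.0
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCA"          # toy sequence, not real data
    counts = kmer_counts(genome, k=4)
    for oligo in ("ACGTAC", "TTGACC"):
        print(oligo, oligo_uniqueness(oligo, counts, k=4))
```

Probes with a low score share subsequences with other locations and are more likely to cross-hybridize.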


Some of this work was described in Berman et al. (2004). We are currently working to publish a further manuscript on this subject.


NimbleGen Array Tracking Database


We developed the core of the NimbleGen array tracking database. Using this database, investigators can submit information about individual experiments and view information about experiments that have already been conducted. We are currently developing microarray inventory and ordering functions for the database.



Intergenic Annotation, Focusing on Pseudogenes

Developing computational approaches for the annotation of intergenic regions goes hand-in-hand with experimental characterization of these regions. We have focused particularly on the identification and characterization of pseudogenes in intergenic regions. In addition to the evolutionary importance of pseudogenes, their characterization is crucial for accurately annotating genes and for understanding the degree of cross-hybridization in tiling microarray experiments.

Last year, building on our previous studies, we extended our analysis and understanding of pseudogenes with the following two studies.

In Zheng et al. (2005), we took advantage of recently emerged tiling-microarray data showing that intergenic regions (which contain pseudogenes) are transcribed to a great degree, and we focused on the transcriptional activity of pseudogenes on human chromosome 22. First, we integrated several sets of annotation to define a unified list of 525 pseudogenes on the chromosome. To characterize these further, we developed a comprehensive list of genomic features based on conservation in related organisms, expression evidence, and the presence of upstream regulatory sites. Of the 525 unified pseudogenes, we could confidently classify 154 as processed and 49 as duplicated. Using data from tiling microarrays, especially from recent high-resolution oligonucleotide arrays, we found some evidence that up to a fifth of the 525 pseudogenes are potentially transcribed. Comparison with expressed sequence tags (ESTs) further validated a number of these, and overall we found 17 pseudogenes with strong support for transcription. In particular, one of the pseudogenes with both EST and microarray evidence for transcription turned out to be a duplicated pseudogene in the cat eye syndrome critical region. Although we could not identify a meaningful number of transcription factor-binding sites (based on chromatin immunoprecipitation-chip data) near pseudogenes, we did find that approximately 12% of the pseudogenes had upstream CpG islands. Finally, analysis of corresponding syntenic regions in the mouse, rat and chimp genomes indicates, as previously suggested, that pseudogenes are less conserved than genes, but more preserved than the intergenic background (all annotation is available from http://www.pseudogene.org).
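
The step of intersecting pseudogene annotation with tiling-array evidence can be pictured as an interval-overlap test. The Python sketch below flags pseudogenes that overlap at least one transcribed region and reports the fraction flagged. The coordinates and names are invented toy values, and requiring only a single-base overlap is an assumption made for illustration; the criteria actually used in Zheng et al. (2005) are not reproduced here.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def transcribed_fraction(pseudogenes, transcribed_regions):
    """Return (fraction, names) of pseudogenes overlapping any transcribed region.

    pseudogenes: dict mapping a name to a (start, end) interval.
    transcribed_regions: list of (start, end) intervals from array data.
    """
    hits = sorted(
        name
        for name, interval in pseudogenes.items()
        if any(overlaps(interval, region) for region in transcribed_regions)
    )
    return len(hits) / len(pseudogenes), hits


if __name__ == "__main__":
    # Hypothetical coordinates on a single chromosome, not real annotation.
    toy_pseudogenes = {"psA": (100, 400), "psB": (1000, 1300), "psC": (5000, 5200)}
    toy_regions = [(350, 500), (4000, 4100)]
    print(transcribed_fraction(toy_pseudogenes, toy_regions))
```

For chromosome-scale annotation, sorting both interval lists and sweeping through them once would avoid the all-against-all comparison used here.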

In Harrison et al. (2005), we survey for an intermediate entity, the transcribed processed pseudogene (TPPsig), which is disabled but nonetheless transcribed. TPPsigs may affect expression of paralogous genes, as observed in the case of the mouse makorin1-p1 TPPsig. To elucidate their role, we identified human TPPsigs by mapping expressed sequences onto processed pseudogenes (PPsigs) and, reciprocally, extracting TPPsigs from known mRNAs. We consider only those PPsigs that are homologous to either non-mammalian eukaryotic proteins or protein domains of known structure, and require detection of identical coding-sequence disablements in both the expressed and genomic sequences. Oligonucleotide microarray data provide further expression verification. Overall, we find 166-233 TPPsigs (approximately 4-6% of PPsigs). Proteins/transcripts with the highest numbers of homologous TPPsigs generally have many homologous PPsigs and are abundantly expressed. TPPsigs are significantly over-represented near both the 5' and 3' ends of genes; this suggests that TPPsigs can be formed through gene-promoter co-option, or intrusion into untranslated regions. However, roughly half of the TPPsigs are located away from genes in the intergenic DNA and thus may be co-opting cryptic promoters of undesignated origin. Furthermore, TPPsigs are unlike other PPsigs and processed genes in the following ways: (i) they do not show a significant tendency to either deposit on or originate from the X chromosome; (ii) only 5% of human TPPsigs have potential orthologs in mouse. This latter finding indicates that the vast majority of TPPsigs are lineage specific, which is likely linked to well-documented, extensive lineage-specific SINE/LINE activity. The list of TPPsigs is available at http://www.pseudogene.org.
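
The requirement that the same disablement appear in both the expressed and the genomic sequence can be illustrated with premature stop codons. The sketch below lists in-frame stop codons that occur before the final codon of a sequence and reports those shared by two sequences. It assumes that both sequences are already aligned, ungapped, and in the same reading frame, and it ignores frameshifts; these are simplifying assumptions for illustration, and the function names are hypothetical rather than taken from Harrison et al. (2005).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(seq):
    """Codon indices of in-frame stop codons that precede the final codon."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return {i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS}


def shared_disablements(genomic, expressed):
    """Premature stops found at the same codon index in both sequences."""
    return premature_stops(genomic) & premature_stops(expressed)


if __name__ == "__main__":
    # Toy sequences, not real pseudogene data.
    genomic = "ATGTAAGGCTGA"
    print(shared_disablements(genomic, "ATGTAAGGCTGA"))
    print(shared_disablements(genomic, "ATGCAAGGCTGA"))
```

A disablement present only in the genomic copy would suggest that the expressed sequence derives from a different locus, such as the parent gene.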

Experimental Collaborations

During the past year, we coupled our microarray analysis tools with our gene and pseudogene assignments to aid specific experimental collaborations associated with the CEGS Center.


Published Work


The results of a number of these collaborations were described in Bertone et al. (2004) and White et al. (2004).



In Gilad et al. (2005), we collaborated on an evaluation of the effect of sequence divergence on gene expression profiles in multi-species microarrays.

In Grosshans et al. (2005), we developed initial methods to find targets of microRNAs. We are investigating ways of applying these in our intergenic characterization.

We also have a number of ongoing projects that have not yet resulted in publications.


Tiling-array-based method for mapping 5' and 3' ends


We are developing a method to separately enrich for the 5' and 3' ends of mRNA during modified cDNA synthesis and subsequent PCR amplification. We have used the Yale Chromosome 22 PCR amplicon tiling array as well as NimbleGen oligo tiling arrays. Hybridizations of the first two sample pairs were successful, and the resulting data are being analyzed. This method will add a further layer of specificity and resolution to transcriptome studies and will be useful for mapping known and new transcripts and their splice variants.


STAT1 ChIP-chip


We implemented a Java program to analyze ChIP-chip experiment results. Analysis of the STAT1 ChIP-chip datasets with this program revealed a number of binding sites of this transcription factor in the human genome. We also compared various ChIP-chip platforms and experimental designs by examining both the similarities and the differences among the analysis results.

Core CEGS articles published in the last year



Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping

TE Royce, JS Rozowsky, P Bertone, M Samanta, V Stolc, S Weissman, M Snyder, M Gerstein (2005) Trends Genet (in press).


Integrated pseudogene annotation for human chromosome 22: evidence for transcription.

D Zheng, Z Zhang, PM Harrison, J Karro, N Carriero, M Gerstein (2005) J Mol Biol 349: 27-45.


Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability.

PM Harrison, D Zheng, Z Zhang, N Carriero, M Gerstein (2005) Nucleic Acids Res 33: 2374-83.



DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states.

EJ White, O Emanuelsson, D Scalzo, T Royce, S Kosak, EJ Oakeley, S Weissman, M Gerstein, M Groudine, M Snyder, D Schübeler (2004) Proc Natl Acad Sci U S A 101: 17771-6.


Fast optimal genome tiling with applications to microarray design and homology search.

P Berman, P Bertone, B Dasgupta, M Gerstein, MY Kao, M Snyder (2004) J Comput Biol 11: 766-85.


Global identification of human transcribed sequences with genome tiling arrays.

P Bertone, V Stolc, TE Royce, JS Rozowsky, AE Urban, X Zhu, JL Rinn, W Tongprasit, M Samanta, S Weissman, M Gerstein, M Snyder (2004) Science 306: 2242-6.



Large-scale analysis of pseudogenes in the human genome.

Z Zhang, M Gerstein (2004) Curr Opin Genet Dev 14: 328-35.


Additional papers published in the last year related to CEGS activities


Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles.

Y Gilad, SA Rifkin, P Bertone, M Gerstein, KP White (2005) Genome Res 15: 674-80.


The temporal patterning microRNA let-7 regulates several transcription factors at the larval to adult transition in C. elegans.

H Grosshans, T Johnson, KL Reinert, M Gerstein, FJ Slack (2005) Dev Cell 8: 321-30.


A high productivity/low maintenance approach to high-performance computation for biomedicine: four case studies.

N Carriero, MV Osier, KH Cheung, PL Miller, M Gerstein, H Zhao, B Wu, S Rifkin, J Chang, H Zhang, K White, K Williams, M Schultz (2005) J Am Med Inform Assoc 12: 90-8.