CEGS Progress Report

Progress Report

...

D) New Sequencing Technologies for Genome Analysis

Our grant application described using new sequencing technologies (454 was the only one available at the time) for mapping TF binding sites. To familiarize ourselves with this technology and to evaluate its performance, we opted to determine whether we could sequence a bacterial genome de novo. Several major sequencing groups indicated that the technology was not suitable for this purpose, but there were reasons to suspect the contrary. Therefore, we decided to sequence the genome of Acinetobacter baumannii. This organism is an important and problematic human pathogen, as it is the causative agent of several types of infections including pneumonia, meningitis, septicemia, and urinary tract infections. The genome of A. baumannii was sequenced using ~821K 454 reads of average length 106 bp, resulting in 136 contigs. Using paired-end reads, the number of contigs was reduced to 26, and standard methods (PCR plus Sanger sequencing) were then used for gap closure. Excluding the rDNA repeats, the assembled genome is 3,976,746 base pairs (bp) and has 3830 ORFs. The accuracy is estimated to be greater than 99.92%, and the assembly has fewer than 30 split genes, the most likely form of error. This result is significant for several reasons. First, it demonstrates for the first time that this technology can be used to determine the de novo sequence of a genome with very high accuracy and at substantially lower cost than current methods. Second, it demonstrates that any lab, not just the major centers, can sequence a simple genome. Third, the work is of high biological significance, as this is a major pathogen that is affecting the troops in Iraq. A number of news agencies have featured this work from both the technical and biological perspectives.
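As a rough sanity check on the figures quoted above, the following sketch computes the approximate fold coverage and the number of erroneous bases implied by the accuracy bound. It assumes that "~821K" means 821,000 reads and counts only those reads (not the paired-end reads), so the result is an approximation, not a reported value.

```python
# Figures taken from the paragraph above; "~821K" is read as 821,000 (assumption).
n_reads = 821_000
mean_read_len_bp = 106
genome_len_bp = 3_976_746   # assembled genome, excluding rDNA repeats
min_accuracy = 0.9992       # "greater than 99.92%"

# Total sequenced bases and the resulting average fold coverage.
total_bases = n_reads * mean_read_len_bp
coverage = total_bases / genome_len_bp

# An accuracy above 99.92% bounds the number of erroneous bases from above.
max_errors = (1 - min_accuracy) * genome_len_bp

print(f"total sequenced bases: {total_bases:,}")
print(f"approximate coverage: {coverage:.1f}x")
print(f"upper bound on erroneous bases: {max_errors:.0f}")
```

Under these assumptions the reads amount to roughly 22-fold coverage of the genome, and the accuracy bound corresponds to fewer than about 3,200 erroneous bases.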

(Note: most of the funds for this project came from a Burroughs Wellcome grant; some support was provided by the CEGS.)

...

F) Informatics

In the last year in the Yale CEGS (year 6), we have worked on informatics approaches for scoring tiling arrays, better characterizing the "hits" found in the experiments, and, broadly, doing intergenic annotation of the human genome.

1. Development of Informatics Approaches for Scoring Tiling Arrays

1.a. Using high-resolution oligo arrays to measure human variation (Korbel et al., 2007)

We published a paper on detecting the breakpoints of copy-number variants (CNVs) in humans from Comparative Genome Hybridization (CGH) experiments involving oligonucleotide tiling microarrays (Korbel et al., 2007).

Several algorithms have been developed in recent years to predict large and/or sub-microscopic cytogenetic aberrations from array-CGH data. Those algorithms are generally accepted to exhibit resolutions of 50-100 kb. A novel field our group became interested in is the charting of copy-number variants (CNVs) in the human genome, i.e. deletions and duplications that are usually sub-microscopic and that are not precisely mapped using existing algorithms. In particular, the breakpoints of almost all CNVs reported in humans are unknown (see e.g. the Database of Genomic Variants; http://projects.tcag.ca/variation). As CNVs are an abundant form of genetic variation, and given some evidence that they may be the cause of much of the phenotypic variation seen in the human population, we have focused on developing an approach for fine-mapping CNVs. In particular, we statistically integrated both sequence characteristics associated with the few already known breakpoints and data from high-resolution comparative genome hybridization (HighRes-CGH) experiments in a discrete-valued, bivariate Hidden Markov Model. In anticipation of an upcoming increase in CNV data, we developed two further components: an iterative, 'active' approach that initially scores data with a preliminary model, performs targeted validations, retrains the model, and then rescores; and a flexible parameterization system that intuitively collapses from a full model of 2503 parameters to a core one of only 10.
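The decoding step of such a model can be illustrated with a minimal Viterbi sketch. This is not the BreakPtr implementation: the three states, all probabilities, the binary sequence feature, and the assumption that the two observed variables are conditionally independent given the state are hypothetical choices made only for illustration.

```python
import math

STATES = ("normal", "deletion", "duplication")

# All parameters below are hypothetical and chosen only for illustration.
START = {"normal": 0.98, "deletion": 0.01, "duplication": 0.01}
STAY = 0.95
TRANS = {s: {t: (STAY if s == t else (1 - STAY) / 2) for t in STATES}
         for s in STATES}
# First observed variable: discretized array log-ratio.
EMIT_RATIO = {
    "normal": {"low": 0.10, "mid": 0.80, "high": 0.10},
    "deletion": {"low": 0.80, "mid": 0.15, "high": 0.05},
    "duplication": {"low": 0.05, "mid": 0.15, "high": 0.80},
}
# Second observed variable: a binary sequence feature (hypothetical).
EMIT_FEATURE = {
    "normal": {0: 0.9, 1: 0.1},
    "deletion": {0: 0.6, 1: 0.4},
    "duplication": {0: 0.6, 1: 0.4},
}


def log_emit(state, obs):
    """Log-probability of a (ratio_bin, feature) pair, assuming independence."""
    ratio_bin, feature = obs
    return (math.log(EMIT_RATIO[state][ratio_bin])
            + math.log(EMIT_FEATURE[state][feature]))


def viterbi(observations):
    """Return the most probable state path for a list of (ratio_bin, feature)."""
    scores = {s: math.log(START[s]) + log_emit(s, observations[0])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: scores[s] + math.log(TRANS[s][t]))
            new_scores[t] = (scores[best] + math.log(TRANS[best][t])
                             + log_emit(t, obs))
            pointers[t] = best
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def breakpoints(path):
    """Indices of probes at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Toy probe series: five normal probes, six depleted probes, five normal probes.
observations = [("mid", 0)] * 5 + [("low", 1)] * 6 + [("mid", 0)] * 5
path = viterbi(observations)
print(breakpoints(path))
```

In this toy series the decoded path switches state at the start and end of the depleted run, and those switch positions play the role of predicted breakpoints. The iterative scheme described above would correspond to re-estimating the parameter tables after each round of validation and decoding again.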

Our approach, named 'Break-Pointer' (abbreviated BreakPtr), allowed us to accurately map >400 breakpoints on chromosome 22 and a region of chromosome 11, which refined the boundaries of many CNVs that had previously been mapped only approximately. Four predicted breakpoints flanked known disease-associated deletions. Furthermore, we validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of around 300 bp. This level of resolution enables much more precise correlations between CNVs and across individuals than previously possible, allowing us to obtain population frequency estimates. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.

1.b. Development of Informatics Approaches for Correcting Array Artifacts

In conjunction with the scoring, we have developed approaches for removing a number of artifacts in tiling array experiments. In particular, we have developed a procedure to remove spatial artifacts. We have investigated the presence of cross-hybridization on tiling microarrays. Specifically, we have found it useful to sub-divide the concept of cross-hybridization into two sub-concepts: ubiquitous background cross-hybridization, and semi-specific cross-hybridization.

1.b.i. Development of COP (Yu et al., 2007) -- Correction of Spatial Artifacts

Our study has shown that there are two types of positional artifacts in microarray data that introduce spurious correlations between genes. First, we find that genes that are close on the microarray chips tend to have higher correlations between their expression profiles. We call this the "chip artifact". Our calculations suggested that carry-over during the printing process is one of the major sources of this type of artifact, which was later confirmed by our experiments. Based on our experiments, the measured intensity of a microarray spot contains 0.1% (for fully-hybridized spots) to 93% (for un-hybridized ones) of noise resulting from this artifact. Second, we show for the first time that genes that are close on the microtiter plates in microarray experiments also tend to have higher correlations. We call this the "plate artifact". Both types of artifacts exist with different severity in all cDNA microarray experiments that we analyzed. Based on our analysis, we have developed an automated web tool, COP (COrrelations by Positional artifacts), to detect these artifacts in microarray experiments. COP has been integrated with the microarray data normalization tool, ExpressYourself, which is available at http://bioinfo.mbb.yale.edu/ExpressYourself/. Together, the two can eliminate most of the common noise in microarray data.

1.b.ii. Royce et al. (2007) -- Correction for ubiquitous background cross-hybridization on arrays

We have found that background cross-hybridization can be modeled fairly well as a function of probe sequence composition and subsequently identified that array intensities are highly sequence dependent and can greatly influence downstream results. We therefore developed three metrics for assessing this sequence dependence and have used them in evaluating strategies that aim to remove these biases. We applied three techniques for addressing this problem; one method, adapted from similar work on GeneChip brand microarrays, is based on modeling array signal as a linear function of probe sequence, the second method extends this approach by iterative weighting and re-fitting of the model, and the third technique extrapolates the popular quantile normalization algorithm for between-array normalization to probe sequence space. All methods reduce background cross-hybridization but we have found that the quantile-based method yields favorable results.
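The third technique can be sketched as follows. This is one plausible reading of quantile normalization in probe sequence space, not the published procedure: probes are grouped by G+C count (an assumed stand-in for "sequence space"), and each group's intensity distribution is mapped onto the pooled distribution of all probes. The function and variable names are hypothetical.

```python
from collections import defaultdict


def gc_count(seq):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in seq.upper())


def quantile_normalize_by_gc(probes):
    """Map each G+C bin's intensities onto the pooled intensity distribution.

    probes: list of (sequence, intensity) pairs.
    Returns the normalized intensities in input order.
    """
    pooled = sorted(intensity for _, intensity in probes)
    bins = defaultdict(list)
    for index, (seq, _) in enumerate(probes):
        bins[gc_count(seq)].append(index)

    normalized = [0.0] * len(probes)
    for indices in bins.values():
        # Rank probes within the bin by their raw intensity.
        ranked = sorted(indices, key=lambda i: probes[i][1])
        for rank, i in enumerate(ranked):
            # Mid-rank quantile within the bin, looked up in the pooled values.
            q = (rank + 0.5) / len(ranked)
            normalized[i] = pooled[min(int(q * len(pooled)), len(pooled) - 1)]
    return normalized


# Toy data: GC-rich probes are systematically brighter than AT-rich ones.
probes = [("AAAA", 1.0), ("AATT", 2.0), ("GGCC", 10.0), ("GCGC", 20.0)]
print(quantile_normalize_by_gc(probes))
```

After normalization, the AT-rich and GC-rich groups in the toy data share the same set of values, so the systematic intensity difference attributable to composition is removed while the ranking within each group is preserved.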

2. Development of Informatics Approaches for Better Characterizing the Active Regions Found in Tiling Array Experiments

2.a. Zhang et al. (2007) -- Better Characterizing TARs from Array Experiments

We have worked on approaches for better characterizing transcriptionally active regions in the genome.

Widespread transcriptional activity in the human genome was recently observed in high-resolution tiling array experiments, which revealed many novel transcripts that lie outside the boundaries of known protein or RNA genes. Termed "TARs" (Transcriptionally Active Regions), these novel transcribed regions represent "dark matter" in the genome, and their origin and functionality need to be explained. Many of these transcripts are thought to code for novel proteins or non-protein-coding RNAs. We have applied an integrated bioinformatics approach to investigate the properties of these TARs, including cross-species conservation and the ability to form stable secondary structures. The goal of this study is to identify a list of candidate sequences that are likely to code for functional non-protein-coding RNAs. We are particularly interested in discovering functional RNA candidates that are primate-specific, i.e. those that have homologs in rhesus but not in the mouse or dog genomes. In particular, using sequence conservation and the probability of forming stable secondary structures, we have identified ~300 possible candidates for primate-specific noncoding RNAs.

2.b. Yu et al. (2006) -- Developing Approaches for Analyzing Regulatory Networks Derived from ChIP-chip experiments (Partially funded by CEGS)

ChIP-chip experiments give rise to regulatory networks. We have developed approaches for conceptualizing these networks in terms of hierarchies. In particular, the relationships between TFs and their target genes can be modeled in terms of directed regulatory networks. These, in turn, can be readily compared to commonplace "chain of command" structures in social networks, which have a characteristic hierarchical layout. To better understand the structure of these regulatory networks, we have developed an algorithmic approach for identifying generalized hierarchies (allowing for various regulatory loops) and used this to show that clear pyramid-shaped hierarchical structures exist in the regulatory networks of representative prokaryotes and eukaryotes, with most TFs at the bottom levels and only a few master TFs at the top. These masters receive most of the input signals of the whole network through interactions with other proteins. They are situated near the center of the protein-protein interaction network, a different type of network from the regulatory one. The master TFs have maximal influence over other genes. Surprisingly, however, TFs at the bottom of the regulatory hierarchy are more essential to the viability of the cell. Moreover, one might think that master TFs achieve their wide influence by directly regulating many targets, but in fact the TFs with the most direct targets are in the middle of the hierarchy. We find that these middle-level TFs act as bottlenecks of the hierarchy. This large amount of control for "middle managers" has parallels in structures found to be efficient in various corporate and governmental settings.
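One simple way to assign hierarchy levels in a directed TF network that may contain loops is a breadth-first search upward from the bottom. The sketch below is an illustrative assumption, not necessarily the published algorithm: TFs that regulate no other TF form level 1, and every other TF is placed one level above the lowest TF it regulates. The network and TF names are hypothetical.

```python
from collections import deque


def hierarchy_levels(regulates):
    """Assign a level to each TF; level 1 is the bottom of the hierarchy.

    regulates: dict mapping each TF to the set of TFs it regulates
    (non-TF target genes are omitted). Autoregulation is ignored.
    TFs that cannot reach the bottom (pure loops) are left unassigned.
    """
    tfs = set(regulates) | {t for targets in regulates.values() for t in targets}
    regulators = {tf: set() for tf in tfs}
    for tf, targets in regulates.items():
        for target in targets:
            if target != tf:
                regulators[target].add(tf)

    # Bottom level: TFs with no TF targets other than themselves.
    bottom = sorted(tf for tf in tfs if not (set(regulates.get(tf, ())) - {tf}))
    levels = {tf: 1 for tf in bottom}
    queue = deque(bottom)
    while queue:
        tf = queue.popleft()
        for regulator in sorted(regulators[tf]):
            if regulator not in levels:
                levels[regulator] = levels[tf] + 1
                queue.append(regulator)
    return levels


# Hypothetical toy network: M is a master TF, A and B are mid-level TFs.
network = {
    "M": {"A", "B"},
    "A": {"X", "Y"},
    "B": {"Y", "A"},
    "X": set(),
    "Y": {"Y"},  # autoregulation only, so Y still counts as bottom-level
}
levels = hierarchy_levels(network)
print(sorted(levels.items()))
```

Because each TF is assigned a level the first time the search reaches it, loops among TFs do not cause repeated visits; counting the TFs per level then gives the pyramid shape described above.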

3. Developing Approaches for Annotation of the Intergenic Regions of the Human Genome -- Pseudogene Annotation

Last year, in the area of pseudogene analysis, we focused on comparing existing pseudogene identification methods from different research groups and on developing tools and a database for integrating pseudogene annotations from these methods. We also started to explore the biological implications of pseudogene transcription and the structural and functional implications of pseudogenes in the human genome.

3.a. Pseudogene.org Database (Karro et al., 2007)

We have developed the pseudogene.org knowledgebase as a comprehensive repository for pseudogene annotation; details of our work were described in Karro et al. (2007). Unlike protein-coding genes or other functional genes, which can often be experimentally tested, pseudogenes are a type of genomic component whose identification depends almost exclusively on computational analysis. However, the definition of a pseudogene varies within the literature, resulting in significantly different approaches to the problem of identification. Besides the challenge of putting data from different approaches into a common reference frame, it is also difficult to maintain a consistent collection of pseudogenes in the detail necessary for their effective use. To address these issues, we designed and developed the pseudogene.org database. This comprehensive database integrates a variety of heterogeneous resources and supports a subset structure that highlights specific groups of pseudogenes that are of interest to the research community. Tools are provided for the comparison of sets and the creation of layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. Additional features include versatile search, the capacity for robust interaction with other databases, the ability to reconstruct older versions of the database (accounting for changing genome builds), and an underlying object-oriented interface designed for researchers with minimal knowledge of programming. In addition to human pseudogenes, we have also started to apply our computational pipeline to identify pseudogenes in the chimp, mouse, rat, other eukaryotic genomes, and many prokaryotic genomes. At present, the database contains more than 100,000 pseudogenes spanning 64 prokaryote and 11 eukaryote genomes, including a collection of human annotations compiled from 16 sources.

3.b. Article in Scientific American on Pseudogenes (Gerstein & Zheng, 2006)

In 2005, we discovered that a significant percentage of annotated human pseudogenes showed various degrees of transcriptional activity. In 2006, we started to conceptualize the biological implications of this pseudogene transcription, and we also began to explore the structural and functional implications of pseudogenes in the human genome. The major concepts were described in a 2006 Scientific American article, "The Real Life of Pseudogenes", by Mark Gerstein and Deyou Zheng. As the article was intended for the general public, we gave a broad overview of current efforts and knowledge in annotating and characterizing pseudogenes. We also laid out the status and problems of pseudogene annotation, and summarized the potential application of pseudogenes in studying gene and genome evolution. Furthermore, we gave our perspective on how pseudogene transcription, and pseudogenes in general, may be important pieces of information that can help us decode the human genome. In particular, we proposed that transcribed pseudogenes could be a novel source of non-coding RNAs, while pointing out that some pseudogene transcription might be a result of stochastic biological activity. Given that current pseudogene identification focuses on "recent" pseudogenes, we argued that improved computational methods might discover old but persistent pseudogenes, which could shed light on new aspects of how pseudogenes affect the evolution, structure, and function of the human genome and other mammalian genomes.

III) Significance

We believe all of these projects will be high impact. Among those recently published or about to be published, as indicated above, the first de novo genome sequencing of Acinetobacter baumannii is of very high impact from the standpoint of both the technology and the organism that was sequenced. It has been featured in many newspapers and magazines.

V) Publications

MG Smith, TA Gianoulis, S Pukatzki, JJ Mekalanos, LN Ornston, M Gerstein, M Snyder (2007). New insights into Acinetobacter baumannii pathogenesis revealed by high-density pyrosequencing and transposon mutagenesis. Genes Dev 21: 601-14. * Featured in many newspapers and magazines, including the Washington Post.

JO Korbel, AE Urban, F Grubert, J Du, TE Royce, P Starr, G Zhong, BS Emanuel, SM Weissman, M Snyder, MB Gerstein (2007). Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci U S A, in press.

TE Royce, JS Rozowsky, MB Gerstein (2007). Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics 23: 988-97.

Z Zhang, AW Pang, M Gerstein (2007). Comparative analysis of genome tiling array data reveals many novel primate-specific functional RNAs in human. BMC Evol Biol 7 Suppl 1: S14.

H Yu, K Nguyen, T Royce, J Qian, K Nelson, M Snyder, M Gerstein (2007). Positional artifacts in microarrays: experimental verification and construction of COP, an automated detection tool. Nucleic Acids Res 35: e8.

M Gerstein, D Zheng (2006). The real life of pseudogenes. Sci Am 295: 48-55.

JE Karro, Y Yan, D Zheng, Z Zhang, N Carriero, P Cayting, P Harrison, M Gerstein (2007). Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 35: D55-60.

H Yu, M Gerstein (2006). Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A 103: 14724-31.