H) Informatics

Over the past year of the Yale CEGS (year 7), we have developed informatics infrastructure to support tiling array and next-generation sequencing technologies and to relate the results of these technologies to human genome annotation and structural variation.

1) Infrastructure for supporting Next-generation Sequencing

We are developing a web-accessible database for describing samples and associating such samples with Solexa runs. Single or multiple samples can be assigned to each lane of a flowcell. We use Oracle 10g as the database backend. We use the Java Struts framework for developing the Web interface, which has links to the files generated by different programs (e.g., Firecrest, Bustard, and Gerald) in the Solexa pipeline. We plan to expand the Web interface to allow flexible searches based on sample description. In addition, we have a 10 TB file storage system dedicated to the long-term archival of a subset of the Solexa data and will add more storage as needed.
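The sample-to-lane association described above can be sketched as a small in-memory model. This is only an illustration, not the Oracle schema or the Struts code: the class names, field names, the eight-lane flowcell constant and the sample descriptions are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumption: a flowcell exposes eight numbered lanes.
LANES_PER_FLOWCELL = 8


@dataclass
class Sample:
    sample_id: str      # hypothetical identifier
    description: str    # free-text sample description


@dataclass
class Flowcell:
    flowcell_id: str
    # Each lane number maps to the list of samples loaded on that lane.
    lanes: dict = field(
        default_factory=lambda: {n: [] for n in range(1, LANES_PER_FLOWCELL + 1)}
    )

    def assign(self, lane, sample):
        """Attach a sample to a lane; a lane may hold one or several samples."""
        if lane not in self.lanes:
            raise ValueError(f"lane {lane} does not exist on this flowcell")
        self.lanes[lane].append(sample)

    def find(self, keyword):
        """Return (lane, sample) pairs whose description contains the keyword."""
        kw = keyword.lower()
        return [
            (lane, s)
            for lane, samples in sorted(self.lanes.items())
            for s in samples
            if kw in s.description.lower()
        ]


fc = Flowcell("FC1")
fc.assign(1, Sample("S1", "HeLa ChIP-seq"))
fc.assign(1, Sample("S2", "GM12878 RNA"))
fc.assign(2, Sample("S3", "hela input"))
hits = fc.find("hela")
```

The `find` method stands in for the planned description-based search; in a database-backed system the same lookup would presumably be a query rather than a scan.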

We have also begun work on simulating and scoring high-throughput sequencing experiments and expect results in the coming years.

2) Infrastructure for supporting Tiling Array Technology

Although we are transitioning to sequencing technology, we have been wrapping up a number of the tiling array projects started during the first phase of the CEGS. Furthermore, in transitioning to sequencing, we are documenting the fundamental limitations of tiling array technology.

2.a) Royce et al. (2007a): Prediction of gene expression through nearest-neighbor probe sequence identification -- toward a universal microarray

A generic DNA microarray design applicable to any species would greatly benefit comparative genomics. We have addressed the feasibility of such a design by leveraging the high feature densities and relatively unbiased nature of genomic tiling microarrays. In particular, we exploit semi-specific cross-hybridization, which occurs when a probe has only a few mismatches with an unintended nucleic acid target. After considering ways to detect and remove artificial signal from this type of cross-hybridization, we realized that we might be able to use the phenomenon to our advantage. Specifically, we first divided each Homo sapiens RefSeq-derived gene's spliced nucleotide sequence into all of its possible contiguous 25 nt subsequences. For each of these subsequences, we searched the probe design of a recent human transcript mapping experiment for the 25 nt probe sequence that had the fewest mismatches with the subsequence but did not match it exactly. Signal intensities measured at each gene's nearest-neighbor features were then averaged to predict that gene's expression level in each of the experiment's thirty-three hybridizations. We examined the fidelity of this approach in terms of sensitivity and specificity for detecting actively transcribed genes, transcriptional consistency between exons of the same gene, and reproducibility between tiling array designs. Taken together, our results provide proof-of-principle for probing nucleic acid targets with off-target, nearest-neighbor features.
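The nearest-neighbor procedure can be illustrated with a brute-force sketch. It is a toy under stated assumptions, not the published implementation: the function names are hypothetical, mismatches are counted as Hamming distance, ties go to the first probe encountered, and a linear scan over probes replaces whatever indexed search a real probe design would need.

```python
from statistics import mean

K = 25  # probe and subsequence length, as in the text


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_probe(subseq, probes):
    """Probe with the fewest mismatches to subseq, skipping exact matches.

    Assumption: ties are broken by taking the first probe encountered.
    Returns None if no inexact probe of the right length exists.
    """
    best, best_dist = None, None
    for probe in probes:
        if len(probe) != len(subseq):
            continue
        dist = hamming(subseq, probe)
        if dist == 0:
            continue  # exact matches are deliberately excluded
        if best_dist is None or dist < best_dist:
            best, best_dist = probe, dist
    return best


def predict_expression(transcript, probe_intensity, k=K):
    """Average the intensities of the nearest-neighbor probes of every k-mer."""
    probes = list(probe_intensity)
    signals = []
    for i in range(len(transcript) - k + 1):
        nn = nearest_neighbor_probe(transcript[i:i + k], probes)
        if nn is not None:
            signals.append(probe_intensity[nn])
    return mean(signals) if signals else None


# Toy data with k=4 instead of 25 so the example stays readable.
toy_intensity = {"ACGT": 100.0, "ACGA": 10.0, "CGTT": 20.0, "GTAA": 30.0}
toy_prediction = predict_expression("ACGTAC", toy_intensity, k=4)
```

In the toy call, the probe "ACGT" matches the first subsequence exactly and is therefore ignored for it, so its high intensity does not contribute through that subsequence.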

2.b) Royce et al. (2007b): Developing an efficient pseudomedian filter for tiling microarrays

Tiling array experiments are being conducted at ever finer resolutions as microarray feature densities increase, and the higher densities naturally increase data analysis requirements. Specifically, the most widely employed algorithm for tiling array analysis smooths observed signals by computing pseudomedians within sliding windows, an O(n² log n) calculation in each window of n values. This poor time complexity could become a real bottleneck as tiling microarray experiments grow in scope and resolution. We therefore implemented Monahan's HLQEST algorithm, which reduces the time complexity of computing the pseudomedian of n numbers from O(n² log n) to O(n log n). For a representative tiling microarray dataset, this modification reduced the smoothing procedure's runtime by nearly 90%. We then exploited the fact that overlapping windows share most of their elements (as one slides across genomic space) to reduce computation by an additional 43%. We achieved this by using skip lists to maintain a sorted list of values from window to window, with simple O(log n) inserts and deletes. We illustrate the favorable scaling properties of our algorithms with both time complexity analysis and benchmarking on synthetic datasets. Tiling microarray analyses that rely on a sliding-window pseudomedian calculation can require many hours of computation; our implementation eases this requirement significantly and scales well with genomic feature density. This result not only speeds the current standard analyses but also makes feasible analyses that require many iterations of the filter, such as bootstrap or parameter estimation procedures.
We make source code and executables available at http://tiling.gersteinlab.org/pseudomedian/
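
For orientation, the sketch below shows the baseline that the work above improves on, together with the idea of carrying a sorted window from one position to the next. It is not the distributed code: the pseudomedian here is the naive median of all pairwise averages rather than Monahan's HLQEST, and a plain Python list with `bisect` stands in for the skip list, so positions are located in O(log n) but the list insert and delete themselves cost O(n). Function names are hypothetical.

```python
import bisect
from statistics import median


def pseudomedian(values):
    """Naive pseudomedian: median of all pairwise averages (pairs with i <= j).

    Builds n(n+1)/2 averages and takes their median, which is the slow
    baseline; HLQEST avoids materializing this list.
    """
    vals = list(values)
    n = len(vals)
    walsh = [(vals[i] + vals[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def sliding_pseudomedian(signal, window):
    """Pseudomedian of every full window, updating a sorted window in place."""
    if window < 1 or window > len(signal):
        return []
    sorted_win = sorted(signal[:window])
    out = [pseudomedian(sorted_win)]
    for i in range(window, len(signal)):
        # Drop the value leaving the window, then insert the one entering it.
        leaving = bisect.bisect_left(sorted_win, signal[i - window])
        sorted_win.pop(leaving)
        bisect.insort(sorted_win, signal[i])
        out.append(pseudomedian(sorted_win))
    return out


smoothed = sliding_pseudomedian([1, 2, 3, 4], window=3)
```

The naive `pseudomedian` does not benefit from the window being sorted; the sorted window is kept only to show the incremental update, which pays off once a faster routine that consumes sorted input is substituted.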

3) Work relating the results of high-throughput sequencing to human genome variation

One of our main planned uses of next-generation sequencing is characterizing human genome variation, principally structural variation. Two of the most interesting quantities related to variation are the degree of co-evolution and the degree of adaptive evolution (positive selection). We have been developing approaches for relating structural variation to these quantities and to other aspects of human genome annotation (e.g., gene family annotation).

3.a) Kim et al. (2007): Measuring positive selection at the network periphery

We examined the relationship between genetic signatures of adaptive evolution and network topology. We found a striking tendency for proteins that have been under positive selection (as compared with the chimpanzee) to be located at the periphery of the interaction network. Our results are based on the analysis of two types of genome variation, each considered at both the intraspecies and interspecies levels. First, we examined single-nucleotide polymorphisms and their fixed counterparts, single-nucleotide differences in the human genome relative to the chimpanzee. Second, we examined fixed structural variants, specifically large segmental duplications, and their polymorphic precursors, known as copy number variants. We propose two complementary mechanisms that lead to the observed trends. First, the trends can be rationalized in terms of constraints imposed by protein structure: positively selected sites are preferentially located on the exposed surface of proteins, and because central network proteins (hubs) are likely to have a larger fraction of their surface involved in interactions, they tend to be constrained and under negative selection. Second, we show that the interaction network roughly maps to cellular organization, with the periphery of the network corresponding to the cellular periphery (i.e., extracellular space or cell membrane). This suggests that the observed positive selection at the network periphery may be due to an increase in adaptive events at the cellular periphery in response to changing environments.

3.b) Zhang et al. (2008): Measuring positive selection in a particular human gene (CD4)

We conducted a targeted study to understand positive selection in detail in one particularly important human protein. CD4, an integral membrane glycoprotein, plays a critical role in the immune response and in the life cycle of simian and human immunodeficiency viruses (SIV and HIV). Pairwise comparisons of orthologous human and mouse genes show that CD4 is evolving much faster than the majority of mammalian genes. The acceleration is too great to be attributed to a relaxation of purifying selection alone. We show that the selective pressure acting on CD4 varies widely between regions of the protein, and we identify codon sites under strong positive selection.

3.c) Yip et al. (2008): An integrated system for studying coevolution

Coevolution is an increasingly important concept in the study of genome variation, particularly as the quantity of available sequence data grows. Although many different functions for quantifying it have been proposed, little is known about their relative strengths and weaknesses, and subtle algorithmic details have discouraged researchers from implementing and comparing them. We addressed this issue by developing an integrated online system that enables comparative analyses with a comprehensive set of commonly used scoring functions, including Statistical Coupling Analysis (SCA), Explicit Likelihood of Subset Covariation (ELSC), mutual information and correlation-based methods. A set of data preprocessing options is provided for improving the sensitivity and specificity of coevolution signal detection, including sequence weighting, residue grouping and the filtering of sequences, sites and site pairs. In total, more than 100 scoring variations are available. The system is available at http://coevolution.gersteinlab.org. The source code and JavaDoc API can also be downloaded from the web site.
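
As an illustration of one of the scoring functions named above, the sketch below computes plain mutual information between two columns of a multiple sequence alignment. It is not the system's code (which is distributed as Java): the function names are hypothetical, and the sketch assumes equal-length aligned sequences with no sequence weighting, residue grouping, gap handling or filtering.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Residues at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Frequencies are raw counts over sequences; every sequence has weight 1.
    """
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Toy alignment: the two sites vary together perfectly.
toy_alignment = ["AC", "AC", "GT", "GT"]
score = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
```

Two sites that always change together score high, while sites that vary independently score near zero; the preprocessing options listed above exist largely because raw scores of this kind are sensitive to redundant sequences and to highly conserved or highly variable sites.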

3.d) Korbel et al. (2008): How structural variation relates to gene duplications and families

Copy number variants (CNVs) are a particular type of genome structural variation. Although it is not immediately evident, CNV surveys make a conceptual connection between the fields of population genetics and protein families, in particular with regard to the stability and expandability of families. The mechanisms giving rise to CNVs can be considered fundamental processes underlying gene duplication and loss: duplicated genes are the result of 'successful' copies, fixed and maintained in the population, whereas many 'unsuccessful' duplicates remain in the genome as pseudogenes. We survey studies on CNVs, highlighting issues related to protein families. In particular, we find that CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and are depleted in genes related to basic cellular processes. Furthermore, we find that CNVs occur more often at the periphery of the protein interaction network. In comparison, protein families linked to successful and unsuccessful duplicates fall into similar functional categories but are differentially placed in the interaction network. These trends likely reflect CNV formation biases and natural selection, both of which differentially influence distinct protein families.

4) Relating the results of tiling array analyses to genome annotation

We have continued work started in the first phase of the CEGS relating the results of tiling array experiments to genome annotation.

4.a) Zheng & Gerstein (2007): The ambiguous boundary between genes and pseudogenes

Pseudogenes have long been considered 'dead', nonfunctional by-products of genome evolution. However, several lines of evidence, many of them from tiling array assays of transcription and binding, now show that some pseudogenes are transcriptionally 'alive', and a few might even have biochemical roles. Therefore, the boundary between genes (often considered 'living') and pseudogenes (often considered 'dead') might be ambiguous and difficult to define. We examine the evidence for and against pseudogene functionality and argue that the time is ripe for revising the definition of a pseudogene. Furthermore, we suggest a classification system to accommodate pseudogenes with various levels of functionality.

Primary Research (Informatics)

PM Kim, JO Korbel, MB Gerstein (2007). Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context. Proc Natl Acad Sci U S A 104: 20274-9.

ZD Zhang, G Weinstock, M Gerstein (2008). Rapid evolution by positive Darwinian selection in T-cell antigen CD4 in primates. J Mol Evol 66(5): 446-56. Epub 2008 Apr 15. PMID: 18414925

KY Yip, P Patel, PM Kim, DM Engelman, D McDermott, M Gerstein (2008). An integrated system for studying residue coevolution in proteins. Bioinformatics 24: 290-2.

TE Royce, JS Rozowsky, MB Gerstein (2007a). Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic Acids Res 35: e99.

TE Royce, NJ Carriero, MB Gerstein (2007b). An efficient pseudomedian filter for tiling microrrays. BMC Bioinformatics 8: 186.

D Zheng, MB Gerstein (2007). The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends Genet 23: 219-24.

Reviews

JO Korbel, PM Kim, X Chen, AE Urban, S Weissman, M Snyder, MB Gerstein (2008). The current excitement about copy-number variation: how it relates to gene duplications and protein families. Curr Opin Struct Biol. Epub 2008 May 27. PMID: 18511261

VI. Project Generated Resources

b) All of the technology developed and information acquired under the CEGS grant will be valuable to many researchers. We have already received numerous requests about our paired-end sequencing and RNA sequencing.

c) We have created the website tiling.gersteinlab.org.