Yale CEGS: Integrated Technologies for Analyzing the Human Genome

PI: Michael Snyder; coPIs: Mark Gerstein, Sherman Weissman and Perry Miller

We have continued developing technologies for analyzing the human genome, focusing on (A) array-based and (B) sequencing-based approaches.

A1) We have developed a novel method for mapping the 3' ends of messages, in which the 3' ends are selectively used to probe genomic tiling microarrays. Using this technology we have identified over twice as many 3' ends as were previously known in the ENCODE regions.

A2) We have developed array-based approaches for measuring structural variation (SV) in the human genome. SV, which involves large, kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements, is widespread in the genomes of healthy individuals and is presumably responsible for a considerable amount of phenotypic variation.

A3) We have developed methods for optimizing a variety of array-based technologies (e.g., for measuring SV, transcription, or DNA binding) and for measuring their limitations. In particular, we have developed methods and tools for analyzing DNA microarray cross-hybridization.

A4) We have developed methods for optimally annotating the human genome based on the results of array-based technologies.

B1) We have also developed methods for ChIP-sequencing-based technologies. These include methods for scoring and analyzing data. We are currently investigating how to integrate array-based and sequencing-based technologies.

B2) Genome sequencing is expensive and time-consuming. We investigated whether 454 technology can be used for de novo genome sequencing by determining the DNA sequence of Acinetobacter baumannii, an important human pathogen. We found that, in conjunction with conventional gap filling, the genome could be sequenced with 99.92% accuracy, with only 30 split genes out of ~4000 total. Thus, 454 technology is suitable for cost-effective de novo genome sequencing.

B3) Finally, we have developed new sequencing-based methods for analyzing structural variation (SV) in the genome. To date, cost-effective methods for large-scale analysis of SV have identified variants on the order of 50 kb or greater, but do not detect smaller SVs or precisely identify SV boundaries (i.e., breakpoints). Further, most studies performed thus far do not detect inversions. We developed a novel approach, high-resolution and massive Paired-End Mapping (PEM), to identify SVs 3 kb or larger. PEM combines rescue and capture of the paired ends of 3 kb fragments, 454 Sequencing, and a computational approach that maps massive sets of DNA reads onto the human reference genome. The resolution of this method is considerably higher than that of previous approaches.

PEM was used to map SVs in two individuals. An extensive dataset of over 21 million paired ends (with fragments covering the diploid genome at 4.3-fold) was sequenced to comprehensively map structural variation in an African individual, revealing ~900 SVs relative to the reference genome. Analysis of a European individual, for whom 10 million paired ends (2.1X coverage) were sequenced, enabled an initial comparison between two individuals from different population backgrounds: ~450 SVs were detected, approximately half (47%) of which are shared with the African individual. ~300 genes were found to be affected by SV, 40 with altered gene (exon) structure.

To systematically analyze SV breakpoints, we devised a pooling and shotgun library approach that revealed the breakpoint junction sequences of 120 SVs, ~10 times more than previously reported. The breakpoint junction sequences were inferred for an additional 85 SVs. Systematic analysis of breakpoint sequences revealed specific classes of SVs and mechanisms of SV formation. The major mechanisms of SV formation were nonhomologous end joining (NHEJ; 50%) and retrotransposition of LINE elements (30%), although other retrotransposition events were also found, including that of an endogenous retrovirus. Nonallelic homologous recombination (NAHR) between repetitive elements and high-complexity DNA was also observed, including one instance of a gene fusion. Although SVs often reside in genomic regions with segmental duplications, NAHR was found at a surprisingly low frequency even among the larger SVs. Overall, we found an unprecedented amount of SV, fine-mapped ~1300 SVs, and provided important insights into the mechanisms by which this type of genomic variation has arisen.
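
The general idea behind paired-end mapping can be illustrated with a minimal sketch: once both ends of a ~3 kb fragment are placed on the reference, a pair whose mapped span or orientation disagrees with expectation hints at an SV. The code below is an illustration only, not the pipeline used in this work. The PairedEnd record, the classify_pair function, the 1 kb tolerance, and the assumed "+"/"-" orientation of a concordant pair are all hypothetical choices made for the example.

```python
from collections import Counter
from dataclasses import dataclass

EXPECTED_SPAN = 3000  # fragment size stated in the text (3 kb)
TOLERANCE = 1000      # assumed cutoff for illustration, not from the source


@dataclass(frozen=True)
class PairedEnd:
    """Hypothetical record for one mapped pair of fragment ends."""
    chrom: str
    left_pos: int      # reference coordinate of the leftmost end
    right_pos: int     # reference coordinate of the rightmost end
    left_strand: str   # "+" or "-"
    right_strand: str  # "+" or "-"


def classify_pair(pair, expected=EXPECTED_SPAN, tol=TOLERANCE):
    """Label one pair by comparing its reference span and orientation
    with what an unrearranged fragment would produce."""
    # Assumption: a concordant pair maps with opposite strands.
    if pair.left_strand == pair.right_strand:
        return "inversion"
    span = pair.right_pos - pair.left_pos
    if span > expected + tol:
        # Ends lie farther apart on the reference than in the sample,
        # so the sample is likely missing sequence.
        return "deletion"
    if span < expected - tol:
        # Ends lie closer together on the reference than in the sample,
        # so the sample likely carries extra sequence.
        return "insertion"
    return "concordant"


pairs = [
    PairedEnd("chr1", 10_000, 13_050, "+", "-"),  # span 3050
    PairedEnd("chr1", 50_000, 58_200, "+", "-"),  # span 8200
    PairedEnd("chr2", 70_000, 71_100, "+", "-"),  # span 1100
    PairedEnd("chr3", 90_000, 93_000, "+", "+"),  # same strand
]

labels = [classify_pair(p) for p in pairs]
counts = Counter(labels)
print(labels)
```

In practice a single discordant pair is weak evidence. A real analysis would presumably require several independent pairs supporting the same event before calling an SV, and would estimate the expected span distribution from the data rather than fixing it.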