RNA Analyses and RNA-Seq


b) Invention of RNA-Seq and Analysis of the Yeast Genome

The identification of untranslated regions, introns, and coding regions within an organism remains challenging. We developed a quantitative sequencing-based method called RNA-Seq for mapping transcribed regions, in which complementary DNA fragments are subjected to high-throughput sequencing and mapped to the genome. We applied RNA-Seq to generate a high-resolution transcriptome map of the yeast genome and demonstrated that most (74.5%) of the nonrepetitive sequence of the yeast genome is transcribed. We confirmed many known and predicted introns and demonstrated that others are not actively used. Alternative initiation codons and upstream open reading frames were also identified for many yeast genes. We also found unexpected 3'-end heterogeneity and the presence of many overlapping genes. These results indicate that the yeast transcriptome is more complex than previously appreciated. We further demonstrated that RNA-Seq is highly accurate for measuring RNA levels, much more so than DNA microarrays.
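
As a rough illustration of how a "fraction transcribed" figure can be derived from mapped reads, the sketch below merges overlapping read intervals and divides the covered length by the sequence length. The function name, the toy intervals, and the 100-bp sequence length are hypothetical, and this is not the pipeline used in the study.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of positions covered by at least one interval.

    intervals: iterable of (start, end) half-open coordinates on one sequence.
    genome_length: length of that sequence (hypothetical value in the example).
    """
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running block: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Toy example: three mapped reads on a 100-bp sequence, two of which overlap.
reads = [(0, 30), (20, 50), (70, 80)]
fraction = covered_fraction(reads, 100)
```

In practice the intervals would come from an aligner's output, and repetitive regions would be masked before computing the fraction.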


c) Development of Paired End RNA-Seq and Transcript Isoform Analysis

RNA-Seq helps define transcribed regions and some intron-exon structure, but its ability to resolve the latter is limited. To broaden our capabilities, we developed paired-end RNA-Seq, which, along with long-read RNA-Seq, provides more information on isoform structure. These methods were applied to the analysis of hESCs and their differentiation into neurons. Undifferentiated hESCs as well as cells at three stages of early neural differentiation, N1 (early initiation), N2 (neural progenitor), and N3 (early glial-like), were analyzed using a combination of single-read, paired-end read, and long-read RNA sequencing. The results revealed enormous complexity in gene transcription and splicing dynamics during neural cell differentiation. We found previously unannotated transcripts and spliced isoforms specific for each stage of differentiation. Interestingly, splicing isoform diversity is highest in undifferentiated hESCs and decreases upon differentiation, a phenomenon we call "isoform specialization". During neural differentiation, we observed differential expression of many types of genes, including those involved in key signaling pathways, and a large number of extracellular receptors exhibit stage-specific regulation. These results provide a valuable resource for studying neural differentiation and reveal insights into the mechanisms underlying in vitro neural differentiation of hESCs, such as neural fate specification, neural progenitor cell identity maintenance, and the transition from a predominantly neuronal state into one with increased gliogenic potential.



Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs. We have now applied this methodology to a number of organisms (e.g. C. elegans) and have helped many labs with this approach.
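
The core idea of multiplex barcoding can be sketched as follows: each sample's library carries a short index at the start of every read, and reads from a shared lane are sorted by that index after sequencing. The barcode sequences, the function name, and the leading-barcode layout below are assumptions made for illustration; they are not the barcodes or software used in the study.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode and trim the barcode.

    reads: iterable of read sequences (strings).
    barcodes: dict mapping barcode sequence -> sample name; all barcodes
              are assumed to have the same length.
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("barcodes must share one length")
    k = lengths.pop()
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:k], "unassigned")
        # Reads with an unknown prefix are kept untrimmed for inspection.
        bins[sample].append(read[k:] if sample != "unassigned" else read)
    return dict(bins)


# Hypothetical 4-nt barcodes for two of the samples pooled in one lane.
example_barcodes = {"ACGT": "Ste12", "TGCA": "Cse4"}
example_reads = ["ACGTTTTTGGGG", "TGCAAAAACCCC", "NNNNACGTACGT"]
by_sample = demultiplex(example_reads, example_barcodes)
```

This sketch uses exact matching only; a production tool would typically tolerate sequencing errors in the barcode.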




Analysis of Human Variation


a) Development of Paired End Mapping

Structural variation of the genome involves kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements. We introduce high-throughput and massive paired-end mapping (PEM), a large-scale genome-sequencing method to identify structural variants (SVs) approximately 3 kilobases (kb) or larger that combines the rescue and capture of paired ends of 3-kb fragments, massive 454 sequencing, and a computational approach to map DNA reads onto a reference genome. PEM was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome. Overall, we fine-mapped more than 1000 SVs and documented that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function. The breakpoint junction sequences of more than 200 SVs were determined with a novel pooling strategy and computational analysis. Our analysis provided insights into the mechanisms of SV formation in humans.
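
The logic behind paired-end mapping can be illustrated with a minimal sketch: the two ends of a fragment of known size are mapped to the reference, and pairs whose mapped span or orientation deviates from expectation point to a candidate SV. The function name, the tolerance, and the single-pair decision rule are assumptions for illustration; an actual analysis would derive cutoffs from the observed span distribution and require several supporting pairs.

```python
def classify_pair(left_pos, right_pos, orientation_ok,
                  expected_span=3000, tolerance=1000):
    """Classify one mapped read pair against the reference genome.

    left_pos, right_pos: reference coordinates of the two ends (left < right).
    orientation_ok: True if the ends map in the orientation expected
                    for the library type.
    expected_span, tolerance: hypothetical values for a ~3-kb library.
    """
    span = right_pos - left_pos
    if not orientation_ok:
        # Unexpected relative orientation suggests an inversion.
        return "inversion_candidate"
    if span > expected_span + tolerance:
        # Ends lie farther apart on the reference than in the sample:
        # the sample is missing sequence between them.
        return "deletion_candidate"
    if span < expected_span - tolerance:
        # Ends lie closer on the reference than in the sample:
        # the sample carries extra sequence between them.
        return "insertion_candidate"
    return "concordant"


calls = [
    classify_pair(1000, 4100, True),   # span 3100
    classify_pair(1000, 9000, True),   # span 8000
    classify_pair(1000, 1500, True),   # span 500
    classify_pair(1000, 4000, False),  # wrong orientation
]
```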


b) Variation in Transcription Factor Binding Among Humans

Most genetic variation lies outside the coding sequences of known genes, and its effect on genome function is not known. To understand its role in transcription factor binding, we examined genome-wide differences in transcription factor (TF) binding in several humans and a single chimpanzee using ChIP-Seq and correlated these differences with genetic changes. The binding sites of RNA polymerase II (Pol II) and a key regulator of immune responses, nuclear factor kappaB (p65), were mapped in 10 lymphoblastoid cell lines, and 25% and 7.5% of the respective binding regions were found to differ between individuals. Binding differences were frequently associated with single-nucleotide polymorphisms and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of binding variation. Furthermore, comparing Pol II binding between humans and chimpanzee suggests extensive divergence in TF binding. Our results indicate that many differences between individuals and species occur at the level of TF binding, and they provide insight into the genetic events responsible for these differences.






Since the last progress report, we have developed tools for the analysis of structural variants, approaches for comparing sequencing and tiling arrays, and approaches for analyzing the hierarchical structure of regulatory networks.


a) Structural Variants (Lam et al., '10)


Structural variants (SVs) are a major source of human genomic variation; however, characterizing them at nucleotide resolution remains challenging. We assemble a library of breakpoints at nucleotide resolution from collating and standardizing ~2,000 published SVs. For each breakpoint, we infer its ancestral state (through comparison to primate genomes) and its mechanism of formation (e.g., nonallelic homologous recombination, NAHR). We characterize breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties, finding that the occurrence of insertions and deletions is more balanced than previously reported and that NAHR-formed breakpoints are associated with relatively rigid, stable DNA helices. Finally, we demonstrate an approach, BreakSeq, for scanning the reads from short-read sequenced genomes against our breakpoint library to accurately identify previously overlooked SVs, which we then validate by PCR. As new data become available, we expect our BreakSeq approach will become more sensitive and facilitate rapid SV genotyping of personal genomes.
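
The idea of scanning reads against a breakpoint library can be sketched as below: for a known deletion, a junction sequence is built by joining the flanks on either side of the deleted segment, and a read supports the variant if it matches the junction while crossing the breakpoint. The function names, exact-match rule, flank length, and minimum overhang are assumptions for illustration and do not reproduce the BreakSeq implementation.

```python
def junction_for_deletion(reference, del_start, del_end, flank):
    """Junction sequence for a deletion of reference[del_start:del_end].

    Returns (junction, breakpoint), where breakpoint is the index in the
    junction at which the right flank begins.
    """
    left = reference[max(0, del_start - flank):del_start]
    right = reference[del_end:del_end + flank]
    return left + right, len(left)


def read_supports_junction(read, junction, breakpoint, min_overhang=3):
    """True if the read matches the junction exactly and crosses the
    breakpoint with at least min_overhang bases on each side."""
    pos = junction.find(read)
    while pos != -1:
        if pos + min_overhang <= breakpoint <= pos + len(read) - min_overhang:
            return True
        pos = junction.find(read, pos + 1)
    return False


# Toy reference with a 4-bp deletion of "GGGG" at positions 8-11.
reference = "AAAACCCCGGGGTTTTACGTACGT"
junction, bp = junction_for_deletion(reference, 8, 12, flank=6)

spanning = read_supports_junction("CCCTTT", junction, bp)      # crosses the breakpoint
non_spanning = read_supports_junction("AACCCC", junction, bp)  # lies in the left flank only
```

Requiring an overhang on both sides keeps reads that match only one flank, and could equally come from the reference allele, from being counted as support.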


b) Sequencing vs. Arrays (Sasidharan et al., '09)


There are two main technologies for transcriptome profiling: tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and their promise. Consequently, the question of the moment is how these two technologies compare. We developed an approach for a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data. This comparison is challenging because the sequencing data are discrete while the tiling array data are continuous. We used the published rice and Arabidopsis datasets, which currently provide the best-matched sets of array and sequencing experiments and which use a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts: 22% in rice and 66% in Arabidopsis. However, a more detailed analysis shows that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and look at their intensity distribution, we see that it is very similar to that of exons. Furthermore, restricting the comparison to protein-coding gene loci reveals a very good overlap between the two technologies. Our approach to comparing genome tiling microarray and MPSS sequencing data therefore suggests that there is actually a reasonable overlap in the transcripts identified by the two technologies, and that the apparent overlap is distorted by the scoring and thresholding used in the tiling array scoring procedure.


c) Regulatory Network Structure Analysis (Bhardwaj et al., '10)


Gene regulatory networks have been shown to share some common aspects with commonplace governance structures. Thus, we can get some intuition into their organization by arranging them into well-known hierarchical layouts. These hierarchies, in turn, can be placed between the extremes of autocracies, with well-defined levels and clear chains of command, and democracies, without such defined levels and with more co-regulatory partnerships between regulators. In general, the presence of partnerships decreases the variation in information flow amongst nodes within a level, distributing stress more evenly. We study various regulatory networks for five diverse species, ranging from Escherichia coli to human. We specify three levels of regulators (top, middle, and bottom), which collectively govern the non-regulator targets lying in the lowest, fourth level. We define quantities for nodes, levels, and entire networks that measure their degree of collaboration and autocratic vs. democratic character. We show that individual regulators have a range of partnership tendencies: some regulate their targets in combination with other regulators in local instantiations of democratic structure, whereas others regulate mostly in isolation, in a more autocratic fashion. Overall, we show that in all networks studied the middle level has the highest collaborative propensity and that co-regulatory partnerships occur most frequently amongst mid-level regulators. There is, however, one notable difference between networks in different species: the amount of collaborative regulation and democratic character increases markedly with overall genomic complexity.
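
One simple way to put a number on a regulator's partnership tendency is the fraction of its targets that are also controlled by at least one other regulator. The sketch below computes that quantity from a list of regulator-target edges. The function name, the toy network, and this particular measure are assumptions for illustration and are not necessarily the definition of degree of collaboration used in the study.

```python
from collections import defaultdict


def collaboration_fraction(edges):
    """For each regulator, the fraction of its targets shared with
    at least one other regulator.

    edges: iterable of (regulator, target) pairs.
    """
    regulators_of = defaultdict(set)
    targets_of = defaultdict(set)
    for regulator, target in edges:
        regulators_of[target].add(regulator)
        targets_of[regulator].add(target)
    return {
        regulator: sum(1 for t in targets if len(regulators_of[t]) > 1) / len(targets)
        for regulator, targets in targets_of.items()
    }


# Toy network: A and B co-regulate y; C regulates z alone.
toy_edges = [("A", "x"), ("A", "y"), ("B", "y"), ("C", "z")]
scores = collaboration_fraction(toy_edges)
```

A value near 1 corresponds to the "democratic" end of the spectrum described above, and a value near 0 to a regulator acting mostly in isolation.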




Camarena L, Bruno V, Euskirchen G, Poggio S, Snyder M. Molecular mechanisms of ethanol-induced pathogenesis revealed by RNA-sequencing. PLoS Pathog. 2010 Apr 1;6(4):e1000834.


Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. Variation in transcription factor binding among humans. Science. 2010 Apr 9;328(5975):232-5. Epub 2010 Mar 18.


Wu JQ, Habegger L, Noisa P, Szekely A, Qiu C, Hutchison S, Raha D, Egholm M, Lin H, Weissman S, Cui W, Gerstein M, Snyder M. Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing. Proc Natl Acad Sci U S A. 2010 Mar 16;107(11):5254-9. Epub 2010 Mar 1.


Canaan A, Haviv I, Urban AE, Schulz VP, Hartman S, Zhang Z, Palejev D, Deisseroth AB, Lacy J, Snyder M, Gerstein M, Weissman SM. EBNA1 regulates cellular gene expression by binding cellular promoters. Proc Natl Acad Sci U S A. 2009 Dec 29;106(52):22421-6. Epub 2009 Dec 22.


Snyder M, Du J, Gerstein M. Personal genome sequencing: current approaches and challenges. Genes Dev. 2010 Mar 1;24(5):423-31.


Nagalakshmi U, Waern K, Snyder M. RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010 Jan;Chapter 4:Unit 4.11.1-13.


Korbel JO, Tirosh-Wagner T, Urban AE, Chen XN, Kasowski M, Dai L, Grubert F, Erdman C, Gao MC, Lange K, Sobel EM, Barlow GM, Aylsworth AS, Carpenter NJ, Clark RD, Cohen MY, Doran E, Falik-Zaccai T, Lewin SO, Lott IT, McGillivray BC, Moeschler JB, Pettenati MJ, Pueschel SM, Rao KW, Shaffer LG, Shohat M, Van Riper AJ, Warburton D, Weissman S, Gerstein MB, Snyder M, Korenberg JR. The genetic architecture of Down syndrome phenotypes revealed by high-resolution analysis of human segmental trisomies. Proc Natl Acad Sci U S A. 2009 Jul 21;106(29):12031-6. Epub 2009 Jul 13.


HY Lam, XJ Mu, AM Stütz, A Tanzer, PD Cayting, M Snyder, PM Kim, JO Korbel, MB Gerstein (2010) Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol 28: 47-55.


C Cheng, N Bhardwaj, M Gerstein (2009) The relationship between the evolution of microRNA targets and the length of their UTRs. BMC Genomics 10: 431.


Y Xia, EA Franzosa, MB Gerstein (2009) Integrated assessment of genomic correlates of protein evolutionary rate. PLoS Comput Biol 5: e1000413.


M Snyder, S Weissman, M Gerstein (2009) Personal phenotypes to go with personal genomes. Mol Syst Biol 5: 273.


TA Gianoulis, J Raes, PV Patel, R Bjornson, JO Korbel, I Letunic, T Yamada, A Paccanaro, LJ Jensen, M Snyder, P Bork, MB Gerstein (2009) Quantifying environmental adaptation of metabolic pathways in metagenomics. Proc Natl Acad Sci U S A 106: 1374-9.


P Lefranois, GM Euskirchen, RK Auerbach, J Rozowsky, T Gibson, CM Yellman, M Gerstein, M Snyder (2009) Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing. BMC Genomics 10: 37.


X Zhang, Z Lian, C Padden, MB Gerstein, J Rozowsky, M Snyder, TR Gingeras, P Kapranov, SM Weissman, PE Newburger (2009) A myelopoiesis-associated regulatory intergenic noncoding RNA transcript within the human HOXA cluster. Blood 113: 2526-34.


LY Wang, A Abyzov, JO Korbel, M Snyder, M Gerstein (2009) MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res 19: 106-17.


Z Wang, M Gerstein, M Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57-63.


R Sasidharan, A Agarwal, J Rozowsky, M Gerstein (2009) An approach to comparing tiling array and high throughput sequencing technologies for genomic transcript mapping. BMC Res Notes 2: 150.


N Bhardwaj, KK Yan, MB Gerstein (2010) Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. Proc Natl Acad Sci U S A