RNA Analyses and RNA-Seq
b) Invention of RNA-Seq and Analysis of the Yeast Genome
The identification of untranslated
regions, introns, and coding regions within an
organism remains challenging. We developed a quantitative sequencing-based
method called RNA-Seq for mapping transcribed
regions, in which complementary DNA fragments are subjected to high-throughput
sequencing and mapped to the genome. We applied RNA-Seq
to generate a high-resolution transcriptome map of
the yeast genome and demonstrated that most (74.5%) of the nonrepetitive
sequence of the yeast genome is transcribed. We confirmed many known and
predicted introns and demonstrated that others are
not actively used. Alternative initiation codons and
upstream open reading frames also were identified for many yeast genes. We also
found unexpected 3'-end heterogeneity and the presence of many overlapping
genes. These results indicate that the yeast transcriptome
is more complex than previously appreciated. We further demonstrated that RNA-Seq is highly accurate for measuring RNA levels, much
more so than DNA microarrays.
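As a rough illustration of how a "fraction of the genome transcribed" figure can be derived from mapped reads, the sketch below merges overlapping read intervals and divides the covered length by the sequence length. It is not the pipeline used in the study: the function name, the half-open interval convention, and the single-sequence simplification (no repeat masking, no expression threshold) are all assumptions made for the example.

```python
def covered_fraction(read_intervals, sequence_length):
    """Return the fraction of positions covered by at least one read.

    read_intervals: iterable of (start, end) half-open intervals on one sequence.
    sequence_length: length of the (hypothetical) nonrepetitive sequence.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(read_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent read: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / sequence_length


if __name__ == "__main__":
    toy_reads = [(0, 10), (5, 20), (30, 40)]  # illustrative values only
    print(covered_fraction(toy_reads, 100))
```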
c) Development of Paired-End RNA-Seq and Transcript Isoform Analysis
RNA-Seq helps define transcribed regions and some intron-exon structure, but its ability to resolve the latter is limited. To broaden our capabilities,
we developed paired-end RNA sequencing, which together with long-read RNA-Seq provides more information on isoform
structure. We applied these methods to the analysis of human embryonic stem cells (hESCs)
and their differentiation into neurons. Undifferentiated hESCs
as well as cells at three stages of early neural differentiation, N1 (early
initiation), N2 (neural progenitor), and N3 (early glial-like), were
analyzed using a combination of single read, paired-end read, and long read RNA
sequencing. The results revealed enormous complexity in gene transcription and
splicing dynamics during neural cell differentiation. We found previously unannotated transcripts and spliced isoforms
specific for each stage of differentiation. Interestingly, splicing isoform diversity is highest in undifferentiated hESCs and decreases upon differentiation, a phenomenon we
call isoform specialization. During neural
differentiation, we observed differential expression of many types of genes,
including those involved in key signaling pathways, and a large number of
extracellular receptors exhibit stage-specific regulation. These results
provide a valuable resource for studying neural differentiation and reveal
insights into the mechanisms underlying in vitro neural differentiation of hESCs, such as neural fate specification, neural progenitor
cell identity maintenance, and the transition from a predominantly neuronal
state into one with increased gliogenic potential.
ChIP-Seq
Short-read high-throughput
DNA sequencing technologies provide new tools to answer biological questions.
However, high cost and low throughput limit their widespread use, particularly
in organisms with smaller genomes such as S. cerevisiae.
Although ChIP-Seq in mammalian cell lines is
replacing array-based ChIP-chip as the standard for
transcription factor binding studies, ChIP-Seq in
yeast is still underutilized compared to ChIP-chip.
We developed a multiplex barcoding system that allows
simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the
chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and
RNA Pol II) and a reference sample (input DNA) in a
single experiment and demonstrated its utility for rapid and accurate results at
reduced costs. We have now applied this methodology to a number of organisms
(e.g. C. elegans)
and have helped many labs with this approach.
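The core of a multiplexed run is assigning each read back to its sample by barcode. The sketch below is a minimal illustration under stated assumptions, not the published protocol: it assumes the barcode occupies the first few bases of each read, requires an exact match, and uses made-up barcode sequences and a hypothetical barcode length of 4.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length=4):
    """Group reads by sample using an exact-match prefix barcode.

    reads: iterable of read sequences (strings).
    barcode_to_sample: dict mapping barcode sequence to sample name.
    barcode_length: assumed barcode length (hypothetical default).
    Returns a dict mapping sample name to reads with the barcode trimmed off.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        # Reads whose prefix matches no known barcode are set aside.
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


if __name__ == "__main__":
    barcodes = {"ACGT": "Ste12", "TGCA": "input"}  # illustrative barcodes
    print(demultiplex(["ACGTGGGG", "TGCATTTT", "NNNNCCCC"], barcodes))
```

After this step, each sample's reads can be mapped and scored against the shared input sample as in a single-sample experiment.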
...
Analysis of Human Variation
a) Development of Paired End Mapping
Structural variation of the
genome involves kilobase- to megabase-sized
deletions, duplications, insertions, inversions, and complex combinations of
rearrangements. We introduce high-throughput and massive paired-end mapping
(PEM), a large-scale genome-sequencing method to identify structural variants (SVs) approximately 3 kilobases
(kb) or larger that combines the rescue and capture of paired ends of 3-kb
fragments, massive 454 sequencing, and a computational approach to map DNA
reads onto a reference genome. PEM was used to map SVs
in an African and in a putatively European individual and identified shared and
divergent SVs relative to the reference genome.
Overall, we fine-mapped more than 1000 SVs and
documented that the number of SVs among humans is
much larger than initially hypothesized; many of the SVs
potentially affect gene function. The breakpoint junction sequences of more
than 200 SVs were determined with a novel pooling
strategy and computational analysis. Our analysis provided insights into the
mechanisms of SV formation in humans.
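The logic behind paired-end mapping can be sketched as follows: the two ends of a roughly 3-kb fragment should map about 3 kb apart and in the expected orientation on the reference, and departures from that expectation point to a structural variant. The function below is an illustrative simplification, not the published PEM procedure. The tolerance value is hypothetical, and a real analysis would derive cutoffs from the observed span distribution and would presumably require several independent pairs supporting the same event.

```python
def classify_pair(mapped_span, orientation_ok, expected_span=3000, tolerance=1000):
    """Classify one paired-end read by its span on the reference genome.

    mapped_span: distance between the two ends when mapped to the reference.
    orientation_ok: True if the ends map in the expected relative orientation.
    expected_span, tolerance: assumed fragment size and slack (hypothetical values).
    """
    if not orientation_ok:
        return "inversion"
    if mapped_span > expected_span + tolerance:
        # Ends map farther apart than the fragment is long:
        # the sample lacks sequence that the reference contains.
        return "deletion"
    if mapped_span < expected_span - tolerance:
        # Ends map closer together than the fragment is long:
        # the sample carries sequence absent from the reference.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    for span, ok in [(3100, True), (9000, True), (800, True), (3000, False)]:
        print(span, ok, classify_pair(span, ok))
```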
b) Variation in Transcription Factor Binding Among
Humans
Most genetic variations lie
outside the coding sequences of known genes, and their
effect on genome function is not known. To understand their role in
transcription factor binding, we examined genome-wide differences in
transcription factor (TF) binding in several humans and a single chimpanzee
using ChIP-Seq and correlated these differences with
genetic changes. The binding sites
of RNA polymerase II (Pol II) and a key regulator of
immune responses, nuclear factor kappaB (p65), were
mapped in 10 lymphoblastoid cell lines, and 25% and
7.5% of the respective binding regions were found to differ between
individuals. Binding differences were frequently associated with
single-nucleotide polymorphisms and genomic structural variants, and these
differences were often correlated with differences in gene expression,
suggesting functional consequences of binding variation. Furthermore, comparing
Pol II binding between humans and chimpanzee suggests
extensive divergence in TF binding. Our results indicate that many differences
in individuals and species occur at the level of TF binding, and they provide
insight into the genetic events responsible for these differences.
...
Informatics
Since the last progress
report, we have developed tools for the analysis of structural variants,
an approach for comparing sequencing and array data, and
methods for analyzing the hierarchical structure of regulatory networks.
a) Structural Variants (Lam et al., '10)
Structural variants (SVs) are a
major source of human genomic variation; however, characterizing them at
nucleotide resolution remains challenging. We assemble a library of breakpoints
at nucleotide resolution by collating and standardizing ~2,000 published SVs. For each breakpoint, we infer its ancestral state
(through comparison to primate genomes) and its mechanism of formation (e.g., nonallelic homologous recombination, NAHR). We characterize
breakpoint sequences with respect to genomic landmarks, chromosomal location,
sequence motifs and physical properties, finding that the occurrence of
insertions and deletions is more balanced than previously reported and that
NAHR-formed breakpoints are associated with relatively rigid, stable DNA
helices. Finally, we demonstrate an approach, BreakSeq,
for scanning the reads from short-read sequenced genomes against our breakpoint
library to accurately identify previously overlooked SVs,
which we then validate by PCR. As new data become available, we expect our BreakSeq approach will become more sensitive and facilitate
rapid SV genotyping of personal genomes.
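The idea of scanning reads against a breakpoint library can be illustrated with a toy example. The sketch below is not the BreakSeq implementation: it assumes each library entry is stored as a pair of flanking sequences, uses exact substring matching in place of a real aligner, and counts a read as support only if it extends a hypothetical minimum number of bases on both sides of the junction. All names and sequences are invented for the example.

```python
def junction_hits(reads, junctions, min_overhang=3):
    """Count reads that span each breakpoint junction.

    reads: iterable of read sequences.
    junctions: dict mapping junction name to (left_flank, right_flank).
    min_overhang: assumed minimum bases required on each side of the junction.
    """
    hits = {name: 0 for name in junctions}
    for read in reads:
        for name, (left, right) in junctions.items():
            sequence = left + right
            junction_pos = len(left)
            start = sequence.find(read)
            while start != -1:
                end = start + len(read)
                # The read must cross the junction with enough bases on both sides.
                if start <= junction_pos - min_overhang and end >= junction_pos + min_overhang:
                    hits[name] += 1
                    break
                start = sequence.find(read, start + 1)
    return hits


if __name__ == "__main__":
    library = {"del1": ("AAACCCGGG", "TTTACGTAC")}  # toy junction
    print(junction_hits(["CGGGTTTA", "AAACCCGG"], library))
```

Reads that fall entirely within one flank are uninformative, because they would map equally well to the reference allele; only junction-spanning reads are counted.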
b) Sequencing vs. Arrays (Sasidharan et al., '09)
There are two main technologies for transcriptome
profiling, namely, tiling microarrays and high-throughput sequencing. Recently
there has been a tremendous amount of excitement about the latter because of
the advent of next-generation sequencing technologies and their promise.
Consequently, the question of the moment is how these two technologies compare.
Here we attempt to develop an approach to do a fair comparison of transcripts
identified from tiling microarray and MPSS sequencing data. This comparison is
a challenging task because the sequencing data is discrete while the tiling
array data is continuous. We use the published rice and Arabidopsis datasets,
which currently provide the best-matched sets of array and sequencing experiments
using a slightly earlier generation of sequencing, the MPSS tag sequencing
technology. After scoring the arrays consistently in both organisms, a
first-pass comparison reveals a surprisingly small overlap in transcripts of
22% and 66% in rice and Arabidopsis, respectively. However, when we do the
analysis in detail, we find that this is an underestimate. In particular, when
we map the probe intensities onto the sequencing tags and then look at their
intensity distribution, we see that they are very similar to exons. Furthermore, restricting our comparison to only
protein-coding gene loci revealed a very good overlap between the two
technologies. Our approach to comparing genome tiling microarray and
MPSS sequencing data suggests that there is actually a reasonable overlap in
transcripts identified by the two technologies. This overlap
is distorted by the scoring and thresholding used in the
tiling array analysis.
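A first-pass overlap of the kind described above can be computed by asking what fraction of array-called transcribed regions contain at least one sequencing tag. The sketch below is only an illustration of that calculation, not the scoring procedure used in the paper: it assumes array calls are half-open intervals and tags are single coordinates on the same sequence, and the function name is hypothetical.

```python
from bisect import bisect_left


def overlap_fraction(array_regions, tag_positions):
    """Fraction of array-called regions that contain at least one sequencing tag.

    array_regions: list of (start, end) half-open intervals called from the array.
    tag_positions: list of tag coordinates on the same sequence.
    """
    if not array_regions:
        return 0.0
    tags = sorted(tag_positions)
    supported = 0
    for start, end in array_regions:
        # Find the first tag at or after the region start and
        # check whether it still falls inside the region.
        index = bisect_left(tags, start)
        if index < len(tags) and tags[index] < end:
            supported += 1
    return supported / len(array_regions)


if __name__ == "__main__":
    print(overlap_fraction([(0, 10), (20, 30), (40, 50)], [5, 45, 100]))
```

Because the array calls depend on an intensity threshold, the same function evaluated at different thresholds would give different overlaps, which is one way to see how thresholding distorts the comparison.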
c) Regulatory Network Structure Analysis (Bhardwaj et al., '10)
Gene regulatory networks have been shown to share some common aspects with
commonplace governance structures. Thus, we can get some intuition into their
organization by arranging them into well-known hierarchical layouts. These
hierarchies, in turn, can be placed between the extremes of autocracies, with
well-defined levels and clear chains of command, and democracies, without such
defined levels and with more co-regulatory partnerships between regulators. In
general, the presence of partnerships decreases the variation in information
flow amongst nodes within a level, more evenly distributing stress. We study
various regulatory networks for five diverse species, Escherichia coli to
human. We specify three levels of regulators (top, middle, and bottom), which
collectively govern the non-regulator targets lying in the lowest, fourth level.
We define quantities for nodes, levels, and entire networks that measure their
degree of collaboration and autocratic vs. democratic character. We show
individual regulators have a range of partnership tendencies: Some regulate
their targets in combination with other regulators in local instantiations of
democratic structure, whereas others regulate mostly in isolation, in more
autocratic fashion. Overall, we show that in all networks studied the middle
level has the highest collaborative propensity and coregulatory
partnerships occur most frequently amongst midlevel regulators. There is,
however, one notable difference between networks in different species: The
amount of collaborative regulation and democratic character increases markedly
with overall genomic complexity.
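To make the notion of partnership tendency concrete, the sketch below computes, for each regulator, the fraction of its targets that are also controlled by at least one other regulator. This is a simplified stand-in chosen for illustration and should not be read as the exact quantity defined in the paper; the function name and the toy network are assumptions.

```python
from collections import defaultdict


def collaboration_fraction(regulator_targets):
    """For each regulator, return the fraction of its targets that are co-regulated.

    regulator_targets: dict mapping regulator name to a set of target names.
    A value near 1 suggests a "democratic" regulator that mostly acts with
    partners; a value near 0 suggests a more "autocratic" one.
    """
    regulators_of = defaultdict(set)
    for regulator, targets in regulator_targets.items():
        for target in targets:
            regulators_of[target].add(regulator)

    fractions = {}
    for regulator, targets in regulator_targets.items():
        if not targets:
            fractions[regulator] = 0.0
            continue
        shared = sum(1 for target in targets if len(regulators_of[target]) > 1)
        fractions[regulator] = shared / len(targets)
    return fractions


if __name__ == "__main__":
    toy_network = {"A": {"g1", "g2"}, "B": {"g2", "g3"}, "C": {"g4"}}
    print(collaboration_fraction(toy_network))
```

Averaging such per-regulator values within each level of the hierarchy would give one possible level-wise summary of collaborative propensity.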
Publications
Camarena L, Bruno V, Euskirchen
G, Poggio S, Snyder M. Molecular mechanisms of
ethanol-induced pathogenesis revealed by RNA-sequencing. PLoS Pathog. 2010 Apr 1;6(4):e1000834.
Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M,
Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel
JO, Snyder M. Variation in
transcription factor binding among humans. Science. 2010 Apr 9;328(5975):232-5. Epub 2010 Mar 18.
Wu JQ, Habegger L, Noisa P, Szekely A, Qiu C, Hutchison S, Raha D, Egholm M, Lin H, Weissman S, Cui
W, Gerstein M, Snyder M. Dynamic transcriptomes during neural differentiation of human
embryonic stem cells revealed by short, long, and paired-end sequencing.
Proc Natl Acad Sci U S A. 2010 Mar 16;107(11):5254-9. Epub 2010 Mar 1.
Canaan A, Haviv I, Urban AE,
Schulz VP, Hartman S, Zhang Z, Palejev D, Deisseroth AB, Lacy J, Snyder M, Gerstein M, Weissman SM. EBNA1 regulates cellular gene
expression by binding cellular promoters. Proc Natl Acad Sci U S A. 2009 Dec 29;106(52):22421-6. Epub 2009 Dec 22.
Snyder M, Du J, Gerstein M. Personal genome sequencing: current
approaches and challenges. Genes Dev. 2010 Mar 1;24(5):423-31.
Nagalakshmi U, Waern
K, Snyder M. RNA-Seq: a
method for comprehensive transcriptome analysis.
Curr Protoc Mol Biol. 2010 Jan;Chapter 4:Unit 4.11.1-13.
Korbel JO, Tirosh-Wagner
T, Urban AE, Chen XN, Kasowski M, Dai L, Grubert F, Erdman C, Gao MC,
Lange K, Sobel EM, Barlow GM, Aylsworth
AS, Carpenter NJ, Clark RD, Cohen MY, Doran E, Falik-Zaccai
T, Lewin SO, Lott IT, McGillivray BC, Moeschler JB, Pettenati MJ, Pueschel SM, Rao KW, Shaffer LG, Shohat M, Van Riper AJ, Warburton D, Weissman
S, Gerstein MB, Snyder M, Korenberg JR. The genetic architecture of Down
syndrome phenotypes revealed by high-resolution analysis of human segmental trisomies. Proc Natl Acad Sci U S A. 2009 Jul 21;106(29):12031-6. Epub 2009 Jul 13.
HY Lam, XJ Mu, AM Stütz, A Tanzer,
PD Cayting, M Snyder, PM Kim, JO Korbel,
MB Gerstein (2010) Nucleotide-resolution analysis of structural variants
using BreakSeq and a breakpoint library. Nat Biotechnol 28: 47-55.
C Cheng, N Bhardwaj, M Gerstein (2009) The relationship between the evolution
of microRNA targets and the length of their UTRs. BMC Genomics 10: 431.
Y Xia, EA Franzosa, MB Gerstein (2009) Integrated
assessment of genomic correlates of protein evolutionary rate. PLoS Comput Biol 5: e1000413.
M Snyder, S Weissman, M Gerstein (2009) Personal phenotypes to go with personal genomes. Mol Syst Biol 5: 273.
TA Gianoulis, J Raes, PV Patel, R Bjornson, JO Korbel, I Letunic,
T Yamada, A Paccanaro, LJ Jensen, M Snyder, P Bork,
MB Gerstein (2009) Quantifying environmental adaptation of metabolic pathways in metagenomics. Proc Natl Acad Sci U S A 106: 1374-9.
P Lefrançois, GM Euskirchen, RK Auerbach, J Rozowsky, T Gibson,
CM Yellman, M Gerstein, M Snyder (2009) Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing.
BMC Genomics 10: 37.
X Zhang, Z Lian, C Padden, MB Gerstein, J Rozowsky, M Snyder, TR Gingeras,
P Kapranov, SM Weissman, PE Newburger (2009) A myelopoiesis-associated regulatory intergenic
noncoding RNA transcript within the human HOXA cluster. Blood 113: 2526-34.
LY Wang, A Abyzov, JO Korbel,
M Snyder, M Gerstein (2009) MSB: a mean-shift-based
approach for the analysis of structural variation in the genome.
Genome Res 19: 106-17.
Z Wang, M Gerstein, M Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics.
Nat Rev Genet 10: 57-63.
R Sasidharan, A Agarwal, J Rozowsky, M Gerstein (2009) An approach to comparing tiling array and high throughput
sequencing technologies for genomic transcript mapping. BMC Res Notes 2: 150.
N Bhardwaj,
KK Yan, MB Gerstein (2010) Proc Natl Acad Sci U S A