Overview of Research in the Gerstein Lab in 2009
(references from papers.gersteinlab.org)
The amount of raw genomic data being generated these days is astonishing, thanks to next-generation sequencing techniques and other high-throughput methods. Keeping pace with new technology and data, we continue to explore the vast genome space of humans and microbes. We are developing new computational methods and tools to understand the genomic landscape, the proteins that a genome encodes, and the numerous networks of interacting proteins, DNA and other molecules that choreograph the function of a cell. At a fundamental level, we strive to annotate the human genome with newly identified attributes, develop tools for genomic analyses, and integrate data from various experiments to understand interrelated processes and networks. Here is a brief synopsis of our research during the calendar year 2009.
Genome annotation
Pseudogenes
On the pseudogene front, we have moved on from global analyses of pseudogenes to detailed studies of interesting pseudogene families. We have comprehensively examined two groups of pseudogenes: ribosomal protein (RP) pseudogenes and pseudogenes of glycolytic enzymes. Ribosomal protein pseudogenes constitute one of the largest families of pseudogenes (Balasubramanian et al., 2009). Our analysis indicates that RP pseudogenes abound in mammals, but very few are conserved between rodents and primates. This highlights the large amount of recent retrotranspositional activity in mammals, and the relatively larger amount of it in the rodent lineage. We also comprehensively annotated and analyzed pseudogenes of glycolytic enzymes in several species, including chimpanzee, mouse, rat, chicken and pufferfish. Based on comparative analyses, we showed that most of the pseudogenes are recent, having arisen after the divergence of the rodent and primate lineages (Liu et al, 2009). We have also developed Pseudofam (http://pseudofam.pseudogene.org), a database of pseudogene families based on protein families from the Pfam database (Lam et al, 2009). It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments.
Structural variants (SVs)
It has become clear that inter-individual differences are not restricted to SNPs: blocks of DNA can be deleted or inserted in different people. SVs are thus a recently identified form of variation between people. With unprecedented amounts of sequencing data now available, it is an opportune time to study SVs. However, the identification and analysis of SVs are still in their early stages.
We have developed a computational approach, PEMer, with simulation-based error models, which facilitates the identification of SVs from large-scale paired-end sequencing data (Korbel et al, 2009b). PEMer can be used to process data from several widely applied next-generation sequencing platforms. We showed, using simulations and real datasets, that PEMer improves SV-calling performance. We also re-scored a recently published dataset with PEMer and identified 18 new SVs.
We have built a simulation toolbox that optimizes the combination of different technologies for comparative genome re-sequencing, especially for reconstructing large SVs (Du et al, 2009). SV reconstruction is a difficult step in human genome re-sequencing. We show that combining different read lengths is more cost-effective than using a single length; that an optimal mixed sequencing strategy for reconstructing large novel SVs gives accurate detection of SNPs/indels; and that paired-end reads can improve reconstruction efficiency. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost. Our simulation results show quantitatively how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways.
Transcription factor binding sites
ChIP sequencing (ChIP-seq) has become a favorite method for genome-wide mapping of transcription factor binding sites on DNA, but analysis of these data is far from trivial. We have developed PeakSeq, an approach to identify peak regions in ChIP-seq data sets that correspond to sites of transcription factor binding (Rozowsky et al, 2009). The method scores the results of ChIP-seq experiments by compensating for variation in mappability across the genome and comparing the results against a normalized, matched control data set. For computational efficiency, we adopt a two-pass approach for scoring ChIP-seq data relative to the control data set.
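The idea of scoring in two passes against a control can be sketched as follows. This is only a toy illustration under simplifying assumptions, not the published implementation: the function names, the fixed density threshold, the background-based scaling of the control and the binomial test with p = 0.5 are all choices made here for illustration, and the read counts are made up.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def two_pass_score(sample, control, mappable_frac, min_density=10.0, alpha=0.05):
    """Hypothetical two-pass scoring of per-window read counts.

    sample, control: read counts per genomic window (same length)
    mappable_frac:   fraction of each window that is uniquely mappable
    """
    # Pass 1: pick candidate windows by mappability-adjusted read density.
    candidates = [i for i, (s, m) in enumerate(zip(sample, mappable_frac))
                  if m > 0 and s / m >= min_density]
    cand = set(candidates)

    # Normalize the control to the sample using the non-candidate windows
    # (assumed here to represent background).
    bg_sample = sum(s for i, s in enumerate(sample) if i not in cand)
    bg_control = sum(c for i, c in enumerate(control) if i not in cand)
    scale = bg_sample / bg_control if bg_control else 1.0

    # Pass 2: test only the candidates against the scaled control.
    peaks = []
    for i in candidates:
        s = sample[i]
        c = round(control[i] * scale)
        p_value = binomial_tail(s, s + c)
        if p_value <= alpha:
            peaks.append((i, p_value))
    return peaks

# Made-up counts for five windows; the last window is only half mappable.
sample_counts = [4, 30, 5, 20, 6]
control_counts = [4, 3, 5, 18, 0]
mappability = [1.0, 1.0, 1.0, 1.0, 0.5]
peaks = two_pass_score(sample_counts, control_counts, mappability)
```

In this toy input, window 3 passes the first pass but is rejected in the second because the control is equally enriched there, while the half-mappable window 4 becomes a candidate only after its count is adjusted for mappability. Restricting the statistical test to first-pass candidates is what saves computation.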
Genome analysis: new tools, methods and applications
We developed a computational approach that integrates microarray expression data with transcription factor binding site information to systematically identify transcription factors associated with patient survival in a specific cancer type (Cheng et al, 2009c). Using gene expression data sets for breast cancer and acute myeloid leukemia, we found that two transcription factor families, the steroid nuclear receptor family and the ATF/CREB family, are significantly correlated with the survival of patients with breast cancer, and that the transcription factor T-cell acute lymphocytic leukemia 1 is significantly correlated with the survival of patients with acute myeloid leukemia.
We have compared two different technologies for transcriptome profiling, tiling microarrays and tag-based sequencing, using published rice and Arabidopsis datasets (Sasidharan et al, 2009). By mapping tiling-array probe intensities onto sequencing tags, we show that there is a reasonable overlap in the transcripts identified by the two technologies.
MicroRNAs (miRNAs) are endogenous small RNA molecules that modulate gene expression at the post-transcriptional level in many eukaryotic cells. We studied the relationship between the evolution of miRNA targets and the lengths of their 3' UTRs (Cheng et al, 2009a). Our analysis implies a two-way evolutionary mechanism for miRNA targets based on their cellular roles and the length of their 3' UTRs. Functionally critical genes that are spatially or temporally expressed are stringently regulated by miRNAs. Housekeeping genes, on the other hand, seem to have shorter 3' UTRs to avoid miRNA regulation.
The regulatory effects of microRNAs can be investigated by examining expression changes of their target genes. We therefore defined an overall metric of regulatory effect (RE-score) for a specific microRNA to see how its effect changes across different conditions (Cheng et al, 2009b). The RE-score measures the inhibitory effect of a microRNA in a sample as the average difference in expression of its targets versus non-targets. We used this measure to differentiate between ER+ and ER- breast cancers. We applied this approach to five microarray breast cancer datasets and found that the expression of target genes of most microRNAs was more repressed in ER- than in ER+ samples; that is, microRNAs have higher RE-scores in ER- breast cancer.
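As a rough illustration, an RE-score of this kind can be computed as below. This is a minimal sketch, not the published procedure: the function name and the input layout are hypothetical, the expression values are made up, and the sign convention (mean of non-targets minus mean of targets, so that stronger repression of targets gives a higher score) is an assumption chosen to match the statement that more repressed targets correspond to higher RE-scores.

```python
from statistics import mean

def re_score(expression, targets):
    """Toy RE-score of one microRNA in one sample.

    expression: dict mapping gene name -> expression value in the sample
    targets:    set of gene names treated as targets of the microRNA
    Assumed sign convention: non-target mean minus target mean, so
    stronger repression of the targets yields a higher score.
    """
    target_vals = [v for g, v in expression.items() if g in targets]
    other_vals = [v for g, v in expression.items() if g not in targets]
    if not target_vals or not other_vals:
        raise ValueError("need at least one target and one non-target gene")
    return mean(other_vals) - mean(target_vals)

# Made-up expression values for two samples, for illustration only.
sample_a = {"g1": 2.0, "g2": 3.0, "g3": 6.0, "g4": 8.0}
sample_b = {"g1": 5.0, "g2": 6.0, "g3": 6.0, "g4": 8.0}
mir_targets = {"g1", "g2"}

score_a = re_score(sample_a, mir_targets)  # targets strongly repressed
score_b = re_score(sample_b, mir_targets)  # targets weakly repressed
```

Computing the score per sample in this way makes it possible to compare the same microRNA across conditions, for example between groups of ER+ and ER- samples.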
Network analysis
Networks provide a
natural framework for the organization and quantitative representation of all
the available data about molecular interactions. Most molecular network
analyses treat molecular networks as modular, isolated units. We describe
recent advances in the analysis of modularity in biological networks, focusing
on the increasing realization that a dynamic perspective is essential to
grouping molecules into modules and determining their collective function
(Alexander et al, 2009).
Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. With advances in sequencing technology, sequencing millions of microbes en masse has become a reality. However, making sense of such enormous amounts of data is challenging. We used the Global Ocean Survey (GOS) dataset to understand how these genomic sequences link distinct environmental conditions with specific biological processes (Gianoulis et al, 2009). Our interest was in understanding how particular pathways and subnetworks reflect the adaptation of microbial communities across environments and habitats. We introduced an approach that employs correlation and regression to relate multiple, continuously varying factors defining an environment to the extent of particular microbial pathways present at a geographic site. Moreover, we adapted canonical correlation analysis and related techniques to define an ensemble of weighted pathways rather than looking only at one-to-one correlations. We identified footprints predictive of their environment that can potentially be used as biosensors. For example, we show a strong multivariate correlation between the energy-conversion strategies of a community and multiple environmental gradients such as temperature. We also identified covariation in amino acid transport and cofactor synthesis, which suggests that limiting amounts of cofactor can explain the increased import of amino acids in nutrient-limited conditions.
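The simplest, one-to-one form of such an analysis relates a single environmental factor to the abundance of a single pathway across sites. The sketch below shows only that basic step, using Pearson correlation and a least-squares line; the canonical correlation extension is not shown. The function names, variable names and all numbers are hypothetical and are not taken from the GOS data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up values for five sampling sites: water temperature (deg C) and
# the fraction of reads assigned to one hypothetical pathway.
temperature_c = [10.0, 15.0, 20.0, 25.0, 30.0]
pathway_fraction = [0.10, 0.14, 0.21, 0.24, 0.31]

r = pearson(temperature_c, pathway_fraction)
slope, intercept = linear_fit(temperature_c, pathway_fraction)
```

Repeating this for every pairing of factor and pathway gives a table of one-to-one correlations. The multivariate techniques mentioned above go further by weighting many pathways and many factors together.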
Protein-protein interactions occur in numerous ways. They can arise from interactions between individual residues, between multiple residues forming a binding surface, or at the level of protein domains. Using a machine-learning algorithm on a representative yeast network, we showed that combining features from all levels improves prediction accuracy compared with using the features of each level individually (Yip et al, 2009a).
Macromolecular structure, function and dynamics
We have looked at protein conformational changes in relation to two key structural metrics: packing efficiency and disorder (Bhardwaj and Gerstein, 2009). It is well known that packing is very important for protein stability and function. We studied changes in packing efficiency during conformational changes, thus extending the analysis from a static context to a dynamic perspective. Our results show that the cores of the proteins remain mostly intact between two conformations, whereas the interfaces display the most elasticity, both in terms of disorder and change in packing efficiency.