Research Report 2009

Overview of Research in the Gerstein Lab in 2009

(references from papers.gersteinlab.org)

Thanks to next-generation sequencing techniques and other high-throughput methods, the amount of raw genomic data being generated these days is astonishing. Keeping pace with new technology and data, we continue to explore the vast genome space of humans and microbes. We are developing new computational methods and tools to understand the genomic landscape, the proteins that a genome encodes, and the numerous networks of proteins, DNA and other molecules whose interplay choreographs the function of a cell. At a fundamental level, we strive to annotate the human genome with newly identified attributes, develop tools for genomic analyses, and integrate data from various experiments to understand interrelated processes and networks. Here is a brief synopsis of our research during the calendar year 2009.

Genome annotation

Pseudogenes

On the pseudogene front, we have moved on from global analyses of pseudogenes to detailed studies of interesting pseudogene families. We have comprehensively examined two groups: ribosomal protein (RP) pseudogenes and pseudogenes of glycolytic enzymes. RP pseudogenes constitute one of the largest families of pseudogenes (Balasubramanian et al., 2009). Our analysis indicates that RP pseudogenes abound in mammals, but very few are conserved between rodents and primates. This highlights the large amount of recent retrotranspositional activity in mammals, with relatively more of it in the rodent lineage. We also comprehensively annotated and analyzed pseudogenes of glycolytic enzymes in several species, including chimpanzee, mouse, rat, chicken and pufferfish. Based on comparative analyses, we showed that most of these pseudogenes are recent, having arisen after the divergence of the rodent and primate lineages (Liu et al., 2009). We have also developed Pseudofam (http://pseudofam.pseudogene.org), a database of pseudogene families based on protein families from the Pfam database (Lam et al., 2009). It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments.

Structural variants (SVs)

It has become clear that inter-individual differences are not restricted to SNPs: blocks of DNA have been shown to be deleted or inserted in different individuals. SVs are thus a recently identified form of variation between people. With unprecedented amounts of sequencing data now available, it is an opportune time to study SVs. However, the identification and analysis of SVs are still in their early stages.

We have developed a computational approach, PEMer, with simulation-based error models, which facilitates the identification of SVs from large-scale paired-end sequencing data (Korbel et al., 2009b). PEMer can be used to process data from several widely applied next-generation sequencing platforms. Using simulations and real datasets, we showed that PEMer improves SV-calling performance. We also re-scored a recently published dataset with PEMer and identified 18 new SVs.
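
To make the general idea behind paired-end SV calling concrete, here is a minimal sketch. It is not PEMer's actual algorithm: it assumes a fixed expected insert size and a fixed cutoff, whereas the text above describes simulation-based error models. The function name `classify_pairs` and all values are hypothetical.

```python
def classify_pairs(spans, expected, cutoff):
    """Label each read pair by how its mapped span compares to the expected insert size.

    spans    -- mapped distances between the two ends of each pair (bp)
    expected -- expected insert size of the library (bp), an assumed constant
    cutoff   -- allowed deviation before a pair is flagged (bp), an assumed constant
    """
    calls = []
    for span in spans:
        if span > expected + cutoff:
            # Ends map too far apart on the reference: sequence is missing in the sample.
            calls.append("deletion")
        elif span < expected - cutoff:
            # Ends map too close together: the sample carries extra sequence.
            calls.append("insertion")
        else:
            calls.append("concordant")
    return calls


# Hypothetical spans for three read pairs from a 3 kb library.
calls = classify_pairs([3000, 5200, 1100], expected=3000, cutoff=1000)
print(calls)
```

In practice a single discordant pair is weak evidence, so callers typically require several pairs supporting the same event before reporting it.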

We have built a simulation toolbox that optimizes the combination of different technologies for comparative genome re-sequencing, especially for reconstructing large SVs (Du et al., 2009). SV reconstruction is a difficult step in human genome re-sequencing. We show that (i) combining different read lengths is more cost-effective than using a single length; (ii) an optimal mixed sequencing strategy for reconstructing large novel SVs also gives accurate detection of SNPs/indels; and (iii) paired-end reads can improve reconstruction efficiency. Our simulation results show quantitatively how much improvement one can gain in reconstructing large structural variants by integrating different technologies in optimal ways. This strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.

Transcription factor binding sites

ChIP sequencing (ChIP-seq) has become a favored method for genome-wide mapping of transcription factor binding sites on DNA. However, analyzing these data is far from trivial. We have developed PeakSeq, an approach to identify peak regions in ChIP-seq datasets that correspond to sites of transcription factor binding (Rozowsky et al., 2009). The method scores the results of ChIP-seq experiments by compensating for the mappability of the genome and comparing the results against a normalized matching control dataset. For computational efficiency, we adopt a two-pass approach for scoring ChIP-seq data relative to the control.

Genome analysis: new tools, methods and applications

We developed a computational approach that integrates microarray expression data with transcription factor binding site information to systematically identify transcription factors associated with patient survival in a given cancer type (Cheng et al., 2009c). Using gene expression datasets for breast cancer and acute myeloid leukemia, we found that two transcription factor families, the steroid nuclear receptor family and the ATF/CREB family, are significantly correlated with the survival of patients with breast cancer, and that a transcription factor named T-cell acute lymphocytic leukemia 1 is significantly correlated with the survival of patients with acute myeloid leukemia.

We have compared two different technologies for transcriptome profiling, tiling microarrays and sequencing-based profiling, using published rice and Arabidopsis datasets (Sasidharan et al., 2009). By mapping probe intensities onto sequencing tags, we show that there is a reasonable overlap in the transcripts identified by the two technologies.

MicroRNAs (miRNAs) are endogenous small RNA molecules that modulate gene expression at the post-transcriptional level in many eukaryotic cells. We studied the relationship between the evolution of miRNA targets and the lengths of their 3' UTRs (Cheng et al., 2009a). Our analysis implies a two-way evolutionary mechanism for miRNA targets based on their cellular roles and the length of their 3' UTRs. Functionally critical genes that are spatially or temporally expressed are stringently regulated by miRNAs. Housekeeping genes, on the other hand, seem to have shorter 3' UTRs to avoid miRNA regulation.

The regulatory effects of microRNAs have been investigated by examining expression changes of their target genes. Building on this, we defined an overall metric of regulatory effect (RE-score) for a specific microRNA to see how its effect changes across different conditions (Cheng et al., 2009b). The RE-score measures the inhibitory effect of a microRNA in a sample as the average difference in expression between its targets and non-targets. We used this measure to differentiate between ER+ and ER- breast cancers. Applying the approach to five microarray breast cancer datasets, we found that the target genes of most microRNAs were more repressed in ER- than in ER+ samples; that is, microRNAs have higher RE-scores in ER- breast cancer.
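
The RE-score definition above can be illustrated with a simplified sketch. The sign convention (non-target mean minus target mean, so that stronger repression of targets gives a higher score) is an assumption chosen to match the statement that higher RE-scores indicate more repressed targets. The published method may use a different statistic, and the gene names and values are hypothetical.

```python
from statistics import mean


def re_score(expression, targets):
    """Average expression of non-targets minus average expression of targets.

    expression -- dict mapping gene name to expression level in one sample
    targets    -- set of gene names predicted to be targets of the microRNA
    """
    target_vals = [v for g, v in expression.items() if g in targets]
    other_vals = [v for g, v in expression.items() if g not in targets]
    if not target_vals or not other_vals:
        raise ValueError("need at least one target and one non-target gene")
    # Assumed sign convention: a higher score means targets are more repressed.
    return mean(other_vals) - mean(target_vals)


# Hypothetical sample in which the two target genes are expressed at lower levels.
sample = {"geneA": 2.0, "geneB": 3.0, "geneC": 6.0, "geneD": 8.0}
score = re_score(sample, {"geneA", "geneB"})
print(score)
```

Computing this score per sample and comparing its distribution between two groups of samples corresponds to the ER+ versus ER- comparison described above.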

Network analysis

Networks provide a natural framework for organizing and quantitatively representing the available data on molecular interactions. Most molecular network analyses treat molecular networks as modular, isolated units. We describe recent advances in the analysis of modularity in biological networks, focusing on the increasing realization that a dynamic perspective is essential to grouping molecules into modules and determining their collective function (Alexander et al., 2009).

Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. With advances in sequencing technology, sequencing millions of microbes en masse has become a reality. However, making sense of such enormous amounts of data is challenging. We used the Global Ocean Survey (GOS) dataset to understand how these genomic sequences link distinct environmental conditions with specific biological processes (Gianoulis et al., 2009). Our interest was in understanding how particular pathways and subnetworks reflect the adaptation of microbial communities across environments and habitats. We introduced an approach that employs correlation and regression to relate multiple, continuously varying factors defining an environment to the extent of particular microbial pathways present at a geographic site. Moreover, we adapted canonical correlation analysis and related techniques to define an ensemble of weighted pathways rather than looking only at one-to-one correlations. We identified footprints predictive of their environment that can potentially be used as biosensors. For example, we show a strong multivariate correlation between the energy-conversion strategies of a community and multiple environmental gradients such as temperature. We also identified covariation in amino acid transport and cofactor synthesis, which suggests that limiting amounts of cofactor can explain the increased import of amino acids in nutrient-limited conditions.
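
As a small illustration of the simplest case mentioned above, a one-to-one correlation between a single environmental factor and a single pathway, the sketch below computes a Pearson correlation across sampling sites. It does not implement the canonical correlation analysis used for the multivariate results, and the site values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Hypothetical data: water temperature at four sites and the fraction of
# sequence reads assigned to one pathway at each of those sites.
temperature = [10.0, 15.0, 20.0, 25.0]
pathway_fraction = [0.02, 0.03, 0.04, 0.05]
r = pearson(temperature, pathway_fraction)
print(round(r, 3))
```

Canonical correlation analysis generalizes this by finding weighted combinations of many environmental factors and many pathways that are maximally correlated with each other.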

Protein-protein interactions occur in numerous ways. They can arise at the level of individual residues, of multiple residues forming a binding surface, or of whole protein domains. Using a machine-learning algorithm on a representative yeast network, we showed that combining features from all of these levels improves prediction accuracy compared with using each level individually (Yip et al., 2009a).

Macromolecular structure, function and dynamics

We have looked at protein conformational changes in relation to two key structural metrics: packing efficiency and disorder (Bhardwaj and Gerstein, 2009). It is well known that packing is very important for protein stability and function. We studied changes in packing efficiency during conformational changes, thus extending the analysis from a static context to a dynamic perspective. Our results show that the cores of the proteins remain mostly intact between two conformations, whereas the interfaces display the most elasticity, both in terms of disorder and change in packing efficiency.