Year 2

A. Overview of Work Completed

As I mapped out in my proposal, most of my work in the second year involved applying my library of structural and functional templates to a moderate number of genomes, developing new methods of predicting protein attributes (e.g. function) from expression data, and constructing an appropriate database infrastructure to support this work. We have also participated in a number of collaborations with experimental scientists in proteomics: A Edwards and C Arrowsmith in Toronto, and L Regan and M Snyder at Yale. A related highlight of the year was a survey of pseudogenes in the worm. I have graphically summarized some aspects of my current work in Exhibit 1.

Our work currently focuses on a standard dataset of 20 genomes: 18 prokaryotes plus yeast and worm. We are now working on fitting the fly, cress, and, most importantly, human genomes into this framework. It is essential to integrate the human genome into comparisons with microbial ones, because identifying families and folds that occur in microbial pathogens but not in humans holds great promise for suggesting drug targets.

B. Specific Highlights from the Past Year

I summarize below in some detail the specific highlights from the past year, connecting each to the relevant publication, the full citation of which is listed in section 5 below.

i. Analysis of Expression Data in Relation to Function

One of the aims of the grant was to develop new approaches to predicting protein properties, such as structure, function, interactions, and localization, on a genome-wide scale. We have achieved this by using gene expression data. We observed that levels of gene expression were closely correlated with a protein's eventual subcellular localization, with high levels of gene expression characteristic of cytoplasmic proteins and low levels characteristic of nuclear and membrane proteins (Drawid et al., 2000, TIG).

From this we were able to develop a system to integrate expression information and sequence pattern information for yeast in a Bayesian network (Drawid & Gerstein, 2000, JMB). This allows the prediction of the subcellular localization of the ~4000 yeast proteins with unknown localization. We also studied our ability to predict protein function based on expression (Gerstein & Jansen, 2000, COSB), finding a relationship that applied for certain classes of experiments and functions but did not apply globally.
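The idea of combining an expression feature with sequence-pattern features can be illustrated with a much simpler model than the Bayesian network described above. The following is a minimal naive Bayes sketch, not the published system: the compartment names, the feature names (`high_expression`, `has_nls_motif`, `has_tm_helix`), and every probability are hypothetical values chosen only for illustration.

```python
from math import exp, log

# Hypothetical prior probability of each compartment (illustrative numbers only).
PRIOR = {"cytoplasm": 0.5, "nucleus": 0.3, "membrane": 0.2}

# Hypothetical P(feature is present | compartment); not taken from any publication.
LIKELIHOOD = {
    "high_expression": {"cytoplasm": 0.6, "nucleus": 0.2, "membrane": 0.2},
    "has_nls_motif": {"cytoplasm": 0.05, "nucleus": 0.6, "membrane": 0.05},
    "has_tm_helix": {"cytoplasm": 0.05, "nucleus": 0.05, "membrane": 0.8},
}


def posterior(features):
    """Return P(compartment | features) for a dict of feature name -> bool.

    Features are treated as conditionally independent given the compartment
    (the naive Bayes assumption), which a full Bayesian network need not make.
    """
    log_score = {}
    for compartment, prior in PRIOR.items():
        score = log(prior)
        for name, present in features.items():
            p_present = LIKELIHOOD[name][compartment]
            score += log(p_present if present else 1.0 - p_present)
        log_score[compartment] = score
    # Normalize in a numerically stable way.
    top = max(log_score.values())
    unnormalized = {c: exp(s - top) for c, s in log_score.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def predict(features):
    """Return the compartment with the highest posterior probability."""
    post = posterior(features)
    return max(post, key=post.get)
```

For example, `predict({"high_expression": False, "has_nls_motif": True, "has_tm_helix": False})` selects the nuclear compartment under these made-up numbers. In practice the probabilities would be estimated from proteins of known localization.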

ii. Genome-wide Characterization of Protein Function in Microbes

Following on our analysis of the relation between expression and function, we explored a number of approaches to large-scale characterization of protein function, clearly one of the major goals of genome analysis. We developed methods of describing functional shifts based on changing patterns of residue conservation (Naylor & Gerstein, 2000, JME). We have begun preliminary work on the analysis of metabolic pathways in pathogens (Das et al., 2000, J. Mol. Microbiol. Biotech.). Finally, we are collaborating with Prof Michael Snyder on a large-scale functional analysis of the yeast genome. Prof Snyder has developed a system for experimentally assessing the functions of thousands of yeast genes with protein chips. We have helped interpret these experiments computationally, developing a database for the results and clustering them (Zhu et al., 2000, Nat. Genetics).

iii. Pseudogenes

We have published a survey of pseudogenes in the worm genome (Harrison et al., 2001, NAR). This represents the analysis of a large metazoan genome in terms of protein structure. It describes the occurrence of common folds and families in pseudogenes. Some pseudogenes are highlighted as possibly having been transferred from microorganisms.

iv. Collaboration on Experimental Structural Genomics of Microbes

We have done a variety of computational analyses designed to interface with experimental structural genomics. This work is the direct experimental complement of the computational work proposed in the grant. Our efforts are part of the Northeast Structural Genomics Consortium (NESG). We have designed approaches to pick targets prospectively for subsequent structural analysis and then to do retrospective data mining of the results. In particular, we have collaborated with the Ontario Proteomics group led by C Arrowsmith and A Edwards, helping them to interpret their large-scale structural genomics study of the archaeon M. thermoautotrophicum (Christendat, 2000, Nature Struc. Biol.). Our analysis consisted of building decision trees to help predict which proteins would perform well in high-throughput protein purification.
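To illustrate the kind of step a decision tree takes, the sketch below chooses the single feature that best separates proteins that purified well from those that did not, using information gain. It is a generic, textbook-style example and not the analysis from the cited paper; the feature names (`has_tm_helix`, `long_sequence`) and the four example proteins are invented for illustration.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(rows, labels, feature):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Return the feature with the highest information gain (the tree's root test)."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


# Hypothetical training data: one dict of features per protein.
EXAMPLE_ROWS = [
    {"has_tm_helix": True, "long_sequence": True},
    {"has_tm_helix": True, "long_sequence": False},
    {"has_tm_helix": False, "long_sequence": True},
    {"has_tm_helix": False, "long_sequence": False},
]
# Hypothetical outcome: True if the protein purified well.
EXAMPLE_LABELS = [False, False, True, True]
```

Calling `best_split(EXAMPLE_ROWS, EXAMPLE_LABELS)` picks `has_tm_helix` for this toy data, because that feature alone separates the two outcomes. A full tree would apply the same selection recursively to each resulting subset.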

In a separate analysis, we collaborated with L Regan at Yale in identifying unusual proteins in a small model genome, that of M. genitalium. We identified 11 proteins that had no known structure or function but had homologs in other genomes. These were subsequently subjected to CD analysis (Balasubramanian, 2000, NAR).

v. Large-Scale Integrative Database Systems

I have developed a number of systems for integrating a large amount of heterogeneous information related to microbial genomes. In particular, we have built three main database systems for our analyses.

a) The first system is called PartsList (Qian et al., 2001, NAR). It is principally orientated towards annotating one of the existing structural classifications of proteins, the SCOP scheme. The central metaphor for the annotation is that of "ranking" folds, finding the most common folds based on a variety of different metrics.
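The "ranking" metaphor can be sketched in a few lines. The example below is only an illustration of ranking folds under two simple metrics (total occurrences, and number of genomes containing the fold); it does not reflect how PartsList is implemented, and the function name `rank_folds`, the metric names, and the input layout are assumptions made for this sketch.

```python
from collections import Counter


def rank_folds(assignments, metric="occurrences"):
    """Rank folds from most to least common.

    assignments: dict mapping a genome name to a list of fold names,
                 one entry per protein matched to that fold (hypothetical layout).
    metric: "occurrences" counts every match; "genomes" counts each fold
            at most once per genome.
    Returns a list of (fold, count) pairs, highest count first, with ties
    broken alphabetically by fold name.
    """
    if metric == "occurrences":
        counts = Counter(f for folds in assignments.values() for f in folds)
    elif metric == "genomes":
        counts = Counter(f for folds in assignments.values() for f in set(folds))
    else:
        raise ValueError("unknown metric: " + metric)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The same fold can rank differently under different metrics, which is the point of offering several: a fold that is heavily duplicated in one genome may rank high by occurrences but low by the number of genomes in which it appears.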

b) The second system is SPINE (Bertone et al., in press, NAR). It is built as part of a large collaboration with the Northeast Structural Genomics Consortium (described above). It enables researchers who are part of this consortium to collect and rank targets for high-throughput structural genomics.

c) The third system, which we call GeneCensus, tabulates results related to genes and genomes (this is currently unpublished). Its central metaphor is a "tree" arranging genomes. The tree can be built based on various measures of relatedness, e.g. number of shared orthologs, number of shared folds, amino acid identity of individual orthologous proteins, overall genome composition, etc. These measures of relatedness occur at different levels: whole-genome, partial-proteome, and individual gene.
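As a rough illustration of building a genome tree from one such measure, the sketch below computes a Jaccard distance from shared folds and joins genomes by average-linkage clustering. This is a generic approach chosen for the example, not a description of the GeneCensus implementation; the function names and the input layout are assumptions.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus (shared folds / all folds) for two sets of fold names."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def build_tree(fold_sets):
    """Cluster genomes into a nested-tuple tree by average linkage.

    fold_sets: non-empty dict mapping a genome name to its set of folds
               (hypothetical layout).
    """
    # Each cluster is (subtree, list of genome names contained in it).
    clusters = [(name, [name]) for name in sorted(fold_sets)]

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        total = sum(jaccard_distance(fold_sets[x], fold_sets[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Swapping `jaccard_distance` for another measure (for instance one based on shared orthologs or on composition) changes the resulting tree without changing the clustering code, which mirrors the idea of building the tree from different measures of relatedness.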

All parts of the systems interact. So, for instance, it is possible to see how many genome matches there are for a particular structure and then to click on these matches and see the GeneCensus genome annotation for each of them. Also, each target in the construct database links into GeneCensus, which validates that its structure is not currently known, and all the solved structures from the consortium have annotation in PartsList.

Integrative databases are essential for all of this analysis. I have been called on to write a number of prominent surveys and opinions in this area, particularly concerning the question of how integrated databases will interface with the biological literature (Gerstein, 2000, Nat. Struc. Biol.).