Year 1

Work completed

As I mapped out in my original proposal, most of my work in the first year went into scaling up my comparative approach to larger numbers of genomes and developing the database infrastructure to handle them. I have begun work on a standard dataset of 20 genomes: 18 prokaryotes plus yeast and the worm. Accommodating the worm, and eventually, I hope, the fly, in this framework poses many new challenges.

We have built three main database systems for this analysis.

(1) The first, which we call GeneCensus, tabulates results related to genomes and genes. Its central metaphor is a "tree" arranging genomes. The tree can be built from various measures of relatedness, e.g. number of shared orthologs, number of shared folds, amino acid identity of individual orthologous proteins, overall genome composition, etc. These measures of relatedness apply at different levels: whole genome, partial proteome, and individual selected gene.

(2) The second system is called PartsList. It is principally oriented towards annotating one of the existing structural classifications of proteins, the scop scheme. The central metaphor for the annotation is that of "ranking" folds: finding the most common folds according to a variety of different metrics.

(3) The third system is a construct database. It is built as part of a large collaboration with the Northeast Structural Genomics Consortium. It enables researchers in this consortium to collect and rank targets for high-throughput structural genomics.

All parts of the system interact. For instance, it is possible to see how many genome matches there are for a particular structure and then to click on these matches and see the GeneCensus genome annotation for each of them. Also, each target in the construct database links into GeneCensus, which validates that its structure is not currently known, and all the solved structures from the consortium have annotation in the PartsList database.
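To make the "ranking" and "tree" metaphors concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the actual GeneCensus or PartsList code: the genome names, fold identifiers, and function names are hypothetical, and the two ranking metrics and the Jaccard-style distance are simple stand-ins for the kinds of measures described above.

```python
from collections import Counter

# Hypothetical toy data: genome name -> fold assigned to each of its genes.
fold_assignments = {
    "genome_A": ["fold_1", "fold_1", "fold_2", "fold_3"],
    "genome_B": ["fold_1", "fold_2", "fold_2"],
    "genome_C": ["fold_1", "fold_3"],
}


def _ranked(counts):
    # Sort by descending count; break ties alphabetically so output is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def rank_by_total_occurrence(assignments):
    """Rank folds by how many genes carry them, summed over all genomes."""
    counts = Counter()
    for folds in assignments.values():
        counts.update(folds)
    return _ranked(counts)


def rank_by_genome_coverage(assignments):
    """Rank folds by the number of genomes in which they appear at least once."""
    counts = Counter()
    for folds in assignments.values():
        counts.update(set(folds))
    return _ranked(counts)


def shared_fold_distance(assignments, genome_x, genome_y):
    """Assumed relatedness measure: 1 minus the fraction of folds shared
    (Jaccard distance on fold sets). A matrix of such distances could be
    passed to any standard tree-building method."""
    folds_x = set(assignments[genome_x])
    folds_y = set(assignments[genome_y])
    return 1.0 - len(folds_x & folds_y) / len(folds_x | folds_y)


if __name__ == "__main__":
    print(rank_by_total_occurrence(fold_assignments))
    print(rank_by_genome_coverage(fold_assignments))
    print(shared_fold_distance(fold_assignments, "genome_A", "genome_B"))
```

The two rankings can disagree: a fold that is heavily duplicated in one genome may score high on total occurrence but low on genome coverage, which is why ranking by several metrics is informative.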

Specific Highlights

Some of the results from our database system have been published: