a. Specific Aims

The aims for this year were: (i) to set up a working tracking database for the NESG consortium; (ii) to assist in target selection for the consortium; and (iii) to integrate functional genomics information with the targets and, more generally, with protein structural data.

b. Studies and Results

During the past year we made progress on the aims. In particular:

(i) Bertone et al. (2001) set up a working version of the tracking database, which we call SPINE. SPINE tracks the considerable amounts of information on the progress of functional characterization and structure determination for many proteins. To accommodate the needs of the various Consortium members, investigators from several labs were involved in selecting the most appropriate information to be tracked by the system. Much of this information is highly heterogeneous, ranging from standardized Boolean values describing the expressibility or solubility of a construct to images of NMR and CD spectra.

SPINE is available at spine.mbb.yale.edu or nesg.org. The database is specifically designed to enable distributed scientific collaboration via the Internet. It features an intuitive user interface for interactive retrieval and modification of expression construct data, query forms designed to track global project progress, and external links to many other connected resources. SPINE also has a BLAST interface allowing it to be searched by sequence, and it can dump most of its contents in standard XML or mmCIF formats. Currently, the database contains information on ~2500 active expression constructs and many thousands more potential targets, which are drawn from a variety of organisms including M. thermoautotrophicum, M. genitalium, S. cerevisiae, and C. elegans. It has already been used to select targets and store proteomics data for two published studies (Balasubramanian et al., 2000; Christendat et al., 2000a,b) and was used as a model for the NIH-sponsored target registry at the recent Airlie House meeting in Washington.

(ii) We have also investigated some interesting strategies for selecting eukaryotic, especially worm, targets. Harrison et al. (2001) surveyed the pseudogenes in the C. elegans genome, an exercise in 'molecular archaeology'. Corresponding to the 18,576 annotated proteins in the worm (i.e., in Wormpep18), we found an estimated total of 2,168 pseudogenes, about one for every eight genes. Few of these appear to be processed. We found that the population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes: whereas the most common fold found in the worm proteome is the immunoglobulin fold, the most common 'pseudofold' is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes.

This is relevant to target selection in two respects. First, identifying methods for detecting and validating pseudogenes helps us to assess more accurately the validity of each predicted worm gene, an essential first step in target selection. Second, we identified and explored a potentially interesting target selection strategy: prioritizing worm protein families by the number of pseudogenes associated with them. That is, the prioritized targets come from families associated with more dead proteins.
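The prioritization idea described above can be illustrated with a minimal sketch. It assumes each pseudogene has already been assigned to a protein family; the function name `prioritize_families` and the family labels are hypothetical and are not taken from our pipeline.

```python
from collections import Counter

def prioritize_families(pseudogene_families):
    """Rank families by their number of associated pseudogenes, most first.

    `pseudogene_families` holds one entry per pseudogene, naming the family
    that pseudogene was assigned to (an assumed input format).
    """
    counts = Counter(pseudogene_families)
    # Sort by descending count, then by family name so ties are reproducible.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical assignments, for illustration only.
example = ["famA", "famB", "famA", "famC", "famA", "famB"]
ranking = prioritize_families(example)
```

Here `ranking` lists the families in priority order together with their pseudogene counts; any real use would also need the pseudogene-to-family assignment step, which this sketch takes as given.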

(iii) We have done a considerable amount of work integrating protein fold data with functional genomics information through setting up the PartsList system. This system acts as a general platform for this sort of integration. In particular, Qian et al. (2001) developed a new resource that lets one dynamically perform comparative fold surveys. PartsList is based on the existing fold classifications and functions as a form of companion annotation for them, providing 'global views' of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein-protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein-protein interaction data with structural information is a particularly novel feature of our system.
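The idea of comparison through ranking can be sketched as follows. This is an illustration only, not the PartsList implementation: the fold labels, attribute names and values are hypothetical, and the tie-handling rule (tied folds share the best rank) is an assumption.

```python
def rank_by_attribute(values, descending=True):
    """Return {fold: rank} with rank 1 as the top fold.

    Tied values share the best available rank (e.g. ranks 1, 2, 2, 4).
    """
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=descending)
    ranks = {}
    prev_value, prev_rank = None, 0
    for position, (fold, value) in enumerate(ordered, start=1):
        if value != prev_value:
            # A new value starts a new rank at the current position.
            prev_rank = position
            prev_value = value
        ranks[fold] = prev_rank
    return ranks

# Hypothetical attribute values for four placeholder folds.
attributes = {
    "genome_occurrence": {"fold_A": 120, "fold_B": 45, "fold_C": 45, "fold_D": 3},
    "expression_level": {"fold_A": 2.1, "fold_B": 9.7, "fold_C": 0.4, "fold_D": 5.0},
}

# One rank table per attribute, so very different quantities become comparable.
rank_table = {name: rank_by_attribute(vals) for name, vals in attributes.items()}
```

Converting every attribute to ranks puts counts, expression levels and other quantities on the same numerical footing, which is what makes side-by-side comparison of a fold across attributes possible.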

We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V^(-b), for attribute value V and constant exponent b), with a few folds having large values and most having small values.
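One simple way to estimate the exponent b in such a power-law falloff is a least-squares line fit in log-log space, sketched below. This is an assumed method chosen for illustration, not necessarily the procedure used in PartsList; the function name and the synthetic data are hypothetical, and log-log regression is known to be a rough estimator on noisy real data.

```python
import math

def fit_power_law(values, frequencies):
    """Fit frequency = a * V**(-b) by least squares on log-transformed data.

    Returns (a, b). All inputs must be strictly positive.
    """
    xs = [math.log(v) for v in values]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # slope of log(freq) against log(V)
    intercept = mean_y - slope * mean_x   # equals log(a)
    return math.exp(intercept), -slope    # b is the negated slope

# Synthetic data generated from frequency = 100 * V**(-2), for illustration only.
values = [1, 2, 4, 8, 16]
frequencies = [100 * v ** -2.0 for v in values]
a, b = fit_power_law(values, frequencies)
```

On this noise-free synthetic input the fit recovers the generating parameters up to floating-point error; on real attribute data the estimate would only be approximate.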

c. Significance

We believe we are the first of the structural genomics centers to publish a tracking database and integrated data-mining approach.

We believe our PartsList system will be a useful platform to help in the analysis of the results of the consortium.

We believe that our pseudogene analysis gives a novel perspective on prioritizing large families.

d. Plans

We plan to continue with the project as outlined in the original proposal. In particular, we will keep maintaining and expanding the SPINE database system, which we hope to generalize into a general-purpose system for proteomics. We will also continue to build on our PartsList system and integrate more functional genomics information onto the various fold parts. We are especially enthusiastic about using this to integrate information relating to protein-protein interactions. We would like to give each determined structure an "interaction-context" that describes all the potential interactions it has in the genome, as determined by many possible approaches (e.g. two-hybrid, expression correlations, and functional roles).

e. Publications

P Bertone, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, A Edwards, C Arrowsmith, G Montelione, M Gerstein (2001).
"SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics."
Nucleic Acids Res (in press).

J Qian, B Stenger, C Wilson, J Lin, R Jansen, W Krebs, V Alexandrov, N Echols, S Teichmann, J Park, M Gerstein (2001).
"PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information."
Nucleic Acids Res 29: 1750-64.

P Harrison, N Echols, M Gerstein (2001).
"Digging for Dead Genes: An Analysis of the Characteristics of the Pseudogene Population in the C. elegans Genome."
Nucleic Acids Res 29: 818-30.

f. Project-generated Resources

From our websites, http://www.nesg.org and http://www.partslist.org, we make available:

- lists of the NESG targets and associated experimental data

- information about the ranking of each protein fold part with respect to many other attributes