a. Specific Aims

The aims for this year were:

(1) to further develop a tracking database for the NESG consortium;

(2) to assist in target selection for the consortium;

(3) to integrate functional genomics information with the targets and protein structural data, in general.

b. Studies and Results

During the past year we made progress on the aims. In particular:

(1) SPINE database and datamining

Last year, Bertone et al. (2001) set up a working version of the tracking database, which we called SPINE. We described this in detail in last year's progress report, when the paper was in press.

SPINE tracks the considerable amount of information generated during functional characterization and structure determination for many proteins. Much of this information is highly heterogeneous, ranging from standardized Boolean values describing the expressibility or solubility of a construct to images of NMR and CD spectra. SPINE is available at nesg.org.

This year we continued work on the SPINE project and made a number of specific advances:

(i) We expanded SPINE's table structure and schema considerably beyond that in the published paper. We now allow multiple experimental records per target (e.g. for expression). We also integrated aggregation-screening data into SPINE, and the Columbia group began inputting this data.

(ii) We have redesigned the entire NESG website. We have integrated the lists of investigators and structures directly into SPINE, via queries. We have also built the website on an easy-to-modify Wiki platform, which allows pages to be edited easily and securely by a remote user.

(iii) We have also expanded the systematic datamining described in the SPINE paper to encompass the larger number of targets now in the database and various subsets of the PDB TargetDB with solubility data. (As of June 2002 there are 1596 SPINE targets and 3744 TargetDB targets with solubility data.) This allows us to create more accurate decision trees. We have added the presence of protein-protein interactions to the features used for the data mining and found that it is a very useful feature for predicting solubility.
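
As an illustration of how a candidate feature, such as the presence of protein-protein interactions, might be scored when building a decision tree, the following minimal sketch computes the information gain of binary features against a solubility label. It is not the code used with SPINE. The records, the feature names (has_interaction, long_chain), and the function names are hypothetical, and information gain is only one common splitting criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(records, feature, label="soluble"):
    """Entropy reduction obtained by splitting the records on one feature."""
    parent = entropy([r[label] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[feature], []).append(r[label])
    # Weighted average entropy of the subsets produced by the split.
    children = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return parent - children


def rank_features(records, features, label="soluble"):
    """Order candidate features from most to least informative."""
    return sorted(features,
                  key=lambda f: information_gain(records, f, label),
                  reverse=True)


# Hypothetical toy records; NOT taken from SPINE or TargetDB.
TARGETS = [
    {"has_interaction": True,  "long_chain": False, "soluble": True},
    {"has_interaction": True,  "long_chain": True,  "soluble": True},
    {"has_interaction": True,  "long_chain": False, "soluble": True},
    {"has_interaction": False, "long_chain": True,  "soluble": False},
    {"has_interaction": False, "long_chain": False, "soluble": False},
    {"has_interaction": False, "long_chain": True,  "soluble": True},
]

if __name__ == "__main__":
    for name in rank_features(TARGETS, ["has_interaction", "long_chain"]):
        print(name, round(information_gain(TARGETS, name), 3))
```

A decision-tree learner would apply such a score recursively, choosing the most informative feature at each node. On real data the ranking would of course depend on the actual targets and features.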

(2) Pseudogenes and Eukaryotic Model Organism Gene Annotation

We have also investigated some interesting strategies for selecting eukaryotic targets. Last year, Harrison et al. (2001) surveyed the pseudogenes in the C. elegans genome, an exercise in 'molecular archaeology'. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we found an estimated total of 2168 pseudogenes, about one for every eight genes. We showed that the population of pseudogenes differs significantly from that of genes in a number of respects.

This year we have significantly extended these studies in a number of papers (Harrison et al., 2002a,b; Harrison & Gerstein, in press; Echols et al., 2002). In particular, we have assigned pseudogenes to other model eukaryotic organisms, yeast, fly, and human, and analyzed these assignments in detail. We found that the pseudogenes in yeast were similar to those in the worm, having a duplicated character, occurring on the ends of chromosomes, and being drawn from families of stress-response proteins (Harrison et al., 2002b).

The human genome has a very different population of pseudogenes (Harrison et al., 2002a). Most of them are derived from "processing" -- the reverse-transcription of mRNA back into the genome. This gives rise to a population of pseudogenes dominated by the highly expressed ribosomal genes.

Over all genomes, the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, and the most common protein folds and families differ somewhat between genes and pseudogenes (Harrison & Gerstein, in press; Echols et al., 2002). In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are, in fact, a number of families associated with large populations of pseudogenes.

Our work with pseudogenes is relevant to target selection in two respects. First, identification of methods for detecting and validating pseudogenes helps us to more accurately assess the validity of each predicted eukaryotic gene, an essential first step in target selection. Second, we identified and explored a potentially interesting target selection strategy, prioritizing protein families by the number of pseudogenes that they were associated with. That is, such prioritized targets are associated with more dead proteins.
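
The second strategy can be sketched in a few lines: count the pseudogenes assigned to each family and order the families by that count. This is only an illustrative sketch. The family names, pseudogene identifiers, and function name below are hypothetical placeholders, not data or code from our analyses.

```python
from collections import Counter

# Hypothetical (pseudogene_id, family) assignments; names are placeholders.
PSEUDOGENE_FAMILIES = [
    ("pg1", "family_A"),
    ("pg2", "family_B"),
    ("pg3", "family_A"),
    ("pg4", "family_C"),
    ("pg5", "family_A"),
    ("pg6", "family_B"),
]


def rank_families_by_pseudogenes(assignments):
    """Return (family, pseudogene_count) pairs, largest count first."""
    counts = Counter(family for _, family in assignments)
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for family, n in rank_families_by_pseudogenes(PSEUDOGENE_FAMILIES):
        print(family, n)
```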

(3) Using PartsList for Integrating Functional Genomics Data with Folds

We have done a considerable amount of work integrating protein fold data with functional genomics information through setting up the PartsList system. This system acts as a general platform for this sort of integration.

We reported an initial version of the system in last year's report (Qian et al., 2001). This year we have extended the system in a number of important respects:

(i) We have shown that the occurrence of protein folds and families follows distinctive power-law statistics in many disparate genomic contexts (Luscombe et al., in press). That is, the frequency of many genomic attributes falls off with approximate power-law behavior (i.e. according to V^-b, for attribute value V and constant exponent b), with a few folds having large values and most having small values. This has important implications for the number of big families in relation to target selection and the degree to which one can achieve coverage of fold space.

(ii) We have used PartsList to identify the different characteristics of common, unique and horizontally transferred folds (Hegyi et al., 2002). We have also used the system to measure the degree to which fold and function are correlated, an important fact for calibrating annotation transfer on the basis of sequence similarity (Hegyi & Gerstein, 2001).

(iii) We have developed systems for integrating expression information with protein folds and families and measured the degree to which this information could be used to predict the known interactions in protein complexes (Jansen et al., in press; Gerstein et al., 2002; Greenbaum et al., 2001).
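
To make the power-law form in item (i) concrete, the sketch below estimates the exponent b in F = a * V^-b by ordinary least squares on log-transformed values. This is a simple approach chosen here for brevity and is not necessarily the fitting procedure used in Luscombe et al. The function name and the synthetic data are assumptions made purely for illustration.

```python
import math


def fit_power_law(values, frequencies):
    """Fit log F = log a - b * log V by least squares; return (a, b)."""
    xs = [math.log(v) for v in values]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -b; the intercept is log a.
    return math.exp(intercept), -slope


# Synthetic data generated from an exact power law (a = 1000, b = 1.5);
# these are NOT measurements from any genome.
VALUES = [1, 2, 4, 8, 16, 32]
FREQS = [1000.0 * v ** -1.5 for v in VALUES]

if __name__ == "__main__":
    a, b = fit_power_law(VALUES, FREQS)
    print("a =", a, "b =", b)
```

On a log-log plot such data fall on a straight line with slope -b. Real genomic counts are noisy, so a fit of this kind should be treated as approximate.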

c. Significance

We believe we were the first of the structural genomics centers to publish a tracking database and integrated datamining approach.

We believe our PartsList system will be a useful platform to help in the analysis of the results of the consortium.

We believe that our pseudogenes analysis gives a novel perspective on understanding large eukaryotic protein families -- the targets of the NESG consortium.

d. Plans

We plan to continue with the project as outlined in the original proposal. In particular, we will continue to maintain and expand the SPINE database system, which we hope to generalize into a broader system for proteomics.

We will also continue to build on our PartsList system and integrate more functional genomics information onto the various fold parts.

We are particularly enthusiastic about using SPINE and PartsList as platforms to integrate information relating to protein-protein interactions. We would like to give each determined structure an "interaction-context" that describes all the potential interactions it has in the genome, as determined by many possible approaches -- e.g. two-hybrid, expression correlations, and functional roles.

e. Publications

(1) Papers Published since last report

82. D Greenbaum, N M Luscombe, R Jansen, J Qian, M Gerstein.
"Interrelating Different Types of Whole-genome Data, from Proteome to Secretome: 'Oming in on Function."
Genome Research 11: 1463-1468 (2001).

89. H Hegyi, M Gerstein.
"Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-domain Proteins,"
Genome Research 11: 1632-1640 (2001).

97. P Harrison, H Hegyi, P Bertone, N Echols, T Johnson, S Balasubramanian, N Luscombe, M Gerstein
"Molecular fossils in the human genome: Identification and analysis of pseudogenes in chromosomes 21 and 22."
Genome Research 12: 272-280 (2002)

99. P Harrison, A Kumar, N Lan, N Echols, M Snyder, M Gerstein
"A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution."
J Mol Biol 316: 409-419 (2002)

102. H Hegyi, J Lin, D Greenbaum, M Gerstein.
"Structural Genomics Analysis: Phylogenetic Patterns of Unique, Shared, and Common Folds in 20 Genomes."
Proteins 47: 126-141 (2002)

93. M Gerstein, N Lan, R Jansen.
"Proteomics. Integrating interactomes."
Science 295: 284-287 (2002)

108. N Echols, P Harrison, P Bertone, S Balasubramanian, N Luscombe, Z Zhang, M Gerstein
"Comprehensive Analysis of Amino Acid and Nucleotide Composition in Eukaryotic Genomes, Comparing Genes and Pseudogenes."
Nuc. Acids Res. 30: 2515-2523 (2002)

104. P Harrison, M Gerstein
"Studying Genomes through the Aeons: Protein families, Pseudogenes, and Proteome Evolution."
J Mol Biol (in press)

113. R Jansen, N Lan, J Qian, M Gerstein
"Integration of genomic data to predict protein complexes in yeast."
J Struc. Func. Genomics (in press)

114. N Luscombe, J Qian, T Johnson, M Gerstein.
"Power-law behaviour applies to a wide variety of genomic properties,"
GenomeBiology (in press)

(2) Papers Listed in Last Year's report (for reference)

72. J Qian, B Stenger, C Wilson, J Lin, R Jansen, W Krebs, V Alexandrov, N Echols, S Teichmann, J Park, M Gerstein.
"PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information."
Nucleic Acids Res 29: 1750-64 (2001).

73. P Harrison, N Echols, M Gerstein.
"Digging for Dead Genes: An Analysis of the Characteristics of the Pseudogene Population in the C. elegans Genome."
Nuc. Acids Res. 29: 818-830 (2001).

76. P Bertone, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, A Edwards, C Arrowsmith, G Montelione, M Gerstein.
"SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics."
Nucleic Acids Res 29: 2884-98 (2001).

f. Project-generated Resources

From our websites, http://www.nesg.org and http://www.partslist.org, we make available:

- lists of the NESG targets and associated experimental data

- information about the ranking of each protein fold part with respect to many other attributes

We have also set up subsidiary websites devoted to pseudogenes (http://pseudogene.org) and integrating interactions (http://genecensus.org).