a. Specific Aims

The aims for this year were:

(1a) to further develop a tracking database for the NESG consortium;

(1b) to perform datamining on this database;

(2) to integrate functional genomics information with the targets and with protein structural data more generally;

(3) to assist in target selection for the consortium.

b. Studies and Results

During the past year we made progress on the aims. In particular:

(1) SPINE database, datamining, and ontologies

Our web-based SPINE database is our core technology for integrating the activities of the NESG Consortium, with links to and from each of our public-domain targets to worldwide databases of genomic, functional, and structural information. It is sophisticated yet easy to use. SPINE tracks the considerable amount of information generated on the progress of functional characterization and structure determination for many proteins. Much of this information is highly heterogeneous, ranging from standardized Boolean values describing the expressibility or solubility of a construct to images of NMR and CD spectra. SPINE is available at spine.nesg.org.

However, in its current form SPINE does not fully meet the broad needs of laboratory information management encountered in such a large-scale, multi-institute project. It is a tremendous challenge, and very expensive, to develop a system sufficiently powerful, yet flexible and robust, to meet all the needs we encounter in laboratory information management.

This year we continued work on the SPINE project and made a number of specific advances:

(i) We expanded SPINE's table structure and schema (Goh et al., 2003). SPINE 2 now tracks progress down to the level of individual tubes.

(ii) We have generalized the SPINE schema toward systematic descriptions, or ontologies, of protein properties (Lan et al., 2002; Lan et al., 2003).

(iii) We have continued to redesign the entire NESG website (www.nesg.org). One special feature of the updated site is its use of wiki technology, which allows interactive editing of the website under password control. Authorized users can edit pages, create new pages, create links, etc. without having to post modified pages on the web server, which is maintained at Yale University. The website is maintained by B. Klein (Admin Asst., Rutgers) under the direction of M. Gerstein.

The website is integrated with the SPINE database and includes a structure gallery and facilities for tracking publications and integrating mailing list correspondence with specific targets.

(iv) We have performed datamining on the results in SPINE and across the whole of the NIH TargetDB (Gerstein et al., 2003; Savchenko et al., 2003).

(2 and 3) Structural Genomics Analysis and Protein-Protein Interactions

This year we published a paper on structural genomics analysis that identified the characteristics of the unique folds in various prokaryotic genomes (folds that did not occur in other genomes) and also the characteristics of common folds (Hegyi et al., 2003). We found in particular that common folds tend to have a more symmetrical structure than unique folds.

We also integrated structural information with protein-protein interactions, using data on protein complexes to assess the reliability of large-scale genomic protein-protein interaction sets (Edwards et al., 2002). This work was done in collaboration with the team in Toronto. For this we employed a number of simple graph-analysis and Bayesian algorithms.
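To illustrate the general idea of scoring an interaction set against known complexes, the following Python sketch computes a simple likelihood ratio. It is only an illustrative sketch and not the method of the cited paper: the function names, the gold-standard sets, and the toy protein pairs are all hypothetical.

```python
def pair(a, b):
    """Unordered protein pair (hypothetical identifiers)."""
    return frozenset((a, b))


def likelihood_ratio(predicted, positives, negatives):
    """Ratio of the fraction of gold-standard positives recovered by a
    data set to the fraction of gold-standard negatives it reports.
    Values well above 1 suggest a more reliable data set."""
    tp_rate = len(predicted & positives) / len(positives)
    fp_rate = len(predicted & negatives) / len(negatives)
    if fp_rate == 0:
        return float("inf")
    return tp_rate / fp_rate


# Assumed gold standards: pairs in the same known complex (positives)
# and pairs assumed not to interact (negatives). Toy data only.
positives = {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("D", "E")}
negatives = {pair("A", "D"), pair("B", "E"), pair("C", "F"), pair("A", "F")}

# A hypothetical large-scale interaction data set.
two_hybrid = {pair("A", "B"), pair("D", "E"), pair("A", "D")}

lr = likelihood_ratio(two_hybrid, positives, negatives)
print(lr)
```

Under a naive independence assumption, ratios of this kind from several data sets could be multiplied to give a combined score for each pair.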

(4) Pseudogenes and Eukaryotic Model Organism Gene Annotation

We have also investigated some interesting strategies for selecting eukaryotic targets. Previously, Harrison et al. surveyed the pseudogenes in the C. elegans genome, an exercise in 'molecular archaeology'. This year we have significantly extended these studies in a number of papers. Harrison & Gerstein (2002) is a broad survey of pseudogenes and protein families. Zhang et al. (2002) expands the discussion of pseudogenes to pseudomotifs.

(Please note that only a small part of our pseudogenes work is funded through NESG. Most is associated with other grants to the Gerstein lab.)

Our work with pseudogenes is relevant to target selection in two respects. First, identifying methods for detecting and validating pseudogenes helps us to assess more accurately the validity of each predicted eukaryotic gene, an essential first step in target selection. Second, we identified and explored a potentially interesting target selection strategy: prioritizing protein families by the number of pseudogenes associated with them. That is, the highest-priority targets are those from families associated with the most 'dead' proteins.
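The second strategy can be sketched in a few lines of Python. This is a minimal sketch under assumed inputs: the pseudogene-to-family mapping, the identifiers, and the function name are hypothetical and do not reflect the actual NESG target lists.

```python
from collections import Counter


def rank_families_by_pseudogenes(pseudogene_to_family):
    """Return (family, pseudogene count) pairs, highest count first.
    Ties are broken alphabetically by family name."""
    counts = Counter(pseudogene_to_family.values())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical assignment of detected pseudogenes to protein families.
assignments = {
    "ps1": "famA",
    "ps2": "famB",
    "ps3": "famA",
    "ps4": "famC",
    "ps5": "famA",
    "ps6": "famB",
}

ranking = rank_families_by_pseudogenes(assignments)
for family, n_pseudogenes in ranking:
    print(family, n_pseudogenes)
```

Families at the top of the ranking would be considered first when choosing targets.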

c. Significance

We believe we were the first of the structural genomics centers to publish a tracking database and an integrated datamining approach, and that our approach to this problem remains at the cutting edge.

We believe that our pseudogenes analysis gives a novel perspective on understanding large eukaryotic protein families -- the targets of the NESG consortium.

d. Plans

We plan to continue with the project as outlined in the original proposal. In particular, we will continue to maintain and expand the SPINE database system, which we hope to generalize into a broader system for proteomics.

We will also continue to build on our PartsList system (published in 2001) and integrate more functional genomics information onto the various fold parts.

We are particularly enthusiastic about using SPINE and PartsList as platforms to integrate information relating to protein-protein interactions. We would like to give each determined structure an "interaction-context" that describes all the potential interactions it has in the genome, as determined by many possible approaches -- e.g. two-hybrid, expression correlations, and functional roles.
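One possible shape for such an "interaction-context" record is sketched below in Python. The class, its fields, and the identifiers are hypothetical illustrations and are not part of SPINE or PartsList.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InteractionContext:
    """Potential interaction partners of one structure, with the
    kinds of evidence supporting each partner (hypothetical design)."""
    target_id: str
    evidence: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, partner, method):
        """Record that `method` suggests an interaction with `partner`."""
        self.evidence[partner].add(method)

    def partners_supported_by(self, min_methods):
        """Partners backed by at least `min_methods` distinct methods."""
        return sorted(
            p for p, methods in self.evidence.items()
            if len(methods) >= min_methods
        )


# Toy usage with made-up identifiers.
ctx = InteractionContext("TARGET1")
ctx.add("P1", "two-hybrid")
ctx.add("P1", "expression correlation")
ctx.add("P2", "functional role")
print(ctx.partners_supported_by(2))
```

Keeping the evidence types separate for each partner makes it possible to filter for interactions supported by more than one approach.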

An important aspect of dynamic target prioritization involves using function and/or functional genomics data to influence our target selection. The Gerstein lab is working on functional annotation of NESG targets, and in some cases we will be able to use this information to group individual cluster members based on possible function and functional differences. Looking to the future, the NESG project will need to evolve towards structural studies of sets of interacting proteins and protein-protein complexes. The Gerstein lab will intensify efforts to integrate functional genomics data on protein-protein interactions into NESG target selection criteria, and to use these data as a means of characterizing functional implications for NESG protein structures.

Next year, we would also like to develop a structure analysis pipeline in collaboration with the groups at Columbia that will automatically annotate many of the structures produced by the NESG project.

e. Publications

(published in 2003, or in 2002 but not included in last year's report)

(1) SPINE Database, Datamining and Ontologies for Proteomics

Strategies for structural proteomics of prokaryotes: Quantifying the advantages of studying orthologous proteins and of using both NMR and X-ray crystallography approaches.

A Savchenko, A Yee, A Khachatryan, T Skarina, E Evdokimova, M Pavlova, A Semesi, J Northey, S Beasley, N Lan, R Das, M Gerstein, CH Arrowsmith, AM Edwards (2003) Proteins 50: 392-9.

SPINE 2: a system for collaborative structural proteomics within a federated database framework.

CS Goh, N Lan, N Echols, SM Douglas, D Milburn, P Bertone, R Xiao, LC Ma, D Zheng, Z Wunderlich, T Acton, GT Montelione, M Gerstein (2003) Nucleic Acids Res 31: 2833-8.

Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level.

N Lan, GT Montelione, M Gerstein (2003) Curr Opin Chem Biol 7: 44-54.

Structural genomics: current progress.

M Gerstein, A Edwards, CH Arrowsmith, GT Montelione (2003) Science 299: 1663.

Towards a systematic definition of protein function that scales to the genome level: Defining function in terms of interactions.

N Lan, R Jansen, M Gerstein (2002) Proceedings of the IEEE 90: 1848-1858.