NBC year 1 progress report

6. Progress Report Summary - Biomedical Informatics Core

1.0 A tool to rapidly characterize pathogen genomes in a high-throughput pipeline

PI: Mark Gerstein (Yale)

Milestones:
Predicting phenotypes and pathogenicity from known genomes
Gerstein and Lussier have invested a significant amount of time in building a prediction system based on Clusters of Orthologous Groups (COGs) and phenotypes (GIDEON database). This work also illustrates that the informatics work products increasingly come from closely intertwined collaborations between research groups at different institutions.

Predicting Essential Genes in S. cerevisiae
We have integrated more than fifteen genome-scale characteristics of S. cerevisiae and generated a modest predictive framework. Within three months, we aim to have a robust predictive system capable of correctly identifying essential genes 80% of the time, based solely on genomic and sequence data. This system will then be applied to bacterial systems such as E. coli and to pathogenic eukaryotic microbes such as C. albicans. Within six months, we aim to have a preliminary predictive system running on these two organisms.
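The 80% target can be read in more than one way. The following is a minimal sketch, assuming predictions are scored against a set of experimentally known essential genes, of the two usual readings: precision (the share of predicted genes that are truly essential) and recall (the share of truly essential genes that are predicted). The function name and the gene labels are illustrative placeholders, not part of the project's system.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for predicted essential genes
    scored against experimentally known essential genes."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical gene labels, used only to exercise the function.
predicted_essential = {"geneA", "geneB", "geneC", "geneD", "geneE"}
known_essential = {"geneA", "geneB", "geneC", "geneD", "geneF"}

precision, recall = precision_recall(predicted_essential, known_essential)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers avoids ambiguity, since a system can reach high precision by predicting very few genes, or high recall by predicting nearly all of them.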

Predicting Pathogen Pseudogenes

· Identification and characterization of pseudogenes in bacterial pathogens and other prokaryotes on a genomic scale.

· A broad range of bacterial genomes, including four agents on the CDC’s Category A, B and C lists (Escherichia coli O157:H7, Vibrio cholerae, Brucella melitensis, Yersinia pestis) and many other pathogens, has been selected along with archaeal genomes. A comprehensive method has been developed to perform genome-wide analysis and identification of pseudogenes.

· About 7,000 pseudogenes have been identified in the 64 genomes studied. Pseudogenes account for at least 1 to 5% of all gene-like sequences in prokaryote genomes. The pseudogenes have been classified by functional category. Although many large populations of pseudogenes arise from large, diverse protein families (for example, the ABC transporters), notable numbers of pseudogenes are associated with specific families that are not as widely distributed. These include the cytochrome P450 and PPE families (PF00067 and PF00823) and others that have a direct role in DNA transposition.

· We also demonstrated that a large fraction of prokaryote pseudogenes arose from failed horizontal transfer events. In particular, pseudogenes are more than twice as likely as genes to have the anomalous codon usage associated with horizontal transfer. Moreover, we found a significant difference in the number of horizontally transferred pseudogenes between the pathogenic (O157:H7) and non-pathogenic (K-12) strains of Escherichia coli.
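The codon usage comparison in the last point can be sketched as follows. This summary does not specify how anomalous codon usage was scored, so the sketch assumes a simple stand-in: the sum of absolute differences between a sequence's codon frequencies and a genome-wide reference profile, with a hypothetical cutoff. The function names, the cutoff, and the toy sequences are illustrative only.

```python
from collections import Counter


def codon_frequencies(seq):
    """Relative frequency of each codon in an in-frame DNA sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(seq, reference):
    """Sum of absolute differences between the codon frequencies of seq
    and a reference profile (assumed measure, not the report's own)."""
    freqs = codon_frequencies(seq)
    codons = set(freqs) | set(reference)
    return sum(abs(freqs.get(c, 0.0) - reference.get(c, 0.0)) for c in codons)


def anomalous_fraction(seqs, reference, threshold):
    """Fraction of sequences whose distance exceeds a hypothetical cutoff."""
    if not seqs:
        return 0.0
    flagged = sum(1 for s in seqs if usage_distance(s, reference) > threshold)
    return flagged / len(seqs)


# Toy data: the reference stands in for genome-wide codon usage.
reference = codon_frequencies("GCTGCTGCTGAA")
genes = ["GCTGCTGCTGAA", "GCTGCTGAAGCT"]
pseudogenes = ["GCTGCTGCTGAA", "GCGGCGGCGGCG"]

print(anomalous_fraction(genes, reference, threshold=1.0))
print(anomalous_fraction(pseudogenes, reference, threshold=1.0))
```

Dividing the pseudogene fraction by the gene fraction gives the kind of enrichment ratio described above. A real analysis would need a carefully chosen measure and cutoff, and sequences long enough for stable frequency estimates.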

Publications:
· Liu Y, Harrison PM, Kunin V, Gerstein M. Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol 2004; 5: R64.

· Yu H, Greenbaum D, Xin Lu H, Zhu X, Gerstein M. Genomic analysis of essentiality within protein networks. Trends Genet 2004; 20: 227-31. Review.