Science -- Gerstein 288 (5471): 1590b

Annotation of the Human Genome

The News article "Are sequencers ready to 'annotate' the human genome?" by Elizabeth Pennisi (special issue on the Drosophila Genome, 24 Mar., p. 2183) is especially timely and provocative. Pennisi mentions two ideas: a small group gathering at a centralized annotation jamboree, or a distributed, Web-based system that would allow anyone to contribute annotations with a "smart browser" that would merge all efforts. I favor the essence of the second proposal because it provides a more democratic and more "biological" approach to an all-important problem.

There is, however, a third approach for annotating the human genome (providing at least the putative start, stop, and structure of each gene) that is, in a sense, already extant: extend the capabilities of the biological science literature. The current journal system is decentralized, yet most research articles adhere to common standards that make them ideal for annotation: (i) Each article associates a bit of annotation with a distinct time and place and with specific, responsible parties. (ii) Attentive scholarly referencing and footnoting provide a way to connect bits of annotation and allow for continuous "updates." (iii) Peer review and editing provide a proven quality-control mechanism. (iv) Publication is an established indicator of scientific productivity; consequently, scientists already have an incentive to provide the information, whereas database submissions are often regarded as a chore.

The main drawback of current journal article formats is that they are not very "computer-parseable," nor are they suited to bulk annotation of thousands of genes. However, by adding sections of highly structured text to each article (that is, extended keywords and a controlled vocabulary) and linking subparts of an article to relevant database identifiers, one can envision how a "literature annotation standard" could readily be interpreted by computers. Furthermore, if an article could be linked to a large "supplementary materials" data file with simple annotations for many genes (for example, a list of all the membrane proteins in the Caenorhabditis elegans genome), one would have a mechanism for bulk annotation. Further standardization could be achieved if the article described defined ways in which the data file might be updated over time and if the supplementary materials were refereed and evaluated with the text of the article.
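To make the idea of a computer-parseable supplementary file concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not part of any existing standard: the tab-separated layout, the field names, the `parse_supplement` function, the three vocabulary terms, and the placeholder identifiers. It shows only how each row could tie a database identifier and putative gene coordinates to a controlled-vocabulary keyword and to the article responsible for the claim.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical controlled vocabulary; a real standard would define its own terms.
CONTROLLED_VOCABULARY = {"membrane protein", "kinase", "transcription factor"}


@dataclass(frozen=True)
class GeneAnnotation:
    article_ref: str   # the article that is responsible for this annotation
    database_id: str   # identifier of the gene in an external database
    start: int         # putative start coordinate
    stop: int          # putative stop coordinate
    keyword: str       # term taken from the controlled vocabulary


def parse_supplement(text: str, article_ref: str) -> list[GeneAnnotation]:
    """Parse tab-separated rows of: database_id, start, stop, keyword."""
    annotations = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        database_id, start, stop, keyword = row
        if keyword not in CONTROLLED_VOCABULARY:
            raise ValueError(f"keyword not in controlled vocabulary: {keyword!r}")
        annotations.append(
            GeneAnnotation(article_ref, database_id, int(start), int(stop), keyword)
        )
    return annotations


if __name__ == "__main__":
    # Placeholder identifiers and coordinates, invented for this example.
    sample = "GENE0001\t100\t900\tmembrane protein\nGENE0002\t1500\t2400\tkinase\n"
    for annotation in parse_supplement(sample, "EXAMPLE-ARTICLE-1"):
        print(annotation)
```

Rejecting keywords outside the vocabulary is one way such a file could be checked mechanically during refereeing; a real standard might instead map synonyms onto accepted terms.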

Mark Gerstein
Department of Molecular Biophysics and Biochemistry,
Yale University,
New Haven, CT 06520-8114, USA.
E-mail: mark.gerstein@yale.edu
