C&EN: TODAY'S HEADLINES - BAYESIAN NETWORKS PREDICT PROTEIN INTERACTIONS

CELIA M. HENRY

Large-scale data sets of protein interactions can be very noisy, which leads to inaccuracies when scientists try to identify protein interactions on a genome-wide scale. And the data in the literature can be incomplete and contradictory. How, then, are scientists to construct a good genome-wide picture of protein interactions? If you're Yale University's Mark Gerstein and his colleagues at Yale and the University of Toronto, you use Bayesian networks to pull together disparate data sets into a single protein interaction map.

"Bayesian networks are statistical techniques for combining lots of sources of information in a way that optimally weights them according to their reliability and takes into account the degree to which the different sources of information are highly correlated and redundant," Gerstein says. Bayesian networks are built on probabilities.
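To make the idea of reliability-weighted combination concrete, here is a minimal sketch of the simplest special case, a naive Bayes combination, in which each evidence source contributes a likelihood ratio and the sources are assumed to be independent. That independence assumption is a simplification: the networks Gerstein describes also account for correlated, redundant sources. The function names and all numbers below are hypothetical and are not taken from the paper.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine evidence under a naive Bayes (independence) assumption.

    Posterior odds = prior odds * product of per-source likelihood ratios.
    A more reliable source has a likelihood ratio further from 1, so it
    moves the odds more.
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds in favor of an interaction into a probability."""
    return odds / (1.0 + odds)


# Hypothetical values, for illustration only.
prior = 1 / 600            # assumed prior odds that a random protein pair interacts
lrs = [20.0, 5.0, 12.0]    # assumed likelihood ratios from three evidence sources

odds = posterior_odds(prior, lrs)
prob = odds_to_probability(odds)
print(f"posterior odds = {odds:.2f}, probability = {prob:.2f}")
```

In this toy case, three individually weak sources together lift a pair with very low prior odds to posterior odds greater than 1, which is the basic reason integrating noisy data sets can pay off.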

Gerstein and his colleagues constructed two protein interaction networks for the yeast genome [Science, 302, 449 (2003)]. In the first network, they used information that might be only weakly associated with protein interactions to predict the interactions. Such data include mRNA expression correlations, biological function, and the necessity of particular proteins for survival. They call this network PIP, for "probabilistic interactome predicted."

In the second network, they combined four large-scale, high-throughput data sets of protein interactions from the literature. This network is known as PIE, for "probabilistic interactome experimental."

"PIP is trying to predict interactions from information that's not explicitly interaction information," Gerstein says, "whereas PIE is simply taking the existing but noisy interaction data sets and trying to integrate them to create an optimal experimental interactome."

In constructing the networks, each source of information is assessed by comparing it against a set of "gold standards" of known positive and negative protein interactions. The positives are taken from the Munich Information Center for Protein Sequences catalog of known protein complexes. The negative protein interactions include proteins that are known to be separated in different subcellular compartments.
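One common way to score an evidence source against gold standards is a likelihood ratio: the fraction of gold-standard positives that the source supports, divided by the fraction of gold-standard negatives that it supports. The sketch below illustrates that calculation under stated assumptions; it is not the authors' code, and the protein names, pair sets, and function names are all hypothetical.

```python
def pair(a, b):
    """Represent an unordered protein pair so (a, b) equals (b, a)."""
    return frozenset((a, b))


def likelihood_ratio(reported, gold_positives, gold_negatives):
    """Estimate P(reported | positive) / P(reported | negative).

    `reported` is the set of pairs an evidence source supports;
    the gold sets are known interacting and non-interacting pairs.
    """
    true_pos_rate = len(reported & gold_positives) / len(gold_positives)
    false_pos_rate = len(reported & gold_negatives) / len(gold_negatives)
    if false_pos_rate == 0:
        raise ValueError("no gold-standard negatives reported; ratio is undefined")
    return true_pos_rate / false_pos_rate


# Hypothetical gold standards: 4 known complexes, 8 separated pairs.
gold_pos = {pair("P1", "P2"), pair("P3", "P4"), pair("P5", "P6"), pair("P7", "P8")}
gold_neg = {pair("P1", "P9"), pair("P2", "P10"), pair("P3", "P11"), pair("P4", "P12"),
            pair("P5", "P13"), pair("P6", "P14"), pair("P7", "P15"), pair("P8", "P16")}

# Hypothetical evidence source: supports 3 positives and 1 negative.
source = {pair("P2", "P1"), pair("P3", "P4"), pair("P5", "P6"), pair("P1", "P9")}

lr = likelihood_ratio(source, gold_pos, gold_neg)
print(f"likelihood ratio = {lr:.1f}")
```

A ratio well above 1 means the source is much more likely to flag a true interaction than a false one; a ratio near 1 means it carries little information.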

When the two networks are compared to the gold standards, the predicted network turns out to be more accurate than the existing experimental data sets. "One of the main achievements of the paper is to predict protein interactions to a well-defined level of accuracy from noninteraction information and show that these predictions are essentially as accurate, if not more accurate, as directly getting the high-throughput interaction data," Gerstein explains.

The team verified some of the predictions using tandem affinity purification, a technique in which a bait protein is used to fish out interacting proteins. In three separate examples, they identified previously unknown protein interactions of the nucleosome, the replication complex, and Nsr1, a protein complex associated with RNA processing.

"It is well appreciated by now that protein interaction maps constructed from the variety of current experimental methods remain, unfortunately, fraught with ambiguities and uncertainties," says Douglas A. Lauffenburger, a bioengineering professor at the Massachusetts Institute of Technology who does computational modeling of biological systems. "This new work from the Yale investigators provides a helpful advance in providing a theoretical analysis by which the diverse kinds of experimental data can be most effectively filtered as well as integrated."