CELIA M. HENRY
Large-scale data sets of protein interactions
can be very noisy, which leads to inaccuracies when researchers try to
identify protein interactions on a genome-wide scale. The data
in the literature, meanwhile, can be incomplete and contradictory. How, then,
are scientists to construct a good genome-wide picture of protein
interactions? If you're Yale University's Mark Gerstein and his
colleagues at Yale and the University of Toronto, you use Bayesian
networks to pull together disparate data sets into a single protein
interaction map.
"Bayesian networks are statistical techniques for combining lots
of sources of information in a way that optimally weights them
according to their reliability and takes into account the degree to
which the different sources of information are highly correlated and
redundant," Gerstein says. Bayesian networks are built on
probabilities.
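As a rough illustration of the probabilistic bookkeeping, the simplest case (often called naive Bayes) treats the evidence sources as independent and multiplies the prior odds of an interaction by one likelihood ratio per source. The sketch below assumes that independence, which a full Bayesian network does not require, and every number and function name in it is hypothetical rather than taken from the Science paper.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayes update: prior odds times the product of likelihood ratios.

    Assumes the evidence sources are conditionally independent. Correlated
    sources would need a fuller Bayesian network than this sketch.
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
prior = 1 / 600             # assumed prior odds that a random protein pair interacts
ratios = [20.0, 15.0, 4.0]  # assumed likelihood ratios from three evidence sources

odds = posterior_odds(prior, ratios)
prob = odds_to_probability(odds)
print(f"posterior odds: {odds:.2f}, probability: {prob:.2f}")
```

A source with a likelihood ratio near 1 barely moves the odds, which is how unreliable evidence ends up with little weight in the combined estimate.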
Gerstein and his colleagues constructed two protein interaction
networks for the yeast genome [Science, 302, 449
(2003)]. In the first network, they used information that might be
only weakly associated with protein interactions to predict the
interactions. Such data include mRNA expression correlations,
biological function, and the necessity of particular proteins for
survival. They call this network PIP, for "probabilistic interactome
predicted."
In the second network, they combined four large-scale,
high-throughput data sets of protein interactions from the
literature. This network is known as PIE, for "probabilistic
interactome experimental."
"PIP is trying to predict interactions from information that's
not explicitly interaction information," Gerstein says, "whereas PIE
is simply taking the existing but noisy interaction data sets and
trying to integrate them to create an optimal experimental
interactome."
In constructing the networks, each source of information is
assessed by comparing it against a set of "gold standards" of known
positive and negative protein interactions. The positives are taken
from the Munich Information Center for
Protein Sequences catalog of known protein complexes. The
negative protein interactions include proteins that are known to be
separated in different subcellular compartments.
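One common way to score a source against such gold standards is a likelihood ratio: the fraction of gold-standard positives the source reports divided by the fraction of gold-standard negatives it reports. The Python sketch below illustrates that idea on toy data; the protein names, set sizes, and the likelihood_ratio helper are assumptions for illustration, not the team's actual procedure.

```python
def pair(a, b):
    """Represent an unordered protein pair so that (a, b) equals (b, a)."""
    return frozenset((a, b))


def likelihood_ratio(reported, gold_positives, gold_negatives):
    """Estimate P(reported | interacting) / P(reported | not interacting)."""
    true_positive_rate = len(reported & gold_positives) / len(gold_positives)
    false_positive_rate = len(reported & gold_negatives) / len(gold_negatives)
    if false_positive_rate == 0:
        # No gold-standard negatives reported: the ratio is unbounded here.
        return float("inf")
    return true_positive_rate / false_positive_rate


# Toy gold standards with made-up protein names.
gold_positives = {pair("A", "B"), pair("C", "D"), pair("E", "F"), pair("G", "H")}
gold_negatives = {pair(p, "X") for p in "ABCDEFGH"}

# A hypothetical data source reporting two true pairs, one false pair,
# and one pair that appears in neither gold standard.
reported = {pair("A", "B"), pair("C", "D"), pair("A", "X"), pair("Q", "R")}

lr = likelihood_ratio(reported, gold_positives, gold_negatives)
print(f"likelihood ratio: {lr}")
```

Pairs that appear in neither gold standard, such as Q and R here, do not affect the score; they are the pairs whose status the method is meant to predict.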
When the two networks are compared to the gold standards, the
predicted network turns out to be more accurate than the existing
experimental data sets. "One of the main achievements of the paper
is to predict protein interactions to a well-defined level of
accuracy from noninteraction information and show that these
predictions are essentially as accurate, if not more accurate, as
directly getting the high-throughput interaction data," Gerstein
explains.
The team verified some of the predictions using tandem affinity
purification, a technique in which a bait protein is used to fish
out interacting proteins. In three separate examples, they
identified previously unknown protein interactions of the
nucleosome, the replication complex, and Nsr1, a protein complex
associated with RNA processing.
"It is well appreciated by now that protein interaction maps
constructed from the variety of current experimental methods remain,
unfortunately, fraught with ambiguities and uncertainties," says
Douglas A. Lauffenburger, a bioengineering professor at the Massachusetts Institute of
Technology who does computational modeling of biological
systems. "This new work from the Yale investigators provides a
helpful advance in providing a theoretical analysis by which the
diverse kinds of experimental data can be most effectively
filtered as well as integrated."