F.A.Q.

If you have additional questions that are not covered in this FAQ, please submit them to Ronald Jansen.


Is there any way to download the complete set of complex data at cutoff 600?

The entire network should be available from genecensus.org/intint. If you go to http://bioinfo.mbb.yale.edu/genome/intint/supplementary.htm, there are several links under "probabilistic interactomes" that contain gzipped archives of the data (the most complete being the "PIT" data). When you unzip and unpack the PIT archive, you will find one file for each yeast protein. Each of these files contains four columns: a partner protein in the first column, followed by three likelihood ratios for this protein pair (i.e., the pair formed by the protein after which the file is named and the partner protein) -- first for the PIE data, then for the PIP data and finally, in the last column, for the PIT data. These data cover all (or most) of the possible protein pairs in yeast over the full range of likelihood ratios, so you can use them to filter out the protein pairs whose likelihood ratios exceed a given cutoff.
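
As an illustration, here is a minimal Python sketch of such a filter. It assumes the unpacked PIT archive sits in a directory named "PIT", that the columns are whitespace-separated, and that the last column holds the PIT likelihood ratio; the directory name and the function name are just examples and are not part of the distributed archive.

    import os

    def pairs_above_cutoff(directory="PIT", cutoff=600.0):
        """Yield (protein1, protein2, L_PIT) for pairs whose PIT likelihood
        ratio is at least the cutoff. Each pair appears twice, once in the
        file of each partner protein."""
        for filename in os.listdir(directory):
            protein1 = filename  # the file is named after the first protein
            with open(os.path.join(directory, filename)) as handle:
                for line in handle:
                    fields = line.split()
                    if len(fields) < 4:
                        continue  # skip blank or malformed lines
                    protein2 = fields[0]
                    l_pie, l_pip, l_pit = (float(x) for x in fields[1:4])
                    if l_pit >= cutoff:
                        yield protein1, protein2, l_pit

    # Example: count the pairs above a cutoff of 600
    # print(sum(1 for _ in pairs_above_cutoff("PIT", 600.0)))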



For the TAP validation you report the overlap at Lcut = 300. How much smaller was it when Lcut = 600 was used as a cutoff?

The overlap between the PIP and the TAP validation:

Lcut = 300 (reported in the paper): 424 pairs, of which 185 are gold-standard positives and 16 gold-standard negatives.
Lcut = 600: 300 pairs, of which 161 are gold-standard positives and 13 gold-standard negatives.



How did you measure the performance of the algorithm?

Specifically, in the caption of Figure 2(c), I interpret the text to mean that at the likelihood-ratio threshold which gives TP/FP = 0.3, there are 183,295 pairs that exceed the threshold, and 6,179 of them are gold-standard positives (i.e., true positives). That is consistent with a sensitivity (using your definition of sensitivity as TP/P) of 75%, since 6179/8250 = 0.75. However, it does not seem to be consistent with TP/FP = 0.3: if all remaining pairs were false positives, FP would be roughly 177,000, whereas TP/FP = 0.3 would imply FP of about 20,000.

Of the ~183,000 predicted protein pairs (i.e., predicted to be positive), some overlapped with the "gold-standard" positives and negatives. The intersection between the gold-standard positives and the predicted positives contained 6,179 protein pairs ("true positives", TP), whereas the intersection between the gold-standard negatives and the predicted positives contained 22,708 pairs ("false positives", FP). The second number was not reported in the paper because of space constraints. Thus, TP/FP is approximately 3/10.

We could only cross-validate our prediction results with protein pairs for which it is already known (with some reasonable level of certainty) whether they are interacting or not ("gold standards"). These represent only a sample of the total number of interacting and non-interacting protein pairs, which is much larger. The remaining predicted protein pairs (~183,000 - 6,179 - 22,708) represent new predictions that need to be verified by other (experimental) methods. A first-order estimate of the accuracy of these new predictions is that there will also be roughly 3 true in 10 false predictions (by analogy to the sample on which we could do cross-validation), but obviously this remains to be verified by future experiments.

We were using 8,250 gold-standard positives, so the 6,179 protein pairs in the intersection between the gold-standard positives and the predictions represent 75% sensitivity (in our definition).
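
For readers who want to check the arithmetic, here it is spelled out in a few lines of Python, using only the counts quoted above (6,179 true positives, 22,708 false positives, 8,250 gold-standard positives):

    true_positives = 6179            # predicted positives that are gold-standard positives
    false_positives = 22708          # predicted positives that are gold-standard negatives
    gold_standard_positives = 8250   # total number of gold-standard positives (P)

    sensitivity = true_positives / gold_standard_positives  # TP / P  ~= 0.75
    tp_fp_ratio = true_positives / false_positives           # TP / FP ~= 0.27, i.e. roughly 3/10

    print(f"sensitivity = {sensitivity:.2f}, TP/FP = {tp_fp_ratio:.2f}")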



Did you use commercially available Bayesian network software?

We didn't use any commercially available tools. Most of the computations involved comparing the various datasets with the "gold standard" references, for which we wrote our own scripts. The resulting parameters for the Bayesian networks can be represented in a spreadsheet table (such as those in our supplementary material).

I briefly looked into a tool called "WinMine", which was made available by Microsoft Research. A general overview of Bayesian network software can be found here (I didn't write it, but it seems quite good to me): http://www.ai.mit.edu/~murphyk/Software/BNT/bnsoft.html
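
For orientation, here is a hedged sketch of the kind of computation such scripts perform: binning one evidence source and estimating, for each bin, the ratio P(bin | gold-standard positive) / P(bin | gold-standard negative). The function name, the binning and the toy counts are invented for illustration; this is not the authors' actual code.

    def likelihood_ratios(bin_counts_pos, bin_counts_neg):
        """Return L(bin) = P(bin | positive) / P(bin | negative) for each bin,
        given counts of gold-standard positive and negative pairs per bin."""
        total_pos = sum(bin_counts_pos.values())
        total_neg = sum(bin_counts_neg.values())
        return {
            b: (bin_counts_pos[b] / total_pos) / (bin_counts_neg[b] / total_neg)
            for b in bin_counts_pos
            if bin_counts_neg.get(b, 0) > 0
        }

    # Toy example: gold-standard pair counts in three expression-correlation bins
    # (the numbers are made up for illustration only).
    pos = {"high": 600, "medium": 300, "low": 100}
    neg = {"high": 2000, "medium": 10000, "low": 30000}
    print(likelihood_ratios(pos, neg))

If the evidence sources are treated as conditionally independent (a naive Bayes assumption), the per-source likelihood ratios are simply multiplied to give a combined ratio for a protein pair.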



Questions regarding Figure 2: There are a couple of things that I'm not very clear about, and I'd appreciate it if you could help me understand them. The Figure 2 legend says "The arrow shows the difference in sensitivity at TP/FP = 0.3. At this level, the PIP contains 183,295 protein pairs, of which 6179 are gold-standard positives (75% sensitivity), whereas the PIE contains 31,511 protein pairs and 1758 gold-standard positives among these (21% sensitivity)." I'm a little confused by the numbers. If I understood it correctly, for the PIP, #TP = 6179 and #{predictions} = 183,295, therefore #FP = #{predictions} - #TP = 183,295 - 6179 = 177,116, and then TP/FP = #TP/#FP = 6179/177,116 = 0.035, which is different from 0.3 as described [in the figure legend]. Also, for the PIE, #TP = 1758 and #{predictions} = 31,511, therefore #FP = #{predictions} - #TP = 31,511 - 1758 = 29,753, and then TP/FP = #TP/#FP = 1758/29,753 = 0.059, which is again different from 0.3. Could you please tell me if I'm missing something here? If not, perhaps you could provide a corrected version of Figure 2?

The quick answer is that we defined "negatives" in a different way.

In the "gold standards" we have both a set of positives and a set of negatives -- protein pairs that are interacting or non-interacting with a reasonable level of certainty -- and we do cross-validation only on those. They represent only a sample of the total number of interacting and non-interacting protein pairs, which is much larger. The negatives were derived from proteins that have different subcellular localizations, with the assumption that these are much less likely to form complexes with each other than proteins that share the same localizations.

In the example you mention, of the ~183,000 protein pairs (predicted to be positive), some overlapped with the "gold-standard" positives and negatives. The intersection between the gold-standard positives and the predicted positives contained 6,179 protein pairs ("true positives", TP), whereas the intersection between the gold-standard negatives and the predicted positives contained 22,708 pairs ("false positives", FP). The second number was not reported in the paper because of space constraints. Thus, TP/FP is approximately 3/10.

The remaining predicted protein pairs (~183,000 - 6,179 - 22,708) represent new predictions that need to be verified by other (experimental) methods. A first-order estimate of the accuracy of these new predictions is that there will also be roughly 3 true in 10 false predictions (by analogy to the sample on which we could do cross-validation), but obviously this remains to be verified by future experiments.



On page 451, the 3rd paragraph in the right column (starting with "To further test...") describes the success evidenced by the "pull-down" experiments. Would it be possible for you to provide a list of all interactions (whether or not they overlap with the PIP) found in the 98 TAP pull-down experiments?

I will post a dataset that contains the TAP-tagging experiments. It will contain those results that overlapped with the PIP (because I have these data available myself). I am not sure whether I will be able to post the remaining data, since these were generously provided to us by the Greenblatt lab at the University of Toronto, and I have to check back with them first. I have let them know that people have requested their data. It may be that they are planning to publish these data in another paper in the near future.



1. What does each column in the files under L_PIP/, L_PIE/ and L_PIT/ mean? 2. What do the third columns in go.txt, mips.txt and ES.txt mean? 3. What does each column in pos_MIPS_complexes.txt mean?

1. In L_PIP, L_PIE and L_PIT, the file names themselves stand for the first protein, and the second protein is listed in the first column of the files. The second column in each file is the likelihood ratio, i.e. the following ratio of conditional probabilities: L = P(data | protein 1 and protein 2 are in the same complex) / P(data | protein 1 and protein 2 are in different subcellular localizations). The "data" in the PIP are four genomic datasets, and in the case of the PIE four high-throughput interaction datasets. In the case of the PIT it is all of these aforementioned datasets together (as explained in the Science paper).

2. In go.txt and mips.txt, the third column is a "functional similarity" count; how we compute it is explained in the supplementary material (see also the sketch after this answer). Basically, we first compute which functional classes the two proteins share in either GO (biological process) or the MIPS functional catalog. Then we count how many other protein pairs share exactly the same set of functional classes. This is the count in the third column. In general, the larger this count of pairs, the less specific the functional relationship between the two proteins (for example, there are lots of proteins in the general class "metabolism", so the functional similarity count would probably be very large, whereas there are very few proteins in, say, "leucine catabolism", making the functional similarity count very small). In ES.txt, the third column is either "EE", "EN" or "NN"; "E" stands for essential and "N" for non-essential. For instance, when both proteins are essential, the third column is "EE", and so on.

3. In pos_MIPS_complexes.txt, the first two columns are protein names (or rather ORF names), the next column is an ID for the protein complex in which these two proteins are present (this ID stems from the MIPS complexes catalog), the next column is the name of the complex, and the last column simply tells you how many proteins there are in the complex. For instance, in the Calcineurin B complex there are 3 different proteins. This is only a subset of the MIPS complexes catalog, because we removed certain classes that do not relate to one specific complex (such as the class "other transcription complexes").
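
To make the "functional similarity" count concrete, here is a small illustrative Python sketch under simplifying assumptions: each protein maps to a set of functional class IDs, and a pair's count is the number of pairs sharing exactly the same set of classes (here the count includes the pair itself, a minor simplification). The data structures, names and toy annotations are invented for the example and are not the authors' code.

    from itertools import combinations

    def functional_similarity_counts(annotations):
        """annotations: dict mapping protein -> set of functional class IDs.
        Returns a dict mapping (protein1, protein2) -> number of pairs that
        share exactly the same set of shared classes."""
        shared_sets = {}
        for p1, p2 in combinations(sorted(annotations), 2):
            shared = frozenset(annotations[p1] & annotations[p2])
            if shared:
                shared_sets[(p1, p2)] = shared

        # How many pairs share each exact set of classes?
        set_frequency = {}
        for shared in shared_sets.values():
            set_frequency[shared] = set_frequency.get(shared, 0) + 1

        return {pair: set_frequency[shared] for pair, shared in shared_sets.items()}

    # Toy example (made-up annotations): the broader the shared classes,
    # the larger the count, i.e. the less specific the relationship.
    toy = {
        "YAL001C": {"metabolism"},
        "YBR002W": {"metabolism"},
        "YCR003W": {"metabolism", "leucine catabolism"},
        "YDR004W": {"metabolism", "leucine catabolism"},
    }
    print(functional_similarity_counts(toy))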



How are the gold-standard positives defined? For example, for the cytoplasmic ribosome small subunit, do you count any pair of proteins from the complex as interacting? If so, then your gold standard includes many pairs that participate in the same complex but do not interact directly. As such, the yeast two-hybrid methods would not be expected to pick up many of these pseudo-interactions.

The pairs in the gold-standard positives are a sample of protein pairs that are in the same complex; it doesn't necessarily mean that they interact physically. So the pull-down methods (TAP-tagging etc.) would really be the appropriate experiments for this situation. The yeast two-hybrid experiments, on the other hand, should really only capture physical interactions, which are a subset of the pairs of proteins in the same complex.

Indeed, the yeast two-hybrid experiments have the lowest sensitivity with respect to the gold-standard positives -- for instance, there are 50 pairs in the Uetz et al. data that are in the gold-standard positives, whereas the corresponding numbers for the pull-down datasets are 464 (Ho et al.) and 1,743 (Gavin et al.). (You can derive these numbers from table S2 in the supplementary material.)

On the other hand, the Uetz data contains the highest ratio of gold-standard positives to negatives (what we call the "TP/FP" ratio in the paper), with 50/148 =~ 0.34, whereas these ratios are 1743/6445 =~ 0.27 for Gavin et al. and 464/5786 =~ 0.08 for Ho et al. So, in a sense, the Uetz data is more accurate, but less sensitive, than the pull-down data. The numbers are very similar for the Ito core data.



I am a Mac OS X user and don't know if that is the problem or not, but I can only get one set of images when I do a search. After that, all subsequent searches give a blank page. I can click on your premade groupings (e.g. exosome), but I only get one shot at querying your database by ORF. Sometimes I get an error message that I have not chosen one of the parameters, and other times I just get a blank page (it depends on which browser I am using -- I have tried Safari, Netscape and IE).

The search does NOT work with the Safari (version 1.1) browser. In that case, it seems that even the first search fails, producing the error message "Did you forget to input the starting node..." -- this appears to happen because the form doesn't pass the data on to the CGI script.



BAYESIAN NETWORKS PREDICT PROTEIN INTERACTIONS
Statistical method does better job than high-throughput experimental data
by: CELIA M. HENRY

Chemical and Engineering News