README
Author: Tara Gianoulis
Last Modified: Nov. 22, 2008

This is the README for Gianoulis, et al., PNAS 2009.

This work generated a large amount of data. Many of these underwent further processing. Here we include both the raw data dump and a number of utility scripts to facilitate additional analyses.

Please note: the BLAST tables are in the unfiltered stage. This was the most computationally intensive portion; we provide it here to help prevent unnecessary duplication of work.

This directory is split into three subdirectories.
(A) RawData - Raw data dump
(B) Scripts - Utility Scripts
(C) ProcessedData - Processed data

(A) RAW DATA DUMP

(1) Description: Mapping Peptides to Sample Ids through Scaffolds
    Name: mappingPeptidetoSite.txt
    FileFormat: tab-delimited
    Format: PEP ORF SMPL SCAF

(2) Description: Mapping Peptides to KEGG
    Name: mappingGOSBlast.txt
    FileFormat: tab-delimited
    Format: BLAST Tab delimited header

(3) Description: Normalized frequency counts of COGs, KEGG and module pathways, and operons in each sample (this then underwent further processing, see utility scripts)
    FileFormat: tab-delimited
    Format: In all cases the format is:
        COLUMNHEADER - GOS ids
        ROWHEADER - COGID, KEGGID, or OPERONID, respectively with its associated description, in the form COGXXXX(Description)
        Elements - count of COG, KEGG, or OPERON in each GOS id (see text for normalization procedure)
    TYPE:
        COG: COG_countpersite.txt
        KEGG: KEGG_countpersite.txt
        MODULE: MODULE_countpersite.txt
        OPERON: OPERON_countpersite.txt

(4) Description: Metadata with added missing data
    Name: gos_metadata_renamed.tara_interpolation_added.txt.csv
    FileFormat: tab-delimited

(B) Utility Scripts

(1) Color-coding ipath maps based on some number
    Name: mapfromColorGradient_moreGeneral.m

(2) Metrics to evaluate results of CCA
    Name: cca_evalMetrics.R

(3) Metrics to evaluate cluster membership
    Name: parse.py, randindex.py, nmi.py

Further Info for Running (3)
program files:
parse.py: parses the clustering input file.
It's expecting something like:

1 1 3
2 3 1
3 1 1
4 3 1
5 1 1

Any whitespace will work as a separator. You can optionally have a header line that describes the columns, in which case you need to modify the call to parseFile() to set the parameter hasheaders=True. The sample names and cluster ids can be any string.

randindex.py and nmi.py implement the actual methods.

To run an example on $filename with 1000 bootstraps:

python randindex.py $filename 1000
EXAMPLE OUTPUT: 0.575075075075 0.261

python nmi.py $filename 1000
EXAMPLE OUTPUT: 0.0805628617186 0.232

The output is the actual value, followed by the p-value.

(C) Processed Data - tab delimited versions of supplementary materials

(1) Description: Results for Benjamini-Hochberg Corrected Spearman Correlations
    Name: all_pairwisecorr.txt
    FileType: Tab delimited
    FileFormat: EnvironmentalParameter Id Description R2 correctedpval

(2) Description: Significant models predicting metabolism from environment
    Name: all_LM_predictMetabolismfromEnv.txt
    FileType: Tab delimited
    FileFormat: factor adjR2 pval term coeff_sign_code coef_est_pval sig_level

(3) Description: Significant models predicting environment from metabolism
    Name: all_LM_predictEnvfromMetabolism.txt
    FileType: Tab delimited
    FileFormat: factor adjR2 pval term coeff coeff_est_pval coeff_est_sign_code sig_level

(4) Description: DPM Results
    Name: all_DPM.txt
    FileType: Tab delimited
    FileFormat: see DPM_README.txt

(5) Description: Structural Correlations from CCA
    Name: all_regCCA.txt
    FileType: Tab delimited
    FileFormat: Id StructCorrDim1 StructCorrDim2
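Note on (B)(3): the clustering input format and the "value, then p-value" output described for the cluster-membership scripts can be illustrated with a short sketch. This is not the code in parse.py or randindex.py. It assumes the three whitespace-separated columns are a sample name followed by that sample's cluster id in each of two clusterings, and it uses a simple label-permutation test as a stand-in, since this README does not describe the resampling scheme the original scripts use. The function names parse_clusterings, rand_index and permutation_pvalue are hypothetical.

```python
import random
from itertools import combinations


def parse_clusterings(text, has_headers=False):
    """Parse 'sample clusterA clusterB' rows; any whitespace separates fields."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if has_headers:
        rows = rows[1:]  # drop the optional column-description line
    samples = [r[0] for r in rows]
    labels_a = [r[1] for r in rows]
    labels_b = [r[2] for r in rows]
    return samples, labels_a, labels_b


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which the two clusterings agree."""
    if len(labels_a) < 2:
        raise ValueError("need at least two samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def permutation_pvalue(labels_a, labels_b, n_iter=1000, seed=0):
    """Assumed stand-in test: shuffle one labeling and count how often the
    shuffled Rand index is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = rand_index(labels_a, labels_b)
    shuffled = list(labels_b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if rand_index(labels_a, shuffled) >= observed:
            hits += 1
    return observed, hits / n_iter


# The five-sample example from the parse.py description.
example = """1 1 3
2 3 1
3 1 1
4 3 1
5 1 1"""

samples, a, b = parse_clusterings(example)
value, pval = permutation_pvalue(a, b, n_iter=1000)
print(value, pval)  # value, then p-value; the Rand index here is 0.4
```

Because sample names and cluster ids are kept as strings, non-numeric labels work unchanged. The p-value from this sketch will not necessarily match what randindex.py reports for the same file.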