White Lab ENCODE/modENCODE/modERN readme
Version 1.0
Last updated by Alec Victorsen (avictorsen@uchicago.edu) and Lijia Ma (ljma@uchicago.edu), 04/13/2015

In addition to this readme, please look at ReadMe.pdf, which contains slides created by Jay Rhem. Some file descriptions are taken from Anshul Kundaje’s website:
https://sites.google.com/site/anshulkundaje/projects/idr#TOC-Intuitive-Explanation-of-IDR-and-IDR-plots

The goal of our ChIP-seq pipeline is to generate a consistent peak set from multiple biological replicates. We start by aligning both the IP and input fastq files to the reference genome (BWA), then generate peaks and signal tracks (SPP), and finally examine reproducibility between biological replicates (IDR).

In detail, peaks are called with relaxed parameters for each biological replicate, for the merged data, and for each pseudo-replicate. Following peak calling, peak consistency is examined between biological replicates and between pseudo-replicates. The final peak set is taken from the peaks called on the merged data, using a threshold obtained from the peak consistency test.

----------------------------------------------------------------
Files required to run pipeline
----------------------------------------------------------------
1. table file: usually table.txt. Contains information such as replicate, factor, and fastq file name.
2. config file: usually dm_DATE.config. Contains the information required to generate the main shell script that executes the pipeline. The current version does not require this file; its contents are integrated into the script generator.

----------------------------------------------------------------
Main files generated by pipeline
----------------------------------------------------------------
1. IDR log: named either “species_IDR_date.log” or “genome_IDR.log”. Contains the number of peaks passing the IDR threshold, either between pseudo-replicates or between IPs.
2. Cross-correlation file: *.cc. Contains information about fragment length. The ENCODE guideline is that NSC should be above 1.05 and RSC should be above 1. According to Anshul Kundaje, this does not hold true for organisms with genomes smaller than human.
3. SAM stats: Contains information about the quality of the alignment and the uniqueness of the reads. The q30 files describe the data after the aligned fragments have been filtered at a quality score of 30.
4. Bed files: Stored in the ./bed/ folder. They contain the called peaks for each replicate, the divided pseudo-replicates, and the pooled Rep0 replicate, as well as the optimal and conservative peak sets for each TF. The spp.optimal files are the ones to use as the resulting peaks from our TF data.
5. Wig files: Stored in the ./wig/ folder. Several wigs are generated for each TF. Each replicate, including Rep0 and the input data, has two wigs: density and density scaled. The density scaled file is the original density file after rescaling (the scaling factor is not documented here). For each IP, a background-subtracted wig is also generated, named with *VS*.

----------------------------------------------------------------
Other generated files
----------------------------------------------------------------
1. batch-consistency-analysis.r: required for Anshul’s IDR script. Outputs are the *.sav, *.Rout, *.npeaks-aboveIDR.txt, and overlapped-peaks files.
2. sav files: output from batch-consistency-analysis.r.
3. overlapped-peaks.txt: output from Anshul’s batch-consistency-analysis.r script. The full set of peaks that overlap between the replicates.
4. npeaks-aboveIDR.txt: output from Anshul’s batch-consistency-analysis.r script. The number of peaks that pass specific IDR thresholds.
5. Rout.txt: output from Anshul’s batch-consistency-analysis.r script.
6. plot.ps: output from Anshul’s batch-consistency-analysis.r script. These files contain the IDR plots used to find significant peaks.
7. batch-consistency-plot.r: required for Anshul’s IDR script.
8. functions-all-clayton-12-13.r: required for Anshul’s IDR script.
9. genome_table.txt: required for Anshul’s IDR script.
10. sai files: binary files generated by the BWA alignment of fastq to genome.
11. sam files: generated from the sai files using BWA samse to produce the aligned sequences. If q30 appears in the name, the file has been filtered to remove sequences with low alignment scores.
12. bam files: generated only for the input.
13. tagAlign.gz: BED-formatted files. Filtered sam files are converted to BED, and the names on each line are replaced by “N”.
15. tagAlign.pdf: the cross-correlation plots whose data is reported in the *.cc file.
16. rmblacklist files: see Anshul’s discussion of blacklists at: https://sites.google.com/site/anshulkundaje/projects/blacklists.
17. regionPeak.gz: output files from SPP.
18. nohup.out: STDOUT from running the pipeline.
19. run_DATE.sh or run.sh or run_pipeline.sh: the actual pipeline script.
20. download.sh: script to download the fastq files from the bionimbus server.
21. report.sh: used to create the IDR table log.
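
The NSC/RSC check described for the cross-correlation (*.cc) file under “Main files generated by pipeline” can be sketched in Python as below. This is a minimal sketch and not part of the pipeline. It assumes each *.cc file holds one tab-delimited line per dataset with 11 columns, NSC in the ninth and RSC in the tenth; that layout is an assumption, so please check it against your own files. The column names and the function names parse_cc_line and check_cc are made up for illustration.

```python
import sys

# ASSUMPTION: column order of one tab-delimited *.cc line. This layout is not
# stated in the readme; verify it against your own *.cc files before use.
CC_COLUMNS = [
    "filename", "num_reads", "est_frag_len", "corr_est_frag_len",
    "phantom_peak", "corr_phantom_peak", "argmin_corr", "min_corr",
    "nsc", "rsc", "quality_tag",
]

NSC_MIN = 1.05  # threshold quoted in this readme
RSC_MIN = 1.0   # threshold quoted in this readme


def parse_cc_line(line):
    """Split one tab-delimited *.cc line into a dict keyed by CC_COLUMNS."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CC_COLUMNS):
        raise ValueError(
            f"expected {len(CC_COLUMNS)} columns, got {len(fields)}"
        )
    record = dict(zip(CC_COLUMNS, fields))
    # Only NSC and RSC are converted; the other fields stay as strings.
    record["nsc"] = float(record["nsc"])
    record["rsc"] = float(record["rsc"])
    return record


def check_cc(record, nsc_min=NSC_MIN, rsc_min=RSC_MIN):
    """Return (nsc_ok, rsc_ok): whether each score is above its threshold."""
    return record["nsc"] > nsc_min, record["rsc"] > rsc_min


if __name__ == "__main__":
    # Usage: python check_cc.py sample1.cc sample2.cc ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                rec = parse_cc_line(line)
                nsc_ok, rsc_ok = check_cc(rec)
                print(rec["filename"],
                      "NSC", rec["nsc"], "ok" if nsc_ok else "low",
                      "RSC", rec["rsc"], "ok" if rsc_ok else "low")
```

The thresholds are passed as parameters because, as noted above, the ENCODE values may not hold for organisms with smaller genomes; a "low" flag is a prompt to look at the tagAlign.pdf plot, not a verdict on the dataset.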