White Lab ENCODE/modENCODE/modERN readme
Version 1.0
Last updated by Alec Victorsen (avictorsen@uchicago.edu) and Lijia Ma
(ljma@uchicago.edu), 04/13/2015

In addition to this readme, please look at the ReadMe.pdf with slides created by Jay Rhem.

Some file descriptions are taken from Anshul Kundaje’s website:
https://sites.google.com/site/anshulkundaje/projects/idr#TOC-Intuitive-Explanation-of-IDR-and-IDR-plots

The general procedure of our ChIP-seq pipeline is to generate a consistent peak set from 
multiple biological replicates. We start by aligning both the IP and input fastq files 
to the reference genome (BWA), then generate peaks and signal tracks (SPP), and finally 
examine reproducibility between biological replicates (IDR).

In detail, peaks are called with relaxed parameters for each biological replicate, for the 
merged data, and for each pseudo-replicate. Following peak calling, peak consistency is 
examined between biological replicates and between pseudo-replicates. The final peak set is 
taken from the peaks called on the merged data, using a threshold obtained from the peak 
consistency test.
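
The last step above can be sketched in Python as follows. This is only an illustration, 
not the pipeline's own code. It assumes the consistency threshold is a number of peaks 
(n_consistent), that the merged-data peaks are tab-separated narrowPeak-style lines, and 
that peaks are ranked by the value in column 7 (signalValue). The function names and the 
example lines are hypothetical.

```python
def read_peaks(lines):
    """Split tab-separated peak lines into lists of fields, skipping blank lines."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def select_final_peaks(merged_peaks, n_consistent, score_index=6):
    """Return the top n_consistent merged-data peaks, ranked by score.

    score_index=6 assumes narrowPeak-style lines with signalValue in
    column 7 (an assumption; adjust to the real file layout).
    """
    ranked = sorted(
        merged_peaks,
        key=lambda fields: float(fields[score_index]),
        reverse=True,
    )
    return ranked[:n_consistent]


# Hypothetical merged-data peaks (not real pipeline output).
example_lines = [
    "chr2L\t100\t300\t.\t0\t.\t12.5\t-1\t-1\t100",
    "chr2L\t900\t1100\t.\t0\t.\t40.0\t-1\t-1\t100",
    "chr3R\t500\t700\t.\t0\t.\t3.2\t-1\t-1\t100",
]

final_peaks = select_final_peaks(read_peaks(example_lines), n_consistent=2)
for peak in final_peaks:
    print("\t".join(peak))
```

With n_consistent=2, the two strongest example peaks are kept and the weakest is dropped.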


----------------------------------------------------------------
Files required to run pipeline
----------------------------------------------------------------
1. table file: usually table.txt, containing information such as replicate, factor, and fastq file name.

2. config file: usually dm_DATE.config.  This file contains the information required to 
   generate the main shell script that executes the pipeline.  The current version does not 
   require this file; the information is integrated into the script generator.


----------------------------------------------------------------
Main files generated by pipeline
----------------------------------------------------------------
1. IDR log: named either “species_IDR_date.log” or “genome_IDR.log”.  This file contains 
   the number of peaks that pass the IDR threshold, either between pseudo-replicates or 
   between IPs.

2. Cross-correlation file: *.cc.  This file contains info about fragment length.  The 
   ENCODE guideline is that the NSC should be above 1.05 and the RSC should be above 1.  
   According to Anshul Kundaje, this does not hold true for organisms with genomes smaller 
   than human.

3. SAM stats:  This file contains information regarding the quality of the alignment and 
   the uniqueness of the reads.  The q30 files report the same information after the aligned 
   fragments have been filtered at a quality score of 30.

4. Bed files: These are stored in the ./bed/ folder and contain the called peaks for each
   replicate, the divided pseudo-replicates, and the pooled Rep0 replicate, as well as the 
   optimal and conservative peak sets for each TF.  The spp.optimal files should be used 
   as the final peaks for our TF data.

5. Wig files: These are stored in the ./wig/ folder.  Several wigs are generated for each 
   TF.  Each replicate, including Rep0 and the input data, has two wigs: density and 
   density scaled.  The density scaled wig is the original density wig after scaling; the 
   scaling factor is not documented here.  For each IP, a background-subtracted wig is also 
   generated, named with *VS*.
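
The NSC/RSC check described for the cross-correlation file can be sketched in Python as 
follows. This is a hedged illustration, not part of the pipeline. It assumes the *.cc file 
is a single tab-separated line with 11 columns, as written by Anshul Kundaje's 
cross-correlation script, with NSC in column 9 and RSC in column 10. The column names, 
function names, and example values are hypothetical.

```python
# Assumed column layout of a *.cc line (an assumption, verify against a real file).
CC_COLUMNS = [
    "filename", "num_reads", "est_frag_len", "corr_est_frag_len",
    "phantom_peak", "corr_phantom_peak", "argmin_corr", "min_corr",
    "nsc", "rsc", "quality_tag",
]


def parse_cc_line(line):
    """Parse one tab-separated *.cc line into a dict; NSC and RSC become floats."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CC_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(CC_COLUMNS), len(fields)))
    record = dict(zip(CC_COLUMNS, fields))
    record["nsc"] = float(record["nsc"])
    record["rsc"] = float(record["rsc"])
    return record


def passes_guideline(record, min_nsc=1.05, min_rsc=1.0):
    """Apply the thresholds quoted in this readme (NSC above 1.05, RSC above 1)."""
    return record["nsc"] > min_nsc and record["rsc"] > min_rsc


# Hypothetical example line (not real data).
example = "rep1.tagAlign.gz\t5000000\t150\t0.35\t50\t0.30\t1500\t0.28\t1.25\t1.40\t1"
record = parse_cc_line(example)
print(record["est_frag_len"], record["nsc"], record["rsc"], passes_guideline(record))
```

Because these thresholds may not hold for smaller genomes, a failing result is best 
treated as a flag for manual review rather than an automatic rejection.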


----------------------------------------------------------------
Other generated files
----------------------------------------------------------------
1. batch-consistency-analysis.r: required for Anshul’s IDR script. Outputs are *.sav, 
   *.Rout, *.npeaks-aboveIDR.txt, and *.overlapped-peaks.txt.
2. sav files: output files from batch-consistency-analysis.r.
3. overlapped-peaks.txt: output from Anshul’s batch-consistency-analysis.r script. The full 
   set of peaks that overlap between the replicates.
4. npeaks-aboveIDR.txt: output from Anshul’s batch-consistency-analysis.r script.  The number
   of peaks that pass specific IDR thresholds.
5. Rout.txt: output from Anshul’s batch-consistency-analysis.r script.
6. plot.ps: output from Anshul’s batch-consistency-analysis.r script.  Contains the IDR 
   plots used to find significant peaks.

7. batch-consistency-plot.r: required for Anshul’s IDR script.
8. functions-all-clayton-12-13.r: required for Anshul’s IDR script.
9. genome_table.txt: required for Anshul’s IDR script.

10. sai files: binary files generated from BWA alignment of fastq to genome.
11. sam files: generated from the sai files using BWA samse, which produces the alignments.  
    If q30 appears in the name, the file has been filtered to remove sequences with low 
    alignment scores.

12. bam: bam files are only generated for input.
13. tagAlign.gz: bed-formatted files.  The filtered sam files are converted to bed, and the 
    name on each line is replaced by “N”.
15. tagAlign.pdf:  These are the cross-correlation plots whose data is displayed in the *.cc 
    file.

16. rmblacklist files: see Anshul’s discussion of blacklists at:
    https://sites.google.com/site/anshulkundaje/projects/blacklists

17. regionPeak.gz: output files from SPP.

18. nohup.out: STDOUT from running the pipeline.

19. run_DATE.sh or run.sh or run_pipeline.sh: actual pipeline script.

20. download.sh: script to download the fastq files from the Bionimbus server.

21. report.sh: used to create the IDR table log.
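
As an illustration of how a filtered sam record relates to a tagAlign line (items 11 and 
13), here is a minimal Python sketch. It is not the pipeline's own conversion step. It 
assumes the q30 filter is applied to the SAM mapping quality field, that the bed score 
column carries that mapping quality, and that unmapped reads are dropped. The function 
name and example records are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def sam_to_tagalign(sam_line, min_quality=30):
    """Convert one SAM alignment line to a bed-style tagAlign line.

    Returns None for header lines, unmapped reads, and reads whose
    mapping quality is below min_quality (assumed meaning of "q30").
    """
    if sam_line.startswith("@"):
        return None
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    chrom = fields[2]
    pos = int(fields[3])       # SAM positions are 1-based
    mapq = int(fields[4])
    cigar = fields[5]
    if flag & 0x4 or cigar == "*" or mapq < min_quality:
        return None
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1            # bed starts are 0-based
    strand = "-" if flag & 0x10 else "+"
    # The read name is replaced by "N", as described for tagAlign.gz.
    return "\t".join([chrom, str(start), str(start + ref_len), "N", str(mapq), strand])


# Hypothetical SAM records (not real data).
seq = "A" * 36
qual = "I" * 36
good = "\t".join(["read1", "0", "chr2L", "101", "37", "36M", "*", "0", "0", seq, qual])
weak = "\t".join(["read2", "16", "chr2L", "501", "10", "36M", "*", "0", "0", seq, qual])

print(sam_to_tagalign(good))
print(sam_to_tagalign(weak))
```

The first record is kept and written with the name “N”; the second is dropped because its 
quality is below 30.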