Prediction and Characterization of Non-coding RNAs in C. elegans by Integrating Conservation, Secondary Structure and High Throughput Sequencing and Array Data (incRNA)

Supplementary web site for incRNA

Zhi John Lu*, Kevin Y. Yip*, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterston, Valerie Reinke and Mark Gerstein,

Prediction and Characterization of Non-coding RNAs in C. elegans by Integrating Conservation, Secondary Structure and High Throughput Sequencing and Array Data

Genome Research (Published in Advance December 22, 2010, doi:10.1101/gr.110189.110)

Datasets

Gold-standard set (bins of the four sequence element classes, in Weka .arff format)
- The cross-validation (training and testing sets) portion (in tab-delimited format)
- The independent validation set portion (in tab-delimited format)
Full dataset (all bins from the genome alignment of C. elegans and C. briggsae, in Weka .arff format)

Prediction results (Supplementary Files)

Training: cross-validation set, prediction: independent validation set
Training: gold-standard set, prediction: full dataset
Supplementary Files (Master tables of the high-confidence and medium-confidence ncRNA candidates)
- Supplementary File 1 (Full set): Prediction scores, structural features, sequence features and expression values for candidate ncRNA fragments/bins (10,994 bins and 7,237 fragments, in Microsoft Excel format)
- Supplementary File 2 (Intergenic portion): Intergenic candidate ncRNA loci/fragments targeted by Pol II and transcription factors across different developmental stages (1,678 bins and 1,223 loci, in Microsoft Excel format)

Prediction software (Java source code and compiled classes)

This software performs machine learning based on computed features of genomic regions using a set of supervised learning methods, and picks the best method according to an unbiased procedure. It does not generate the features, as the generation process involves third-party software for tasks such as sequence alignment and RNA structure prediction. The file format of datasets and a script for running the program can be found in the file readme.txt inside the packages.
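As a rough illustration of choosing among several supervised learning methods, the sketch below scores each candidate by k-fold cross-validation on a labelled set and keeps the one with the highest accuracy. It is not the incRNA code and does not use Weka. The class name ModelSelectionSketch, the Learner interface, the two toy learners and the toy data are all hypothetical, and the selection procedure actually used by the software is the one described in the paper and in readme.txt.

```java
import java.util.*;
import java.util.function.*;

/** Hypothetical sketch: pick the learner with the best cross-validated accuracy. */
public class ModelSelectionSketch {
    /** A learner turns training rows and labels into a predictor for one feature vector. */
    public interface Learner {
        Function<double[], Integer> train(double[][] x, int[] y);
    }

    /** Baseline: always predict the most frequent training label. */
    public static final Learner MAJORITY = (x, y) -> {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : y) counts.merge(label, 1, Integer::sum);
        int best = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        return v -> best;
    };

    /** Nearest centroid: predict the label whose mean feature vector is closest. */
    public static final Learner NEAREST_CENTROID = (x, y) -> {
        Map<Integer, double[]> sums = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < x.length; i++) {
            double[] s = sums.computeIfAbsent(y[i], k -> new double[x[0].length]);
            for (int j = 0; j < s.length; j++) s[j] += x[i][j];
            counts.merge(y[i], 1, Integer::sum);
        }
        // Turn per-class sums into per-class means (centroids).
        sums.forEach((label, s) -> { for (int j = 0; j < s.length; j++) s[j] /= counts.get(label); });
        return v -> {
            int best = -1;
            double bestDist = Double.MAX_VALUE;
            for (Map.Entry<Integer, double[]> e : sums.entrySet()) {
                double d = 0;
                for (int j = 0; j < v.length; j++) d += Math.pow(v[j] - e.getValue()[j], 2);
                if (d < bestDist) { bestDist = d; best = e.getKey(); }
            }
            return best;
        };
    };

    /** Pooled accuracy over k folds; row i is held out in fold i % k. */
    public static double cvAccuracy(Learner learner, double[][] x, int[] y, int k) {
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> trainIdx = new ArrayList<>();
            List<Integer> testIdx = new ArrayList<>();
            for (int i = 0; i < x.length; i++) (i % k == fold ? testIdx : trainIdx).add(i);
            double[][] tx = new double[trainIdx.size()][];
            int[] ty = new int[trainIdx.size()];
            for (int i = 0; i < tx.length; i++) { tx[i] = x[trainIdx.get(i)]; ty[i] = y[trainIdx.get(i)]; }
            Function<double[], Integer> model = learner.train(tx, ty);
            for (int i : testIdx) if (model.apply(x[i]) == y[i]) correct++;
        }
        return (double) correct / x.length;
    }

    /** Returns the name of the candidate with the highest cross-validated accuracy. */
    public static String pickBest(Map<String, Learner> candidates, double[][] x, int[] y, int k) {
        String bestName = null;
        double bestAcc = -1;
        for (Map.Entry<String, Learner> e : candidates.entrySet()) {
            double acc = cvAccuracy(e.getValue(), x, y, k);
            if (acc > bestAcc) { bestAcc = acc; bestName = e.getKey(); }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Toy data (made up): two well-separated clusters with alternating labels.
        double[][] x = {{0.0, 0.1}, {1.0, 0.9}, {0.1, 0.0}, {0.9, 1.0},
                        {0.2, 0.1}, {1.0, 0.8}, {0.0, 0.2}, {0.8, 0.9}};
        int[] y = {0, 1, 0, 1, 0, 1, 0, 1};
        Map<String, Learner> candidates = new LinkedHashMap<>();
        candidates.put("majority", MAJORITY);
        candidates.put("nearest-centroid", NEAREST_CENTROID);
        System.out.println(pickBest(candidates, x, y, 4)); // prints "nearest-centroid"
    }
}
```

In a setup like the one listed under Datasets, the selection step would use only the cross-validation portion, and the chosen method would then be scored once on the independent validation set, so that the reported performance is not inflated by the selection itself. Whether this matches the software's procedure in detail is an assumption.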

With datasets (in .tar.gz format)
Without datasets (in .tar.gz format)