Intergenic Annotation. The Gerstein lab has been a major participant in a number of ENCODE efforts, focusing on intergenic annotation. We developed an approach to analyze the distribution of regulatory elements found in many different ChIP-chip experiments (Zhang et al., 2006). We focused on the overall chromosomal distribution of regulatory elements in the ENCODE regions and showed that it is highly nonrandom. Our results indicate that these elements are clustered into regulatory-rich ‘islands’ and regulatory-poor ‘deserts.’ We then performed a multivariate analysis on all the factors collectively, which grouped the transcription factors into sequence-specific and sequence-nonspecific clusters. Following up on this, we developed an approach for integrating the results of many ChIP-chip experiments to discover new promoters in the human ENCODE regions (Trinklein et al., 2006). We have also carried out comprehensive pseudogene annotation of the human genome and cross-referenced this annotation with tiling arrays (Zheng et al., 2005, 2007; Harrison et al., 2005; Zheng et al., 2006; Zheng & Gerstein, 2006, 2007; Zhang et al., 2003). We performed a related analysis of structured RNAs in intergenic regions, also inter-relating them with tiling array data (Zhang et al., 2007; Washietl et al., 2007).
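
As a rough illustration of what "nonrandom" means for the island/desert picture above, one simple statistic is the variance-to-mean ratio of element counts in fixed windows: uniformly random placement gives a ratio near 1, while clustering pushes it well above 1. The sketch below is not the analysis of Zhang et al. (2006); the function name, window size, and coordinates are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance

def dispersion_index(positions, region_length, window):
    """Variance-to-mean ratio of per-window element counts.

    Roughly 1 for uniformly random (Poisson-like) placement,
    larger than 1 when elements cluster into a few windows.
    """
    n_windows = region_length // window
    counts = [0] * n_windows
    for p in positions:
        idx = p // window
        if idx < n_windows:          # ignore positions past the last full window
            counts[idx] += 1
    m = mean(counts)
    return pvariance(counts) / m if m > 0 else float("nan")

# Hypothetical coordinates (bp) within an 8 kb region, 1 kb windows.
clustered = [100, 150, 220, 300, 5100, 5200, 5250, 5300]        # two "islands"
evenly_spread = [500, 1500, 2500, 3500, 4500, 5500, 6500, 7500]  # one per window

island_like = dispersion_index(clustered, 8000, 1000)
uniform_like = dispersion_index(evenly_spread, 8000, 1000)
```

Here the clustered layout yields a ratio well above 1 and the evenly spread layout a ratio of 0, which is the direction of the effect described in the text; a real analysis would also need a significance test against a suitable null model.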

Tiling Array Tools. The Gerstein Lab has developed a considerable number of tools and machinery for processing tiling arrays. Most of these are described elsewhere in the proposal (e.g. scoring arrays, sect. C.4). One tool not described elsewhere is BoCaTFBS (Wang et al., 2006), which refines ChIP-chip hits by considering known motifs. Traditional computational algorithms used to identify binding sites, such as consensus sequences (Osada et al., 2004), profile methods (PWM/PSSM) (Stormo, 2000), and HMMs (Pavlidis et al., 2001; Ellrott et al., 2002), generate high false-positive rates when applied genome-wide (Wasserman & Sandelin, 2004). Our method uses a boosted cascade of classifiers -- specifically, alternating decision trees (ADTboost) (Freund & Schapire, 1999), where ADTboost is a special extension of AdaBoost. We use the known motifs (e.g. from the TRANSFAC database, Matys et al., 2003) as positives and regions that the ChIP-chip experiments show to be unbound as negatives. Our method is the first motif finder that explicitly takes into account the data from ChIP-chip experiments. Moreover, BoCaTFBS differs from most other motif programs in that (1) it takes into consideration positional dependencies within a given motif and (2) it uses the negative data (regions where the transcription factor does not bind) in order to refine the binding site.
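
To make the boosting idea concrete, the following is a minimal sketch of plain AdaBoost over single-position decision stumps, trained on known sites as positives and unbound sequences as negatives. It is not BoCaTFBS and not ADTboost: single-position stumps cannot capture the positional dependencies that alternating decision trees can, and all sequences, names, and parameters here are invented for illustration.

```python
import math

BASES = "ACGT"

def stump_predict(stump, seq):
    """A stump votes `polarity` if seq has `base` at `pos`, else `-polarity`."""
    pos, base, polarity = stump
    return polarity if seq[pos] == base else -polarity

def train_adaboost(seqs, labels, rounds=3):
    """Plain AdaBoost; labels are +1 (bound) or -1 (unbound)."""
    n = len(seqs)
    weights = [1.0 / n] * n
    candidates = [(p, b, s) for p in range(len(seqs[0]))
                  for b in BASES for s in (1, -1)]
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best, best_err = None, float("inf")
        for stump in candidates:
            err = sum(w for w, s, y in zip(weights, seqs, labels)
                      if stump_predict(stump, s) != y)
            if err < best_err:
                best, best_err = stump, err
        best_err = min(max(best_err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        model.append((alpha, best))
        # Up-weight misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(best, s))
                   for w, s, y in zip(weights, seqs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def score(model, seq):
    """Weighted vote of all stumps; a positive score means 'predicted bound'."""
    return sum(alpha * stump_predict(stump, seq) for alpha, stump in model)

# Hypothetical toy data: motif-like positives, unbound negatives.
positives = ["TTCCGGAA", "TTCCAGAA", "TTCTGGAA", "TTCCCGAA"]
negatives = ["ACGTACGT", "GGGTTTAA", "CATGCATG", "TTCAGGCC"]
model = train_adaboost(positives + negatives, [1] * 4 + [-1] * 4, rounds=3)
```

No single position separates this toy set, so the classifier has to combine several weak position-specific rules, and the negative examples directly shape which positions are chosen. That is the role the ChIP-chip negatives play in the description above.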

Regulatory Networks. The Gerstein lab has carried out a number of analyses of the large-scale structure of regulatory networks (e.g. Yu & Gerstein, 2006; Luscombe et al., 2004; Yu et al., 2003, 2004) and has developed tools to enable their analysis (Yip et al., 2006; Yu et al., 2004).



C4.2. Analysis of the ChIP samples

C4.2a.  Comparison of ChIP-chip Platforms and Parameters. At the outset of the ENCODE pilot phase, both PCR product arrays and high-density oligonucleotide (HDO) arrays were being used by various groups. However, their overall performance and suitability for genome-scale analyses had never been compared. Therefore, we began by comparing PCR arrays with 390,000-feature arrays prepared by maskless photolithography (e.g. NimbleGen), performing ChIP-chip for three yeast factors (Borneman et al., 2006b). We found that the HDO arrays detected approximately three times more binding events than the PCR arrays while also showing increased accuracy. We also investigated optimal parameters for mapping binding sites in mammalian cells using ChIP-chip (Euskirchen et al., 2007). We compared parameters such as DNA microarray format (oligonucleotide versus PCR), oligonucleotide length, hybridization conditions, and the use of competitor Cot DNA. We also devised methods for scoring and comparing results. Optimal signal-to-noise, sensitivity, and specificity were observed with HDO arrays relative to PCR product arrays (Figure 2). We observed optimal signals with oligonucleotides >36 b and with Cot competitor DNA present (see Figure 2 for the oligonucleotide length results). Consistent with the better performance of arrays containing longer oligonucleotides, we were not able to identify STAT1 targets using Affymetrix arrays (which use 25-mers). We also determined target identification as a function of the number of biological replicates: 80-86% of targets were identified with three experimental repeats, and a fourth provides a modest 3-5% increase (Euskirchen et al., 2007). As part of the ENCODE consortium, we also carried out a parallel study comparing platforms for their suitability for assaying transcription (Emanuelsson et al., 2006). Overall, our conclusions were similar to those in Euskirchen et al. (2007) -- that is, HDO arrays performed better than PCR arrays and feature density was all-important. However, for transcript mapping we obtained better performance with 25-mers than with 36-mers. Thus, the HDO array platform provides a far more robust array system by all measures than PCR-based arrays, which is directly attributable to the large number of probes available. Since we performed this study, Agilent has also begun to manufacture HDO arrays. However, the density (244K features) and price ($500) of their highest-density array do not make them competitive with the NimbleGen whole-genome 10-array set (2.1 million features/array). Affymetrix arrays are competitive in price ($1500 for one set). However, some of our team members have found that they are not as effective as NimbleGen arrays when analyzing certain factors, and they cannot be reused (NimbleGen arrays can be used at least twice, and new chemistry that should increase the number of rehybridizations per array is currently being tested by one of our team members in collaboration with NimbleGen). Thus, at this time NimbleGen arrays provide the highest sensitivity and the most value of the array platforms.


Figure 2. Comparison of ChIP-chip results using different array platforms and oligonucleotide lengths.  Raw signals from each oligonucleotide are plotted.  Key: PCR: PCR product array; 36-36: 36 b oligonucleotides arrayed end to end; 50-50: 50 b oligonucleotides arrayed end to end; a-f: validated positive regions - negative control region.


C4.2b. Comparison of ChIP-chip with ChIP-Sequencing. We have also compared ChIP-chip to ChIP followed by DNA sequence analysis of fragment ends (ChIP-PET; Ng et al., 2006) for the STAT1 transcription factor (Euskirchen et al., 2007). We found that these two technologies were comparable in terms of sensitivity and specificity (Figure 3). Seven of 8 targets enriched 4-fold or more were found with both methods; 4-fold is approximately the threshold at which ChIP targets reproducibly validate by qPCR (see below). The new 2.1 million feature array design contains more probes in repeated regions and is expected to identify all 8 (Euskirchen et al., 2006). The resolution of ChIP-chip is generally higher than that of ChIP-PET (targets can be mapped and validated to within 200 bp; K.S., P.R., M.S., unpublished; see also XXX), particularly on the less enriched peaks (see Figure 3). At this point, ChIP-PET is also considerably more expensive than analysis of samples using the NimbleGen 10-array set, even with the new 454 sequencing technologies. However, we will continue to compare ChIP-chip and ChIP-sequencing during the production phase to ensure that we are always using the technique that provides the highest-quality and most comprehensive data for the lowest price.



Figure 3. Comparison of ChIP-chip and ChIP-PET results.  Raw signals from each method are plotted.  The concordance of the results for both signals (shown above) and called target regions is quite high (greater than 90% for the top targets, estimated to be 4-fold enriched).
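
One simple way to quantify the concordance of called target regions between two methods is the fraction of targets from one method that overlap at least one target from the other. The sketch below assumes targets are given as (chromosome, start, end) half-open intervals; the function names and coordinates are hypothetical, and this is not necessarily the comparison procedure used in Euskirchen et al. (2007).

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def concordance(targets_a, targets_b):
    """Fraction of targets in A overlapped by at least one target in B."""
    if not targets_a:
        return 0.0
    hits = sum(1 for a in targets_a
               if any(overlaps(a, b) for b in targets_b))
    return hits / len(targets_a)

# Hypothetical toy target lists: eight array targets, seven of which
# are also recovered (with slightly shifted boundaries) by sequencing.
chip_chip = [("chr1", i * 10000, i * 10000 + 500) for i in range(8)]
chip_pet = [("chr1", i * 10000 + 100, i * 10000 + 700) for i in range(7)]

chip_chip_recovered = concordance(chip_chip, chip_pet)
chip_pet_recovered = concordance(chip_pet, chip_chip)
```

The measure is asymmetric, so it is worth reporting in both directions. For large target lists a sorted sweep or interval tree would replace the quadratic scan used here.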



C5.  Bioinformatics - Processing pipeline for target identification and integration of primary data (ChIP-chip) with secondary validation data.

We have extensive experience with the informatics for carrying out tiling array experiments. This comprises designing the arrays, developing scoring approaches, comparing platforms, constructing high-throughput pipelines for the NimbleGen data, and building a special-purpose database for handling the results of the experiments.


C5.1. Array Processing Platforms -- ExpressYourself & Tilescope

We have developed two generations of tiling array software: ExpressYourself for PCR arrays (Luscombe et al., 2003) and Tilescope for oligo arrays (Zhang et al., 2007). Tilescope is an automated data processing pipeline for analyzing data sets generated in experiments using high-density tiling microarrays. The software performs data normalization, combination of replicate experiments, tile scoring, and feature identification. Given the modular architecture of the pipeline, new analysis algorithms can be readily incorporated. Tilescope is capable of handling very large data sets, such as those generated by whole-genome ChIP-chip experiments, because we developed an efficient new data compression algorithm that reduces the data size for fast online data transmission. Tilescope has a user-friendly graphical interface to facilitate data analysis, and the results, presented on a web page, can be downloaded for further analysis.
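
The four stages named above (normalization, replicate combination, tile scoring, feature identification) can be sketched as a chain of small functions. This is only a schematic under stated assumptions, not Tilescope's code: median-centering, per-tile medians across replicates, a sliding-window median score, and threshold-and-merge feature calling are stand-ins chosen for brevity, and every name, threshold, and data value below is hypothetical.

```python
from statistics import median

def median_center(log_ratios):
    """Normalization stand-in: shift one replicate so its median is 0."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]

def combine_replicates(replicates):
    """Combine replicates tile by tile using the median."""
    return [median(vals) for vals in zip(*replicates)]

def window_scores(values, half_width=1):
    """Score each tile by the median of a window centered on it."""
    n = len(values)
    return [median(values[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]

def call_features(positions, scores, threshold, max_gap):
    """Merge above-threshold tiles separated by at most max_gap bp."""
    features = []
    for pos, s in zip(positions, scores):
        if s < threshold:
            continue
        if features and pos - features[-1][1] <= max_gap:
            features[-1][1] = pos          # extend the current feature
        else:
            features.append([pos, pos])    # start a new feature
    return [tuple(f) for f in features]

def run_pipeline(positions, replicates, threshold=1.0, max_gap=50):
    normalized = [median_center(r) for r in replicates]
    combined = combine_replicates(normalized)
    scores = window_scores(combined)
    return call_features(positions, scores, threshold, max_gap)

# Hypothetical log2 ratios for 10 tiles spaced 50 bp apart, 3 replicates.
positions = [i * 50 for i in range(10)]
replicates = [
    [0.1, 0.0, -0.1, 0.2, 2.0, 2.5, 2.1, 0.1, 0.0, -0.2],
    [0.3, 0.2, 0.1, 0.4, 2.3, 2.6, 2.2, 0.2, 0.3, 0.1],
    [-0.1, 0.0, -0.2, 0.1, 1.9, 2.4, 2.0, 0.0, -0.1, -0.3],
]
features = run_pipeline(positions, replicates)
```

Because each stage takes and returns plain lists, any one of them can be swapped for a different algorithm without touching the others, which is the practical meaning of the modular architecture described above.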


C5.2. Approaches for Array Scoring and Artifact Correction

Tilescope incorporates a number of scoring and artifact correction approaches we have developed for tiling arrays. In particular, for NimbleGen arrays, we identified those procedures in the microarray normalization literature that are transferable to tiling arrays and those that needed to be modified or replaced (Royce et al., 2005, 2006). One issue is that only a small fraction of probes on a tiling array experience significant levels of nucleic acid hybridization, whereas many normalization procedures assume hybridization to at least half of all probes on an array. Another issue is that tiling a whole genome, for example, requires many arrays, each of a different design. Care must be taken when normalizing these arrays to one another because the underlying distribution of measured signals may vary widely among arrays tiling different regions of the genome. We have also worked extensively on compensating for cross-hybridization. In Royce et al. (2007) we developed a correction system for non-specific, sequence-based cross-hybridization that is readily applicable to tiling arrays. We have also developed tools for the optimal design of tiling arrays to minimize the effect of repetitive and specifically cross-hybridizing probes (Bertone et al., 2006). These algorithms are implemented as a collection of web-based tools. Finally, in processing large amounts of expression data it is important to be able to discover and correct various spatial array artifacts. Qian et al. (2003), Kluger et al. (2003), and Yu et al. (2007) developed a way of measuring and quantifying a common spatial artifact that can induce spurious correlation of nearby spots on the array.
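
A very simple diagnostic for the kind of spatial artifact mentioned above is the correlation between each spot and its physical neighbor on the array: if array position carried no information, this correlation should be near zero for probes laid out without regard to genomic order. The sketch below is a generic illustration, not the measure published in the cited papers; the grid layout and all names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)

def neighbor_correlation(grid):
    """Correlation between each spot and its right-hand neighbor.

    `grid` is a list of rows of intensities in physical array layout.
    """
    left, right = [], []
    for row in grid:
        for a, b in zip(row, row[1:]):
            left.append(a)
            right.append(b)
    return pearson(left, right)

# Hypothetical 4x5 array with a smooth intensity gradient (an artifact).
gradient = [[r + c for c in range(5)] for r in range(4)]
# Hypothetical array whose neighboring spots are uncorrelated.
balanced = [[1, 0, -1, 0, 1] for _ in range(3)]

artifact_r = neighbor_correlation(gradient)
clean_r = neighbor_correlation(balanced)
```

A high value on a real array would prompt a closer look at spatial trends before computing correlations between probes, since physically adjacent probes would otherwise appear co-regulated.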


C5.3. Tools to integrate the list of functional elements -- DART

We have developed DART (Rozowsky et al., 2007) to facilitate the flexible storage, visualization, and comparison of the growing number of experimentally defined sets of transcription factor binding sites. DART has been designed to address a number of challenging issues that arise when attempting to analyze, compare, and store these types of data. The key aspects of DART include the following: dealing with heterogeneous datasets, flexibility for storing different tiling array hit attributes, accommodating new genome builds, and integrated linking to other web resources for broader visualization and analysis. Furthermore, DART provides machinery for helping to pick targets for validation.



H. Literature Cited

