Then V_i = (U_i + Y_i) / 2
Else if only Y_i exists, V_i = Y_i
Else V_i = U_i

As shown above, when only one data set has a value for a given ORF, we incorporated that value rather than excluding it. When both data sets have values for an ORF, we averaged the values if they were within 15% of each other; otherwise, we retained the value from the original chip data set, U_i. We used a = 15% in order to prevent outliers from skewing the result. This 15% value is a reasonable threshold for excluding outliers, though other values (e.g. 10% or 20%) would give similar results (data not shown). Further data sets are then included by the same procedure, with each iteration starting from the new expression values V_i. The initial iteration starts with the Young Expression Set as U_i since we have the highest confidence in its accuracy.

The SAGE data were not included in the above procedure since they are of a fundamentally different nature. An advantage of SAGE technology over Gene Chips is that its signal cannot saturate at high expression levels, as can happen with chips (Futcher et al. 1999). Conversely, SAGE values are less reliable for lowly expressed genes, since a SAGE tag corresponding to such a gene might not be sequenced at all. Therefore, if after the last iteration the average Gene Chip expression level V_i was both above a certain threshold b and below the SAGE expression level S_i for the same gene, it was replaced with the SAGE value; otherwise the average Gene Chip value was kept. This gave us our final expression set w_mRNA. Our treatment of the SAGE data is modeled after that in Futcher et al. (1999), and like them, we used b = 16. This incorporation of the SAGE data into the reference data set ensures that the highly expressed outliers are as accurate as possible.
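The merging rule and the SAGE substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`within_cutoff`, `merge_step`, `apply_sage`), the ORF labels, and all numeric values are made up, Y is assumed to be already scaled onto the reference set, and "within 15% of each other" is read here as a difference of at most a times the larger of the two values, since the passage does not define it precisely.

```python
from typing import Dict

A = 0.15  # cutoff a from the text
B = 16.0  # SAGE threshold b from the text


def within_cutoff(u: float, y: float, a: float = A) -> bool:
    """ASSUMPTION: 'within 15%' means |u - y| <= a * max(|u|, |y|)."""
    larger = max(abs(u), abs(y))
    return larger == 0 or abs(u - y) <= a * larger


def merge_step(U: Dict[str, float], Y: Dict[str, float],
               a: float = A) -> Dict[str, float]:
    """One iteration: combine reference values U with scaled values Y."""
    V: Dict[str, float] = {}
    for orf in U.keys() | Y.keys():
        u, y = U.get(orf), Y.get(orf)
        if u is not None and y is not None:
            # Average only when the two values agree; otherwise keep U_i.
            V[orf] = 0.5 * (u + y) if within_cutoff(u, y, a) else u
        elif y is not None:
            V[orf] = y  # only Y_i exists
        else:
            V[orf] = u  # only U_i exists
    return V


def apply_sage(V: Dict[str, float], S: Dict[str, float],
               b: float = B) -> Dict[str, float]:
    """Replace V_i by S_i when V_i is above b and below the SAGE value."""
    out = dict(V)
    for orf, v in V.items():
        s = S.get(orf)
        if s is not None and b < v < s:
            out[orf] = s
    return out


# Hypothetical toy data (not from the paper).
reference = {"ORF1": 10.0, "ORF2": 20.0, "ORF3": 100.0}
other = {"ORF1": 11.0, "ORF2": 40.0, "ORF4": 3.0}
sage = {"ORF1": 50.0, "ORF3": 150.0}

merged = merge_step(reference, other)
final = apply_sage(merged, sage)
print(sorted(final.items()))
```

In this toy run, ORF1 is averaged (10 and 11 agree within the cutoff), ORF2 keeps the reference value because the two measurements disagree, ORF4 is taken from the second set alone, and only ORF3 is replaced by its SAGE value because ORF1 stays below the threshold b.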
Unlike plain arithmetic averaging, this overall scaling procedure with the cutoff a avoids “artificial averages” that combine very different values for a particular gene. Some expression values might be statistical outliers. In addition, the expression levels of some genes may be confined to mutually exclusive ranges or modes, such as when two alternative pathways are switched on or off. Simply averaging these would give values that are less representative of the particular mode values. This situation is analogous to averaging together an ensemble of protein structures, say from an NMR structure determination. Each structure in the ensemble could be stereochemically correct, with all side-chain atoms in predefined rotamer configurations. However, an average of all structures in the ensemble could yield one that is stereochemically incorrect if this involved averaging over particular side-chains in different rotameric states.

With regard to our regression analysis, we investigated both non-linear and linear fits and found the non-linear procedure to be more advantageous. The non-linear relationship between different expression datasets perhaps reflects saturation in one or more of the gene chips, which is not an uncommon phenomenon. This non-linearity is immediately evident on scatter plots of two datasets against one another (see website). Accordingly, the non-linear fit produces a smaller residual than the linear fit: 98297 (non-linear) versus 122182 (linear) for the scaling of the Church dataset and 59828 (non-linear) versus 67462 (linear) for the Samson dataset.
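The comparison of residuals between a linear and a non-linear fit can be illustrated with a small, self-contained sketch. The passage does not state which non-linear functional form was used, so a quadratic polynomial is assumed here purely for illustration. The data are synthetic values that flatten at high x to mimic chip saturation, and the helper names `solve` and `polyfit_rss` are hypothetical.

```python
from typing import List, Sequence, Tuple


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def polyfit_rss(x: Sequence[float], y: Sequence[float],
                degree: int) -> Tuple[List[float], float]:
    """Least-squares polynomial fit via the normal equations.

    Returns the coefficients (constant term first) and the residual
    sum of squares.
    """
    m = degree + 1
    A = [[sum(xi ** (j + k) for xi in x) for k in range(m)] for j in range(m)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(m)]
    coeffs = solve(A, b)
    rss = sum(
        (yi - sum(c * xi ** p for p, c in enumerate(coeffs))) ** 2
        for xi, yi in zip(x, y)
    )
    return coeffs, rss


# Synthetic, saturating data (ASSUMPTION: not taken from any real chip).
xs = [float(i) for i in range(11)]
ys = [100.0 * xi / (xi + 3.0) for xi in xs]

_, rss_linear = polyfit_rss(xs, ys, degree=1)
_, rss_quadratic = polyfit_rss(xs, ys, degree=2)
print(f"linear RSS: {rss_linear:.1f}, quadratic RSS: {rss_quadratic:.1f}")
```

Because the straight line is a special case of the quadratic, the quadratic residual can never be larger on the same data; the size of the reduction shows how strongly the relationship departs from linearity.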