Greenbaum et al
Then Vi = (Ui + Yi) / 2
Else if only Yi exists, Vi = Yi
Else Vi = Ui
As presented above, where only one data set has a value for a given ORF, we incorporated that
value rather than excluding it. When both data sets have values for an ORF, we averaged the
values if they were within 15% of each other; otherwise, we retained the value from the original
chip data set, Ui. We used a = 15% in order to prevent outliers from skewing the result. This is a
reasonable threshold for excluding outliers, though other values (e.g. 10% or 20%) would give
similar results (data not shown). Further data sets are subsequently incorporated by the same
procedure, with each iteration starting from the new expression values Vi. The initial iteration
starts with the Young Expression Set as Ui, since we have the highest confidence in its accuracy.
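
The merging rule described above can be sketched in Python as follows. This is an illustrative sketch rather than the code used for the analysis. In particular, the exact definition of "within 15% of each other" is not restated in this section, so the relative-difference test in `within_cutoff` is an assumption, as are all function names and the ORF labels in the usage example. The sketch also assumes that the data sets have already been brought onto a common scale.

```python
from typing import Dict, List, Optional

ALPHA = 0.15  # the cutoff a from the text (15%)


def within_cutoff(u: float, y: float, alpha: float = ALPHA) -> bool:
    """Assumed closeness test: difference relative to the larger magnitude."""
    scale = max(abs(u), abs(y))
    if scale == 0.0:
        return True  # both values are zero, so they agree
    return abs(u - y) / scale <= alpha


def merge_value(u: Optional[float], y: Optional[float],
                alpha: float = ALPHA) -> Optional[float]:
    """Combine the running value u (Ui) with a new value y (Yi) for one ORF."""
    if u is None:
        return y  # only Yi exists (or neither does): Vi = Yi
    if y is None:
        return u  # only Ui exists: Vi = Ui
    if within_cutoff(u, y, alpha):
        return (u + y) / 2.0  # values agree: Vi = (Ui + Yi) / 2
    return u  # values disagree: keep the running value Ui


def merge_datasets(reference: Dict[str, float],
                   others: List[Dict[str, float]],
                   alpha: float = ALPHA) -> Dict[str, Optional[float]]:
    """Fold each further data set into the running values Vi, in order."""
    current: Dict[str, Optional[float]] = dict(reference)
    for dataset in others:
        orfs = set(current) | set(dataset)
        current = {orf: merge_value(current.get(orf), dataset.get(orf), alpha)
                   for orf in orfs}
    return current


# Usage with made-up values and hypothetical ORF labels.
reference_set = {"ORF_A": 10.0, "ORF_B": 5.0}
second_set = {"ORF_A": 11.0, "ORF_B": 9.0, "ORF_C": 2.0}
merged = merge_datasets(reference_set, [second_set])
```

Here the two values for `ORF_A` fall within the cutoff and are averaged, `ORF_B` keeps its reference value because the two measurements differ by more than 15%, and `ORF_C` is taken from the only data set that reports it.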
The SAGE data were not included in the above procedure since they are of a fundamentally different
nature. An advantage of the SAGE technology over Gene Chips is that there is no signal saturation
at high expression levels, as is possible for chips (Futcher et al. 1999). Conversely, SAGE values
are less reliable for lowly expressed genes, since there is a chance that no SAGE tag corresponding
to such a gene is sequenced at all. Therefore, if after the last iteration the average Gene Chip
expression level Vi was both above a certain threshold b and below the SAGE expression level Si for
the same gene, it was replaced with the SAGE value; otherwise the average Gene Chip value was kept.
This gave us our final expression set wmRNA. Our treatment of the SAGE data is modeled after that
in Futcher et al. (1999), and like them, we used b = 16.
This incorporation of the SAGE data into the reference data set ensures that the highly expressed
outliers are as accurate as possible.
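
The SAGE replacement rule can be written as a short Python sketch. The function name `apply_sage` is hypothetical, and the handling of a gene with no SAGE value (the chip average is simply kept) is an assumption, since the text does not address that case.

```python
from typing import Optional

BETA = 16.0  # the threshold b from the text


def apply_sage(v: Optional[float], s: Optional[float],
               beta: float = BETA) -> Optional[float]:
    """Return the final value for one gene from chip average v (Vi) and SAGE value s (Si)."""
    if v is None or s is None:
        return v  # assumption: with either value missing, nothing is replaced
    if v > beta and v < s:
        return s  # chip value is high but below SAGE: use the SAGE value
    return v      # otherwise keep the average Gene Chip value
```

For example, with made-up values, `apply_sage(20.0, 30.0)` takes the SAGE value, whereas `apply_sage(10.0, 30.0)` keeps the chip average because it lies below the threshold.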
Unlike plain arithmetic averaging, this overall scaling procedure with the a cutoff avoids
artificial averages that combine very different values for a particular gene. Some expression
values might be statistical outliers. In addition, the expression levels of a variety of genes may
fall only within mutually exclusive ranges or modes, such as when two alternative pathways are
switched on or off. Simply averaging these would give values that are less representative of
either mode. This situation is analogous to averaging together an ensemble of protein structures,
say from an NMR structure determination. Each
together an ensemble of protein structures, say from an NMR structure determination. Each
structure in the ensemble could be stereochemically correct, with all side-chain atoms in predefined
rotamer configurations. However, an average of all structures in the ensemble could yield one that
is stereochemically incorrect if this involved averaging over particular side-chains in different
rotameric states.
With regard to our regression analysis, we investigated both non-linear and linear fits and
found the non-linear procedure to be more advantageous. The non-linear relationship between
different expression datasets perhaps reflects saturation in one or more of the gene chips, a not
uncommon phenomenon. This non-linearity is immediately evident on scatter plots of two datasets
against one another (see website). Accordingly, the non-linear fit produces a smaller residual than
the linear fit: 98297 (non-linear) versus 122182 (linear) for the scaling of the Church dataset and
59828 (non-linear) versus 67462 (linear) for the Samson dataset.
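
The comparison of residuals can be illustrated with a small Python sketch. The functional form of the non-linear fit is not specified in this section, so a quadratic polynomial is used here purely as a stand-in, and the residual is taken to be the sum of squared differences, which is also an assumption. The data are made up to mimic one chip saturating at high expression levels, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first (normal equations)."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(a, b)


def residual_sum_of_squares(xs, ys, coeffs):
    """Sum of squared differences between the data and the fitted polynomial."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (y - predicted) ** 2
    return total


# Made-up data in which the second measurement levels off at high values.
xs = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
ys = [1.1, 2.0, 3.7, 5.0, 5.9, 6.4, 6.6]

rss_linear = residual_sum_of_squares(xs, ys, polyfit(xs, ys, 1))
rss_quadratic = residual_sum_of_squares(xs, ys, polyfit(xs, ys, 2))
print("linear:", rss_linear, "quadratic:", rss_quadratic)
```

Because a straight line is a special case of a quadratic, the quadratic residual can never exceed the linear one on the same data; what matters in practice is how large the reduction is, as in the figures quoted above for the Church and Samson datasets.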