Then Vi = (Ui + Yi) / 2
Else if  only Yi exists, Vi = Yi 
Else Vi = Ui 
 
As presented above, where only one data set has a value for the corresponding ORF, we 
incorporated that value rather than excluding it. When both data sets have values for an ORF, we 
averaged the values if they were within 15% of each other; otherwise, we kept the value from the 
original chip data set, Ui. We used α = 15% to prevent outliers from skewing the result. 
This 15% value is a reasonable threshold for excluding outliers, though other values (e.g. 10% or 
20%) would give similar results (data not shown). Further data sets are then incorporated by the 
same procedure, with each iteration starting from the new expression values Vi. The initial iteration 
starts with the Young Expression Set as Ui since we have the highest confidence in its accuracy. 
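
The merging rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The function names `merge_expression` and `merge_all`, the dictionary representation (ORF name mapped to expression level), and the reading of "within 15% of each other" as a relative difference measured against the larger of the two values are all assumptions; the data sets are also assumed to have been scaled to a common reference beforehand.

```python
from typing import Dict, Iterable

ALPHA = 0.15  # cutoff from the text: average only if the two values agree within 15%


def merge_expression(u: Dict[str, float], y: Dict[str, float],
                     alpha: float = ALPHA) -> Dict[str, float]:
    """Merge a new data set y into the current reference values u, giving V."""
    merged = {}
    for orf in u.keys() | y.keys():
        if orf in u and orf in y:
            ui, yi = u[orf], y[orf]
            # Assumed closeness test: relative difference w.r.t. the larger value.
            larger = max(abs(ui), abs(yi))
            if larger == 0 or abs(ui - yi) / larger <= alpha:
                merged[orf] = (ui + yi) / 2   # values agree: average them
            else:
                merged[orf] = ui              # values disagree: keep Ui
        elif orf in y:
            merged[orf] = y[orf]              # only Yi exists
        else:
            merged[orf] = u[orf]              # only Ui exists
    return merged


def merge_all(reference: Dict[str, float],
              others: Iterable[Dict[str, float]],
              alpha: float = ALPHA) -> Dict[str, float]:
    """Fold further data sets into the reference one at a time."""
    v = dict(reference)
    for y in others:
        v = merge_expression(v, y, alpha)
    return v
```

With hypothetical inputs `u = {"ORF1": 10.0, "ORF2": 10.0}` and `y = {"ORF1": 9.0, "ORF2": 20.0}`, the first ORF is averaged and the second keeps its reference value, because 10 and 20 differ by far more than 15%.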
 
The SAGE data were not included in the above procedure since they are of a fundamentally different 
nature. An advantage of the SAGE technology over Gene Chips is that it is not subject to signal 
saturation at high expression levels, as chips can be (Futcher et al. 1999).  Conversely, 
SAGE values are less reliable for weakly expressed genes, since a SAGE tag corresponding to such a 
gene may not be sequenced at all.  Therefore, if after the last iteration, 
the average Gene Chip expression level Vi was both above a certain threshold β and below the 
SAGE expression level Si for the same gene, it was replaced with the SAGE value; otherwise the 
average Gene Chip value was kept. This gave us our final expression set wmRNA. Our treatment of 
the SAGE data is modeled after that in Futcher et al. (1999), and like them, we used β = 16. 
This incorporation of the SAGE data into the reference data set ensures that the highly expressed 
outliers are as accurate as possible. 
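
The SAGE replacement rule can be written as a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function name `apply_sage` and the dictionary inputs are hypothetical, and genes that appear only in the SAGE data are left out here because the text does not say how they were handled.

```python
from typing import Dict

BETA = 16.0  # threshold from the text, following Futcher et al. (1999)


def apply_sage(v: Dict[str, float], s: Dict[str, float],
               beta: float = BETA) -> Dict[str, float]:
    """Replace a merged chip value Vi by the SAGE value Si when
    Vi is above beta and below Si; otherwise keep Vi."""
    final = {}
    for orf, vi in v.items():
        si = s.get(orf)  # None if the gene has no SAGE measurement
        if si is not None and beta < vi < si:
            final[orf] = si
        else:
            final[orf] = vi
    return final
```

The rule only ever raises a value: a chip level that may have saturated is replaced by a higher SAGE level, while weakly expressed genes (at or below the threshold) always keep their chip value.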
  
Unlike plain arithmetic averaging, this overall scaling procedure with the α cutoff avoids 
artificial averages that combine very different values for a particular gene. Some expression 
values might be statistical outliers. In addition, it may be possible that the expression levels of a 
variety of genes can only be within mutually exclusive ranges or modes, such as when two 
alternative pathways are switched on or off. Simply averaging these would give values that are less 
representative of the particular mode values. This situation is analogous to that in averaging 
together an ensemble of protein structures, say from an NMR structure determination. Each 
structure in the ensemble could be stereochemically correct, with all side-chain atoms in predefined 
rotamer configurations. However, an average of all structures in the ensemble could yield one that 
is stereochemically incorrect if this involved averaging over particular side-chains in different 
rotameric states. 
 
With regard to our regression analysis, we have investigated both non-linear and linear fits but 
found a non-linear procedure to be more advantageous.  The non-linear relationship between 
different expression datasets perhaps reflects saturation in one or more of the gene chips -- not an 
uncommon phenomenon.  This non-linearity is immediately evident on scatter plots of two datasets 
against one another (see website).  Accordingly, the non-linear fit produces a smaller residual than 
the linear fit: 98297 (non-linear) versus 122182 (linear) for the scaling of the Church dataset and 
59828 (non-linear) versus 67462 (linear) for the Samson dataset.
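

The comparison of residuals between a linear and a non-linear fit can be illustrated with a small, self-contained Python sketch. The text does not state the functional form of the non-linear fit or the exact definition of the residual, so a quadratic polynomial and a sum of squared differences are assumed here purely for illustration; the helper names `polyfit` and `residual` and the data values are made up and do not reproduce the figures quoted above.

```python
from typing import List, Sequence


def polyfit(x: Sequence[float], y: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def residual(x: Sequence[float], y: Sequence[float], coeffs: Sequence[float]) -> float:
    """Sum of squared differences between y and the fitted polynomial."""
    return sum((yi - sum(c * xi ** p for p, c in enumerate(coeffs))) ** 2
               for xi, yi in zip(x, y))


# Synthetic, made-up values in which y flattens at high x (a saturation-like shape).
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [1.1, 2.0, 3.9, 7.0, 11.0, 14.0]
linear_rss = residual(x, y, polyfit(x, y, 1))
quadratic_rss = residual(x, y, polyfit(x, y, 2))
print(linear_rss, quadratic_rss)
```

Because a straight line is a special case of a quadratic, the quadratic residual can never exceed the linear one on the same data; the size of the gap is what indicates how strongly the relationship departs from linearity.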