Greenbaum et al
Enrichment of Features
Formalism
Figure 2 focuses on individual proteins. In the next part of our analysis, we want to group a number
of proteins together into various categories based on common features and characterize those
features that are enriched in one population relative to another, i.e. the translatome population of
proteins as measured by 2D gels relative to the transcriptome population of transcripts or the
genome population of genes. To this end, we set up a formalism that could be applied universally
to all the attributes that we were interested in. Due to the limitations of the experiments, the
translatome, transcriptome, and genome populations are defined on different sets of genes, and
sometimes we want to remove this selection bias by forcing them to be compared on exactly the
same set of genes. This is a key aspect of our formalism, as presented in Figure 1.
We call an entity like [w, G] a "population", where G is a set describing a particular
selection of genes from the genome and w is a vector of weights associated with each element of this
population. In particular, we focus on three main populations here:
(i) [1, G_Gen] is the population of genes in the genome, all 6280 genes weighted once (w = 1).
(ii) [w_mRNA, G_mRNA] is the observed population of the transcripts in the transcriptome, i.e.
the 6249 genes in the reference expression set weighted by their reference expression
value.
(iii) [w_Prot, G_Prot] is the observed cellular population of the proteins in the translatome, i.e.
the 181 genes in the reference abundance set weighted by their reference abundance
value.
(The set of genes in the genome, G_Gen, is approximately equal to the set G_mRNA, so we
can use both symbols interchangeably.) We can also use this notation to describe specific
experiments, e.g. [w_lacZ, G_lacZ] describes the gene set and weights relating to the Transposon
Abundance set.
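The population notation, and the restriction to a common gene set mentioned above, can be pictured with the following minimal sketch. It assumes a population [w, G] is stored as a Python dict mapping each ORF in G to its weight; the helper name `restrict`, the ORF names, and all numbers are hypothetical illustrations, not data from the study, and the weights are simply carried over unchanged when the gene set is narrowed.

```python
def restrict(population, genes):
    """Keep only the ORFs of `population` that also appear in `genes`.

    `population` is a dict {ORF name: weight} (an assumed layout for [w, G]).
    Weights are left unchanged; only the gene set G is narrowed.
    """
    return {orf: w for orf, w in population.items() if orf in genes}


# Toy populations (made-up ORF names and weights).
genome = {"orfA": 1, "orfB": 1, "orfC": 1, "orfD": 1}   # every gene weighted once
mrna = {"orfA": 2.5, "orfB": 0.4, "orfC": 7.1}          # expression-level weights
protein = {"orfA": 12.0, "orfC": 3.0}                   # abundance weights

# Compare all three populations on exactly the same set of genes.
common = set(genome) & set(mrna) & set(protein)
genome_common = restrict(genome, common)
mrna_common = restrict(mrna, common)
protein_common = restrict(protein, common)
```

After the restriction, all three dicts share the same keys, so any difference between them comes from the weights rather than from which genes each experiment happened to cover.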
Furthermore, we define Fj as the value of a feature F in ORF j. For example, F could be the
composition of leucine (a real number) or a binary value (0 or 1) indicating whether an ORF
contains a trans-membrane segment. Given these definitions, the weighted average of feature F in
population [w, G] is:
$$\mu(F, [w, G]) \equiv \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$
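As a minimal sketch of this weighted average, assume a population [w, G] is stored as a Python dict mapping each ORF in G to its weight, and a feature F as a dict mapping each ORF to its value. The function name `weighted_average`, the ORF names, and the numbers are illustrative assumptions, not data from the study.

```python
def weighted_average(feature, population):
    """Return the weighted average of a feature over a population.

    Computes sum_j w_j * F_j / sum_j w_j over the ORFs j in G, where
    `population` maps each ORF in G to its weight w_j (assumed layout)
    and `feature` maps each ORF to its feature value F_j.
    """
    total_weight = sum(population.values())
    if total_weight == 0:
        raise ValueError("population has zero total weight")
    weighted_sum = sum(w * feature[orf] for orf, w in population.items())
    return weighted_sum / total_weight


# Toy binary feature: 1 if the ORF contains a trans-membrane segment, else 0.
has_tm = {"orfA": 1, "orfB": 0, "orfC": 0}

# Genome-like population: every gene weighted once.
genome = {"orfA": 1, "orfB": 1, "orfC": 1}
# Transcriptome-like population: made-up expression weights.
transcripts = {"orfA": 1.0, "orfB": 6.0, "orfC": 3.0}

mu_genome = weighted_average(has_tm, genome)            # 1 of 3 genes
mu_transcripts = weighted_average(has_tm, transcripts)  # 1.0 of 10.0 total weight
```

With unit weights the result reduces to the plain mean over the gene set; with expression weights, highly expressed ORFs dominate the average.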
The weighted averages of two populations [w, G] and [v, S] can be compared by simply looking at
their relative difference:
$$\Delta(F, [v, S], [w, G]) = \frac{\mu(F, [v, S]) - \mu(F, [w, G])}{\mu(F, [w, G])}$$
where v and w are the weights for the sets of ORFs S and G, respectively. We call Δ the "enrichment"
of feature F because it indicates whether F is enriched (if Δ is positive) or depleted (if Δ is