
Greenbaum et al 9

Enrichment of Features

Formalism

Figure 2 focuses on individual proteins. In the next part of our analysis, we group proteins into categories based on common features and characterize the features that are enriched in one population relative to another, i.e. the translatome population of proteins, as measured by 2D gels, relative to the transcriptome population of transcripts or the genome population of genes. To this end, we set up a formalism that can be applied uniformly to all the attributes we are interested in. Because of experimental limitations, the translatome, transcriptome, and genome populations are defined on different sets of genes, and we sometimes want to remove this "selection bias" by forcing them to be compared on exactly the same set of genes. This is a key aspect of our formalism, as presented in Figure 1.

We call an entity like [w, G] a "population", where G is a set describing a particular selection of genes from the genome and w is a vector of weights, one for each element of G. In particular, we focus on three main populations here:

(i) [1, G_Gen] is the population of genes in the genome, with all 6280 genes weighted once (w = 1).
(ii) [w_mRNA, G_mRNA] is the observed population of transcripts in the transcriptome, i.e. the 6249 genes in the reference expression set weighted by their reference expression value.
(iii) [w_Prot, G_Prot] is the observed cellular population of proteins in the translatome, i.e. the 181 genes in the reference abundance set weighted by their reference abundance value.

(The set of genes in the genome, G_Gen, is approximately equal to the set G_mRNA, so we use the two symbols interchangeably.) We can also use this notation to describe specific experiments, e.g. [w_lacZ, G_lacZ] describes the gene set and weights relating to the Transposon Abundance set.
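As an illustration only, a population [w, G] can be represented as a mapping from gene identifiers to weights, and the "selection bias" can be removed by restricting several populations to the genes they share. The sketch below is a minimal one under stated assumptions: the `Population` class, the `restrict` and `uniform_population` helpers, and all gene names and weight values are hypothetical and are not taken from the reference data sets.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Population:
    """A population [w, G]: each gene in G mapped to its weight in w.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    weights: Dict[str, float]

    @property
    def genes(self) -> FrozenSet[str]:
        # The gene set G is simply the set of keys.
        return frozenset(self.weights)

    def restrict(self, gene_set: Iterable[str]) -> "Population":
        # Keep only genes in gene_set; weights are left unchanged.
        keep = set(gene_set)
        return Population({g: w for g, w in self.weights.items() if g in keep})


def uniform_population(genes: Iterable[str]) -> Population:
    # Analogue of [1, G_Gen]: every gene weighted once (w = 1).
    return Population({g: 1.0 for g in genes})


# Toy data (invented values): genome, transcriptome, and translatome analogues.
genome = uniform_population(["geneA", "geneB", "geneC", "geneD"])
transcriptome = Population({"geneA": 0.5, "geneC": 12.0, "geneD": 30.0})
translatome = Population({"geneC": 8.0, "geneD": 20.0})

# Compare all three populations on exactly the same set of genes.
shared = genome.genes & transcriptome.genes & translatome.genes
genome_shared = genome.restrict(shared)
transcriptome_shared = transcriptome.restrict(shared)
translatome_shared = translatome.restrict(shared)
```

Restricting to the intersection keeps each population's own weights and changes only the gene set, which is one plausible reading of comparing populations "on exactly the same set of genes".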
Furthermore, we define F_j as the value of a feature F in ORF j. For example, F could be the composition of leucine (a real number) or a binary value (0 or 1) indicating whether an ORF contains a trans-membrane segment. Given these definitions, the weighted average of feature F in population [w, G] is:

$$\mu(F, [w, G]) \equiv \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$

The weighted averages of two populations [w, G] and [v, S] can be compared by simply looking at their relative difference:

$$\Delta(F, [v, S], [w, G]) = \frac{\mu(F, [v, S]) - \mu(F, [w, G])}{\mu(F, [w, G])}$$

where v and w are weights for the sets of ORFs S and G respectively. We call Δ the "enrichment" of feature F because it indicates whether F is enriched (if Δ is positive) or depleted (if Δ is