![]() |
|
Title: Simulation
study of the Gene set analysis algorithms
Contents
Supplementary
document
1.1.
Overview
1.2.
Background
1.5.
References
1.1 Overview (Top)
Gene
set analysis focuses on the behavior of a gene set rather than individual genes
and has been successfully applied to the identification of altered expression
patterns and signaling pathways using gene expression data. There are several
gene set analysis algorithms and programs available to researchers
(Supplementary Table 1), and thus we performed a series of simulations to
compare algorithms based on the z-test, t-test, gene permutation or sample
permutation. We conclude that gene permutation methods give better overall
performance. If, however, a sample permutation method is chosen, the
t-value-based parametric method is superior to gene set enrichment analysis
methods.
1.2 Background (Top)
Recent developments in
high throughput technologies such as transcriptomics, proteomics, and
metabolomics have spurred the development of various tools to help researchers
obtain biological insight from the large amounts of data generated by these
techniques. Gene set analysis, as implemented by Mootha et al. [1], includes
the analysis of gene expression differences in muscle from normal and diabetic
patients, and is one such tool that has proven useful in interpreting
microarray gene expression data.
The basic idea of gene set analysis is to
compare two groups by looking at changes in sets of genes, rather than looking at
the individual gene level. This type of
analysis compensates for the fact that small changes not seen at the gene level
are often detected when the gene set as a whole is examined. Another important
aspect of gene set analysis is the use of the entire gene set rather than a few
hundred selected genes from statistical analysis. These two concepts, analysis
at the gene set level and the use of functional class scoring instead of a
threshold gene list, were first introduced by Pavlidis et al. [2]. Inspired by
Mootha et al. [1], several researchers have developed algorithms (Supplementary Table 1) for gene set analysis [3-10]. The gene set
enrichment analysis (GSEA) have recently incorporated into a web server
application (Babelomics, http://babelomics.bioinfo.cipf.es)
[11] and the microarray data analysis package (BRB-Array Tools, http://linus.nci.nih.gov/BRB-ArrayTools.html)
[12].
Supplementary Table 1. Comparison of Gene set analysis tools
|
|
GAzer |
PAGE |
GSEA |
ErmineJ |
MEGO |
Catmap |
T-profiler |
|
Used
Statistics |
Fold
Change, T-statistic, Rank |
Fold Change |
Rank |
P-value |
Fold change |
P-value |
T-statistic |
|
Statistical
test |
Z-test, Permutation
(gene, sample) |
z-test |
permutation (sample) |
permutation (gene) |
- |
permutation (sample, gene) |
t-test |
|
Speed |
Depend on
method -fast,
slow |
Fast |
Slow |
Moderate |
Fast |
Slow |
Fast |
|
Standalone
software |
NO |
YES |
YES |
YES |
YES |
YES |
NO |
|
GUI |
YES |
NO |
YES |
YES |
YES |
NO |
NO |
|
Web
server |
YES |
NO |
NO |
NO |
NO |
NO |
YES |
|
Organism |
Human, mouse, rat, yeast |
Human, mouse |
Human, mouse |
Human, mouse, rat |
Human |
Human |
Yeast |
|
Gene
Sets |
GO Pathways Chromosomes InterPro
Domain Composite
GO cis-regulatory
elements |
GO Pathways Chromosomes |
GO Pathways Chromosomes |
GO Pathways Chromosomes |
GO Pathways Chromosomes |
GO Pathways Chromosomes |
GO Pathways Chromosomes |
Although recently
developed gene set analysis algorithms use the same basic ideas, each algorithm
uses a different statistical inference method to evaluate the significance of
the calculated test statistic. The GSEA algorithm implemented by Mootha et al.
uses sample permutation [1,8], whereas others use either gene permutation
[2,7], gene and sample permutation [9], the z-test [6] or t-test
[4]. The choice of a statistical inference method is important because it
affects factors such as statistical sensitivity, specificity, computing speed,
and ease of implementation. The algorithms also differ in their choice of the
test statistic used for analysis. The original GSEA used ranks obtained from
signal to noise ratios between two groups, but the modified GSEA algorithm
allows for the weighting of each rank by the degree of correlation with
clinical information [1,8]. Other algorithms use t-values [4,9], p-values
[2,7], or log2-transformed fold change values between two groups [6] as the
main test statistic.
The increased use of
gene set analysis has created a need for an objective guide to choose between
available gene set analysis algorithms. This should benefit both software
developers and end users. Toward this goal, we have compared gene set analysis algorithms
and examined their statistical sensitivity by performing a series of
simulations. The simulations reveal
that gene permutation methods give better overall performance and, among sample
permutation methods, t-value-based parametric method is superior to rank-based
gene set enrichment analysis method. We include a few practical guidelines.
1.3 Materials and Methods (Top)
1.3.1 Algorithms
A total of six algorithms
were compared: the original GSEA (GSEA_O) [1], the new GSEA (GSEA_N) [8], the
gene permutation (NT) and sample permutation (ET) of Tian et al. [9], t-test
of T-Profiler [4], and z-test (parametric analysis of gene set
enrichment) with fold change values (PAGE) [6]. The R code for the old and new
GSEA was downloaded from the GSEA website (http://www.broad.mit.edu/gsea/software/software_index.html).
The other algorithms were implemented as described in the original papers
[4,6,9]. For algorithms that required them, 1000 permutations (either gene or
sample) were performed. The R scripts that implemented these algorithms are given
as Supplementary file 1 (comparing algorithms),
and file 2 (gene-gene
correlation).

Supplementary Figure 1. A schematic diagram of the simulation study. Two groups of sample size S
and gene number (N-G) are prepared by random sampling from the standard normal
distribution. The control group is then added by a data matrix of sample size S
and gene set size G prepared by random sampling from the standard normal
distribution, whereas the treatment group is added by a data matrix in which
all genes in a gene set G are added according to a predefined fold change value
(FC). The
purpose of the simulation study is to determine how well the gene set G—with a
predefined fold change value FC—is identified as significant by each algorithm.
1.3.2 Simulation
study
Five variables were
tested in a series of simulations: the total number of genes in a data set (N), gene set size
(G), sample size (S), fold change between two groups (FC), and the percentage
of coordinately changing genes in a gene set (PCG). In each
simulation, four variables were fixed and the remaining one was varied. Two
groups, control and treatment, were constructed by randomly sampling from the
standard normal distribution. With the total number of genes N and the gene set
size G, two groups were sampled equally from the standard normal distribution
for (N-G) genes. For G genes, after sampling from the standard normal
distribution, the fold change (FC) was added only to genes in the treatment
group (Supplementary
Fig. 1). Then, each algorithm was applied to the
simulated data set to produce gene set scores. Gene set scores from the six
permutation-based algorithms were standardized by the following formula [9].
When the gene set score of a simulated data is GS1 and those of permuted data
sets are GSP = (GS2 GS3 ¡¦. GSk),

A p-value from the
T-profiler algorithm was converted into a corresponding z-value by
applying a cumulative standard normal distribution function. No standardization
was necessary for gene set scores obtained from the two z-tests (PAGE
and PAGE_t). In addition, we investigated the effects of gene-gene
co-regulation on the performance of each algorithm with simulated data sets in
which genes within a gene set had varying degrees of correlation to each other.
1.4 Results and Discussion (Top)
1.4.1 The
1¡¯st simulation test for the effect of fold change
The first simulation
study compared each algorithm's ability to detect a gene set as significant, as
fold change values between two groups were varied from 0 to 1 by increments of
0.05. As expected, higher fold change values resulted in higher z-scores for
all tested algorithms (Supplementary
Fig. 2). However, the difference in the performance of
each algorithm became evident as fold change values between two groups were
increased. The best performance was obtained from the PAGE, and NT algorithms;
ET, T-profiler, and GSEA_O_M algorithms followed, and the GSEA methods gave the
lowest performance level (Supplementary
Fig. 2a). The pattern remained essentially the same when
the sample size was increased from 10 to 50 (Supplementary Fig. 2b).

Supplementary Figure 2. Comparison of the performance of each algorithm as the fold change (FC)
value between two groups is varied from 0.0 to 1.0 by increments of 0.05. Six
algorithms were compared. N: total genes in a data set, S: sample size in each
group, G: a gene set size, and PCG: a percentage of coordinately changing genes
in a gene set G. Specific parameter values are shown in each of panels A and B.
1.4.2 The
2¡¯nd simulation test for the effect of sample size
In the second simulation
study, sample size was changed from two to ten by increments of one, and then from
10 to 100 by increments of 10 (Supplementary
Fig. 3). The pattern of performance remained unchanged,
but there was a noticeable difference among PAGE and NT algorithms when the
sample size was below five (Supplementary
Fig. 3a). Using sample permutation methods is also
discouraged at such a low sample size [8].

Supplementary Figure 3. Comparison of the performance of each algorithm as the sample size (S) of
a group or the gene set size (G) is varied. N: total genes in a data set, S:
sample size in each group, G: a gene set size, and PCG: a percentage of
coordinately changing genes in a gene set G. Specific parameter values are
shown in each of panels A–D.
1.4.3 The
3¡¯rd simulation test for the effect of gene set size
In the third simulation,
gene set size was changed from 10 to 100 by increments of 10 at two different sample
sizes (10 and 100). Again, the pattern of performance generally remained
unchanged, but the difference in performance among gene permutation or z-test
algorithms (NT and PAGE) and sample permutation algorithms (ET and two GSEA
algorithms), became more significant. Essentially, sample permutation
algorithms were not responsive to increased gene set size, whereas gene
permutation or z-test methods gave increased performance.
1.4.4 The
4¡¯th simulation test for the effect of the percent of coordinately changed
genes
In the last simulation,
the percent of coordinately changed genes was varied from 0.1 to 0.3 by
increments of 0.02 (Supplementary
Fig. 4). The result showed that PAGE, NT and ET methods
performed better in the change of sample number.

Supplementary Figure 4. Comparison of
the performance of each algorithm as the percentage of coordinately changing
genes is varied from 0.1 to 0.3 by increments of 0.02. N: total genes in a data set, S: sample size in each group, G: a gene set
size, and PCG: a percentage of coordinately changing genes in a gene set G. Specific parameter
values are shown in each of panels A and B.
1.4.5 The 5¡¯th
simulation test for the effect of gene-gene co-regulation
Above simulations were performed in a
condition in which data sets were prepared by random sampling from standard
normal distribution; however, in reality, many genes are co-regulated, and
several studies pointed out that the violation of the independency of each gene
is a non-trivial problem in gene permutation based approaches [5, 7-9, 12]. To
investigate the effects of gene-gene co-regulation on the performance of each
algorithm, we prepared simulated data sets in which genes within a gene set had
varying degrees of correlation to each other.


Supplementary
Figure 5. Effects of gene-gene
correlation within a gene set on the performance of each algorithm. A.
Schematic diagram of the simulation study incorporating gene-gene correlation
within a gene set. Within a gene set (G), one gene (G1) was prepared by random
sampling from standard normal distribution (step 1). Then, the other genes
(G2..GG) in the gene set were prepared by a linear combination of two vectors -
G1 and a second vector prepared by random sampling (step 2). The degree of gene-gene correlation was adjusted by a weight
c. Finally, fold change (a) was added to the genes in the treatment
samples (step 3). B. The degree of gene-gene correlation within a gene set is
proportional to the weight c. Mean correlation was calculated by averaging all
pair-wise correlation between genes within a gene set. C-F. Effects of
gene-gene correlation on the performance of each algorithm. Specific parameter
values are shown in each of panels C-F.
Supplementary Fig. 5a is a schematic diagram
describing how correlated data sets were prepared and Fig. 5b shows that the
mean correlation between genes within a data set increases proportional to the
weight c. We then investigated the performance of each algorithm in various
settings as the weight c varied from 0 (no correlation) to 0.9 (high
correlation) (Supplementary Fig. 5c-f). Two features were noticeable. First,
T-profiler algorithm was most affected as the weight c approached 1. Second,
sample permutation based algorithms (ET, GSEA_N, and GSEA_O) showed a tendency
to produce lower z-scores while gene
permutation based algorithms (PAGE and NT) showed tendency to produce higher z-scores as gene-gene correlation
increased (Supplementary Fig. 5c-f). The first observation is easily
understandable because T-profiler is based on t-test between genes within a gene set (G) and genes outside the
gene set (N-G; see Supplementary Fig. 1); higher correlation between genes
within a gene set will decrease the standard deviation of the group G and thus
result in higher t-value. The reason
for the second observation is at present unclear, but one obvious result is the
bigger difference between gene permutation based algorithms and sample
permutation based algorithms as genes are more correlated.
Our results show that,
in general, gene permutation or z-test based algorithms performed better
than sample permutation or t-test based methods. Among sample permutation
methods, the ET algorithm of Tian et al. [9] performed better than either the
old or new GSEA methods [8].
The nature of the
statistical test, whether gene or sample permutation, z-test, or t-test,
is the most important difference between algorithms. Simulations in this work
showed that z-test or gene permutation methods almost always performed
better than sample permutation methods. Contrary to previous findings, improved
performance of gene permutation methods was attributed to the generally much
larger number of genes compared with sample size rather than to gene
correlation (Supplementary
Fig. 4).
The second important
difference between algorithms is the test statistic used in the analysis. The
original GSEA method uses only the rank of expression differences between two
groups to calculate the enrichment score. Later studies showed that parametric
approaches using the actual fold change or t-statistic increase the
statistical sensitivity of gene set analysis [6,8,9]. Our simulation again confirmed
this point. Parametric approaches using either fold change or the t-statistic
performed better than the rank-based old GSEA algorithm.
In summary, this simulation study suggests
several points. First, gene permutation or z-test methods generally perform
better than sample permutation or t-test methods because the number of
genes is much larger than sample size. Second, when sample size is small,
sample permutation methods are impractical; only gene permutation or z-test
methods are suitable. Third, when sample size is less than five, using fold
change as an input yields better performance than using the t-statistic;
when sample size is more than five, fold change values and the t-statistic
produce similar results. Fourth, among sample permutation–based methods, the
method of Tian et al. [9] always
performs better than either the old or new GSEA methods [1,8]. Finally, gene
permutation based methods (PAGE and NT) and sample permutation based methods
(ET, GSEA_N, and GSEA_O) showed opposite tendencies in their performances as
gene-gene correlation within a gene set increased. Users should caution this
point.
1.5 References (Top)
[1] Mootha, V.K. et al. (2003) PGC-1alpha-responsive genes
involved in oxidative phosphorylation are coordinately downregulated in human
diabetes. Nat Genet 34,267-73.
[2] Pavlidis, P., Lewis, D.P. and Noble, W.S. (2002)
Exploring gene expression data with class scores. Pac Symp Biocomput, 474-85.
[3] Al-Shahrour, F., Diaz-Uriarte, R. and Dopazo, J.
(2005) Discovering molecular functions significantly related to phenotypes by
combining gene expression data and biological information. Bioinformatics 21,
2988-93.
[4] Boorsma, A., Foat, B.C.,
[5] Breslin, T.,
[6] Kim, S.Y. and Volsky, D.J. (2005) PAGE: parametric
analysis of gene set enrichment. BMC Bioinformatics 6, 144.
[7] Lee, H.K., Braynen, W., Keshav, K. and Pavlidis, P.
(2005) ErmineJ: tool for functional analysis of gene expression data sets. BMC
Bioinformatics 6, 269.
[8] Subramanian, A. et al. (2005) Gene set enrichment
analysis: a knowledge-based approach for interpreting genome-wide expression
profiles. Proc Natl Acad Sci U S A 102, 15545-50.
[9] Tian, L.,
[10] Tu, K., Yu, H. and Zhu, M. (2005) MEGO: gene
functional module expression based on gene ontology. Biotechniques 38, 277-83.
[11] Al-Shahrour, F., Minguez, P., Vaquerizas, J.M.,
Conde, L. and Dopazo, J. (2005) BABELOMICS: a suite of web tools for functional
annotation and analysis of groups of genes in high-throughput experiments. Nucleic
Acids Res 33, W460-4.
[12] Barry, W.T., Nobel, A.B. and Wright, F.A. (2005)
Significance analysis of functional categories in gene expression studies: a
structured permutation approach. Bioinformatics 21, 1943-9.
|