Supplementary document

 

 

Title: Simulation study of the Gene set analysis algorithms

 

 

Contents    Supplementary document

 

1.1. Overview

1.2. Background

1.3. Materials and Methods

1.4. Results and Discussion

1.5. References

 

 

 

1.1 Overview (Top)

 

Gene set analysis focuses on the behavior of a gene set rather than individual genes and has been successfully applied to the identification of altered expression patterns and signaling pathways using gene expression data. There are several gene set analysis algorithms and programs available to researchers (Supplementary Table 1), and thus we performed a series of simulations to compare algorithms based on the z-test, t-test, gene permutation or sample permutation. We conclude that gene permutation methods give better overall performance. If, however, a sample permutation method is chosen, the t-value-based parametric method is superior to gene set enrichment analysis methods.

 

 

1.2 Background (Top)

 

Recent developments in high throughput technologies such as transcriptomics, proteomics, and metabolomics have spurred the development of various tools to help researchers obtain biological insight from the large amounts of data generated by these techniques. Gene set analysis, as implemented by Mootha et al. [1], includes the analysis of gene expression differences in muscle from normal and diabetic patients, and is one such tool that has proven useful in interpreting microarray gene expression data.

 

The basic idea of gene set analysis is to compare two groups by looking at changes in sets of genes, rather than looking at the individual gene level. This type of analysis compensates for the fact that small changes not seen at the gene level are often detected when the gene set as a whole is examined. Another important aspect of gene set analysis is the use of the entire gene set rather than a few hundred selected genes from statistical analysis. These two concepts, analysis at the gene set level and the use of functional class scoring instead of a threshold gene list, were first introduced by Pavlidis et al. [2]. Inspired by Mootha et al. [1], several researchers have developed algorithms (Supplementary Table 1) for gene set analysis [3-10]. The gene set enrichment analysis (GSEA) have recently incorporated into a web server application (Babelomics, http://babelomics.bioinfo.cipf.es) [11] and the microarray data analysis package (BRB-Array Tools, http://linus.nci.nih.gov/BRB-ArrayTools.html) [12].

 

Supplementary Table 1. Comparison of Gene set analysis tools

 

GAzer

PAGE

GSEA

ErmineJ

MEGO

Catmap

T-profiler

Used Statistics

Fold Change,

T-statistic,

Rank

Fold Change

Rank

P-value

Fold change

P-value

T-statistic

Statistical test

Z-test,

Permutation (gene, sample)

z-test

permutation

(sample)

permutation

(gene)

-

permutation

(sample, gene)

t-test

Speed

Depend on method

-fast, slow

Fast

Slow

Moderate

Fast

Slow

Fast

Standalone software

NO

YES

YES

YES

YES

YES

NO

GUI

YES

NO

YES

YES

YES

NO

NO

Web server

YES

NO

NO

NO

NO

NO

YES

Organism

Human, mouse, rat, yeast

Human, mouse

Human, mouse

Human, mouse, rat

Human

Human

Yeast

Gene Sets

GO

Pathways

Chromosomes

InterPro Domain

Composite GO

cis-regulatory elements

GO

Pathways

Chromosomes

GO

Pathways

Chromosomes

GO

Pathways

Chromosomes

GO

Pathways

Chromosomes

GO

Pathways

Chromosomes

GO

Pathways

Chromosomes

 

 

Although recently developed gene set analysis algorithms use the same basic ideas, each algorithm uses a different statistical inference method to evaluate the significance of the calculated test statistic. The GSEA algorithm implemented by Mootha et al. uses sample permutation [1,8], whereas others use either gene permutation [2,7], gene and sample permutation [9], the z-test [6] or t-test [4]. The choice of a statistical inference method is important because it affects factors such as statistical sensitivity, specificity, computing speed, and ease of implementation. The algorithms also differ in their choice of the test statistic used for analysis. The original GSEA used ranks obtained from signal to noise ratios between two groups, but the modified GSEA algorithm allows for the weighting of each rank by the degree of correlation with clinical information [1,8]. Other algorithms use t-values [4,9], p-values [2,7], or log2-transformed fold change values between two groups [6] as the main test statistic.

 

The increased use of gene set analysis has created a need for an objective guide to choose between available gene set analysis algorithms. This should benefit both software developers and end users. Toward this goal, we have compared gene set analysis algorithms and examined their statistical sensitivity by performing a series of simulations. The simulations reveal that gene permutation methods give better overall performance and, among sample permutation methods, t-value-based parametric method is superior to rank-based gene set enrichment analysis method. We include a few practical guidelines.

 

 

1.3 Materials and Methods (Top)

 

1.3.1 Algorithms

 

A total of six algorithms were compared: the original GSEA (GSEA_O) [1], the new GSEA (GSEA_N) [8], the gene permutation (NT) and sample permutation (ET) of Tian et al. [9], t-test of T-Profiler [4], and z-test (parametric analysis of gene set enrichment) with fold change values (PAGE) [6]. The R code for the old and new GSEA was downloaded from the GSEA website (http://www.broad.mit.edu/gsea/software/software_index.html). The other algorithms were implemented as described in the original papers [4,6,9]. For algorithms that required them, 1000 permutations (either gene or sample) were performed. The R scripts that implemented these algorithms are given as Supplementary file 1 (comparing algorithms), and file 2 (gene-gene correlation).

 

Supplementary Figure 1. A schematic diagram of the simulation study. Two groups of sample size S and gene number (N-G) are prepared by random sampling from the standard normal distribution. The control group is then added by a data matrix of sample size S and gene set size G prepared by random sampling from the standard normal distribution, whereas the treatment group is added by a data matrix in which all genes in a gene set G are added according to a predefined fold change value (FC). The purpose of the simulation study is to determine how well the gene set G—with a predefined fold change value FC—is identified as significant by each algorithm.

 

 

1.3.2 Simulation study

 

Five variables were tested in a series of simulations: the total number of genes in a data set (N), gene set size (G), sample size (S), fold change between two groups (FC), and the percentage of coordinately changing genes in a gene set (PCG). In each simulation, four variables were fixed and the remaining one was varied. Two groups, control and treatment, were constructed by randomly sampling from the standard normal distribution. With the total number of genes N and the gene set size G, two groups were sampled equally from the standard normal distribution for (N-G) genes. For G genes, after sampling from the standard normal distribution, the fold change (FC) was added only to genes in the treatment group (Supplementary Fig. 1). Then, each algorithm was applied to the simulated data set to produce gene set scores. Gene set scores from the six permutation-based algorithms were standardized by the following formula [9]. When the gene set score of a simulated data is GS1 and those of permuted data sets are GSP = (GS2 GS3 ¡¦. GSk),

 

A p-value from the T-profiler algorithm was converted into a corresponding z-value by applying a cumulative standard normal distribution function. No standardization was necessary for gene set scores obtained from the two z-tests (PAGE and PAGE_t). In addition, we investigated the effects of gene-gene co-regulation on the performance of each algorithm with simulated data sets in which genes within a gene set had varying degrees of correlation to each other.

 

 

1.4 Results and Discussion (Top)

 

1.4.1 The 1¡¯st simulation test for the effect of fold change

 

The first simulation study compared each algorithm's ability to detect a gene set as significant, as fold change values between two groups were varied from 0 to 1 by increments of 0.05. As expected, higher fold change values resulted in higher z-scores for all tested algorithms (Supplementary Fig. 2). However, the difference in the performance of each algorithm became evident as fold change values between two groups were increased. The best performance was obtained from the PAGE, and NT algorithms; ET, T-profiler, and GSEA_O_M algorithms followed, and the GSEA methods gave the lowest performance level (Supplementary Fig. 2a). The pattern remained essentially the same when the sample size was increased from 10 to 50 (Supplementary Fig. 2b).

 

Supplementary Figure 2. Comparison of the performance of each algorithm as the fold change (FC) value between two groups is varied from 0.0 to 1.0 by increments of 0.05. Six algorithms were compared. N: total genes in a data set, S: sample size in each group, G: a gene set size, and PCG: a percentage of coordinately changing genes in a gene set G. Specific parameter values are shown in each of panels A and B.

 

1.4.2 The 2¡¯nd simulation test for the effect of sample size

 

In the second simulation study, sample size was changed from two to ten by increments of one, and then from 10 to 100 by increments of 10 (Supplementary Fig. 3). The pattern of performance remained unchanged, but there was a noticeable difference among PAGE and NT algorithms when the sample size was below five (Supplementary Fig. 3a). Using sample permutation methods is also discouraged at such a low sample size [8].

 

 

 

Supplementary Figure 3. Comparison of the performance of each algorithm as the sample size (S) of a group or the gene set size (G) is varied. N: total genes in a data set, S: sample size in each group, G: a gene set size, and PCG: a percentage of coordinately changing genes in a gene set G. Specific parameter values are shown in each of panels A–D.

 

 

1.4.3 The 3¡¯rd simulation test for the effect of gene set size

 

In the third simulation, gene set size was changed from 10 to 100 by increments of 10 at two different sample sizes (10 and 100). Again, the pattern of performance generally remained unchanged, but the difference in performance among gene permutation or z-test algorithms (NT and PAGE) and sample permutation algorithms (ET and two GSEA algorithms), became more significant. Essentially, sample permutation algorithms were not responsive to increased gene set size, whereas gene permutation or z-test methods gave increased performance.

 

1.4.4 The 4¡¯th simulation test for the effect of the percent of coordinately changed genes

 

In the last simulation, the percent of coordinately changed genes was varied from 0.1 to 0.3 by increments of 0.02 (Supplementary Fig. 4). The result showed that PAGE, NT and ET methods performed better in the change of sample number.

 

 

Supplementary Figure 4. Comparison of the performance of each algorithm as the percentage of coordinately changing genes is varied from 0.1 to 0.3 by increments of 0.02. N: total genes in a data set, S: sample size in each group, G: a gene set size, and PCG: a percentage of coordinately changing genes in a gene set G. Specific parameter values are shown in each of panels A and B.

 

1.4.5 The 5¡¯th simulation test for the effect of gene-gene co-regulation

 

Above simulations were performed in a condition in which data sets were prepared by random sampling from standard normal distribution; however, in reality, many genes are co-regulated, and several studies pointed out that the violation of the independency of each gene is a non-trivial problem in gene permutation based approaches [5, 7-9, 12]. To investigate the effects of gene-gene co-regulation on the performance of each algorithm, we prepared simulated data sets in which genes within a gene set had varying degrees of correlation to each other.

 

 

Supplementary Figure 5. Effects of gene-gene correlation within a gene set on the performance of each algorithm. A. Schematic diagram of the simulation study incorporating gene-gene correlation within a gene set. Within a gene set (G), one gene (G1) was prepared by random sampling from standard normal distribution (step 1). Then, the other genes (G2..GG) in the gene set were prepared by a linear combination of two vectors - G1 and a second vector prepared by random sampling (step 2). The degree of gene-gene correlation was adjusted by a weight c. Finally, fold change (a) was added to the genes in the treatment samples (step 3). B. The degree of gene-gene correlation within a gene set is proportional to the weight c. Mean correlation was calculated by averaging all pair-wise correlation between genes within a gene set. C-F. Effects of gene-gene correlation on the performance of each algorithm. Specific parameter values are shown in each of panels C-F.

 

 

Supplementary Fig. 5a is a schematic diagram describing how correlated data sets were prepared and Fig. 5b shows that the mean correlation between genes within a data set increases proportional to the weight c. We then investigated the performance of each algorithm in various settings as the weight c varied from 0 (no correlation) to 0.9 (high correlation) (Supplementary Fig. 5c-f). Two features were noticeable. First, T-profiler algorithm was most affected as the weight c approached 1. Second, sample permutation based algorithms (ET, GSEA_N, and GSEA_O) showed a tendency to produce lower z-scores while gene permutation based algorithms (PAGE and NT) showed tendency to produce higher z-scores as gene-gene correlation increased (Supplementary Fig. 5c-f). The first observation is easily understandable because T-profiler is based on t-test between genes within a gene set (G) and genes outside the gene set (N-G; see Supplementary Fig. 1); higher correlation between genes within a gene set will decrease the standard deviation of the group G and thus result in higher t-value. The reason for the second observation is at present unclear, but one obvious result is the bigger difference between gene permutation based algorithms and sample permutation based algorithms as genes are more correlated.

 

Our results show that, in general, gene permutation or z-test based algorithms performed better than sample permutation or t-test based methods. Among sample permutation methods, the ET algorithm of Tian et al. [9] performed better than either the old or new GSEA methods [8].

 

The nature of the statistical test, whether gene or sample permutation, z-test, or t-test, is the most important difference between algorithms. Simulations in this work showed that z-test or gene permutation methods almost always performed better than sample permutation methods. Contrary to previous findings, improved performance of gene permutation methods was attributed to the generally much larger number of genes compared with sample size rather than to gene correlation (Supplementary Fig. 4).

 

The second important difference between algorithms is the test statistic used in the analysis. The original GSEA method uses only the rank of expression differences between two groups to calculate the enrichment score. Later studies showed that parametric approaches using the actual fold change or t-statistic increase the statistical sensitivity of gene set analysis [6,8,9]. Our simulation again confirmed this point. Parametric approaches using either fold change or the t-statistic performed better than the rank-based old GSEA algorithm.

 

In summary, this simulation study suggests several points. First, gene permutation or z-test methods generally perform better than sample permutation or t-test methods because the number of genes is much larger than sample size. Second, when sample size is small, sample permutation methods are impractical; only gene permutation or z-test methods are suitable. Third, when sample size is less than five, using fold change as an input yields better performance than using the t-statistic; when sample size is more than five, fold change values and the t-statistic produce similar results. Fourth, among sample permutation–based methods, the method of Tian et al. [9] always performs better than either the old or new GSEA methods [1,8]. Finally, gene permutation based methods (PAGE and NT) and sample permutation based methods (ET, GSEA_N, and GSEA_O) showed opposite tendencies in their performances as gene-gene correlation within a gene set increased. Users should caution this point.

 

 

1.5 References (Top)

 

[1] Mootha, V.K. et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34,267-73.

[2] Pavlidis, P., Lewis, D.P. and Noble, W.S. (2002) Exploring gene expression data with class scores. Pac Symp Biocomput, 474-85.

[3] Al-Shahrour, F., Diaz-Uriarte, R. and Dopazo, J. (2005) Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics 21, 2988-93.

[4] Boorsma, A., Foat, B.C., Vis, D., Klis, F. and Bussemaker, H.J. (2005) T-profiler:scoring the activity of predefined groups of genes using gene expression data.Nucleic Acids Res 33, W592-5.

[5] Breslin, T., Eden, P. and Krogh, M. (2004) Comparing functional annotation analyses with Catmap. BMC Bioinformatics 5, 193.

[6] Kim, S.Y. and Volsky, D.J. (2005) PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6, 144.

[7] Lee, H.K., Braynen, W., Keshav, K. and Pavlidis, P. (2005) ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics 6, 269.

[8] Subramanian, A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-50.

[9] Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J.(2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A 102, 13544-9.

[10] Tu, K., Yu, H. and Zhu, M. (2005) MEGO: gene functional module expression based on gene ontology. Biotechniques 38, 277-83.

[11] Al-Shahrour, F., Minguez, P., Vaquerizas, J.M., Conde, L. and Dopazo, J. (2005) BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Res 33, W460-4.

[12] Barry, W.T., Nobel, A.B. and Wright, F.A. (2005) Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 21, 1943-9.

 

 

Korean Bio Information Center 52 Eoeun-dong,Yuseong-gu,Daejeon, 305-333, Korea Korean BioInformation Center(KOBIC)
TEL: 82-42-879-8511 FAX: 82-42-879-8519 www@kobic.re.kr