Performing the GSEA analysis
Once you have opened the GSEA software, the first thing to do is load the data files (.gct and .cls) you created previously.
To do this, click Load data in the left panel and then choose Method 1: Browse for files.
A pop-up window will open, showing your local folders. Find your GSEA files (the data file, .gct, and the phenotype label file, .cls), select both of them, and load them into the software.
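If loading fails, it can help to check the two files outside of GSEA first. Below is a minimal Python sketch of such a check, assuming the standard layouts: a .gct file starts with the line #1.2, then a line with the number of rows and samples, then a header line beginning with NAME and Description; a .cls file has a line with the number of samples and classes, a line of class names starting with #, and a line with one label per sample. The function names and the tiny example data are made up for illustration and are not part of GSEA.

```python
def read_gct(text):
    """Parse GCT-formatted text and return (row_names, sample_names)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or lines[0].strip() != "#1.2":
        raise ValueError("expected '#1.2' on the first line of a .gct file")
    n_rows, n_samples = (int(x) for x in lines[1].split()[:2])
    header = lines[2].split("\t")
    sample_names = header[2:]  # the first two columns are NAME and Description
    rows = [ln.split("\t") for ln in lines[3:]]
    if len(sample_names) != n_samples:
        raise ValueError("sample count does not match the second line")
    if len(rows) != n_rows:
        raise ValueError("row count does not match the second line")
    return [r[0] for r in rows], sample_names


def read_cls(text):
    """Parse CLS-formatted text and return (class_names, labels)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("a .cls file needs three lines")
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(class_names) != n_classes or len(labels) != n_samples:
        raise ValueError("counts on the first line do not match the content")
    return class_names, labels


# Hypothetical example data: two genes, two treated and two control samples.
GCT_TEXT = "\n".join([
    "#1.2",
    "2\t4",
    "NAME\tDescription\tT1\tT2\tC1\tC2",
    "TP53\tna\t5.1\t4.9\t2.0\t2.2",
    "MYC\tna\t1.0\t1.2\t3.3\t3.1",
])
CLS_TEXT = "4 2 1\n# Treated Control\nTreated Treated Control Control"

genes, samples = read_gct(GCT_TEXT)
classes, labels = read_cls(CLS_TEXT)
# Every sample column in the .gct needs exactly one label in the .cls.
assert len(samples) == len(labels)
print(genes, samples, classes, labels)
```

The key check is the last one: the number of sample columns in the .gct file has to match the number of labels in the .cls file.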
Run GSEA - Set parameters
To run GSEA properly, you first need to select the run parameters that best fit your needs. This of course means that you need to know what you're doing; otherwise it doesn't make much sense, right? :)
In the Run GSEA window you can see that the adjustable parameters are divided into Required fields, Basic fields and Advanced fields. We will stay within the Required and Basic fields for this tutorial.
Expression dataset
In this drop-down menu, select the .gct file you loaded earlier.
Gene sets database
This is where you select the collection of gene sets you want to use, depending on the biological context you want to test your data against. On this page you may find detailed descriptions of all gene set collections available in the GSEA software.
I usually analyse my data with several collections and then see which results make the most sense for my particular biological question.
For this particular analysis, I have selected the Hallmark collection.
Number of permutations
To determine the statistical significance of the enrichment score, GSEA performs a set number of permutations (of phenotype labels or of gene sets, depending on the permutation type you choose). The recommended number, which is also the default, is 1000; more permutations mean a longer run time and more memory.
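To see why the number of permutations matters, here is a deliberately simplified Python sketch (not GSEA's actual code). It estimates a p-value as the fraction of permuted scores that are at least as large as the observed one, so with N permutations the smallest non-zero p-value you can resolve is 1/N. The function names are hypothetical, and the sketch assumes a positive observed score compared against null scores of the same sign.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of permutation scores at least as large as the observed score."""
    if not null_scores:
        raise ValueError("need at least one permutation score")
    hits = sum(1 for score in null_scores if score >= observed)
    return hits / len(null_scores)


def smallest_nonzero_p(n_permutations):
    """Resolution of a permutation p-value: one hit out of n permutations."""
    return 1 / n_permutations


# Toy null distribution with four permuted scores (illustrative values only).
toy_null = [0.1, 0.2, 0.6, 0.7]
print(empirical_p_value(0.5, toy_null))
print(smallest_nonzero_p(10), smallest_nonzero_p(1000))
```

With only 10 permutations the finest step is 0.1, which is far too coarse to judge significance; with 1000 it is 0.001.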
Phenotype labels
Here you select the phenotype label (.cls) file you loaded earlier. The pop-up window will offer two options; always choose the one that puts your investigated phenotype first and your control phenotype second (I will explain in the next post why this is important).
Collapse dataset to gene symbols
A very important parameter to set. What is it actually about?
Genes in expression data can be annotated in two ways: as gene names/symbols, or as microarray chip annotations, that is, probe identifiers (for expression data obtained by microarray analysis).
If your input data contains gene names, select false, which means that GSEA will use the input gene names as they are when generating the results.
On the other hand, if you have microarray chip annotations, select true, meaning that GSEA will "translate" those annotations, which are not informative for us, into gene names when generating the results.
In our analysis I selected false because my input data file already contains gene names.
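If you are not sure which kind of identifiers your file contains, a rough check like the Python sketch below may help. It is only a heuristic that I am assuming for illustration: it treats identifiers ending in _at (the usual Affymetrix probe ID pattern, e.g. 1007_s_at) as chip annotations and everything else as gene symbols. Other platforms use different ID styles, so always look at your data yourself. The function name and threshold are hypothetical.

```python
import re

# Assumption: Affymetrix-style probe IDs end in "_at" (e.g. "1007_s_at").
AFFY_PROBE = re.compile(r"^\S+_at$")


def suggest_collapse(identifiers, threshold=0.5):
    """Return True if most identifiers look like Affymetrix probe IDs."""
    if not identifiers:
        raise ValueError("no identifiers given")
    probe_like = sum(1 for ident in identifiers if AFFY_PROBE.match(ident))
    return probe_like / len(identifiers) > threshold


print(suggest_collapse(["1007_s_at", "1053_at", "117_at"]))  # probe-style IDs
print(suggest_collapse(["TP53", "MYC", "EGFR"]))             # gene symbols
```

A True result suggests setting Collapse dataset to gene symbols to true; a False result suggests the file already holds gene symbols.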
Permutation type
Another very important parameter, which you should set according to your input data. You can choose between phenotype and gene_set permutation. The best way to explain the difference between the two is to quote the GSEA Tutorial:
Phenotype. Random phenotypes are created by shuffling the phenotype labels on the samples. For each random phenotype, GSEA ranks the genes and calculates the enrichment score for all gene sets. These enrichment scores are used to create a null distribution from which the significance of the actual enrichment score (for the actual expression data and gene set) is calculated. This is the recommended method when there are at least seven (7) samples in each phenotype.
Gene_set. Random gene sets, size matched to the actual gene set, are created and their enrichment scores calculated. These enrichment scores are used to create a null distribution from which the significance of the actual enrichment score (for the actual gene set) is calculated. This method is useful when you have too few samples to do phenotype permutations (that is, when you have fewer than seven (7) samples in any phenotype).
If you have at least seven samples in each phenotype, phenotype permutation is recommended, because shuffling only the sample labels leaves the expression data untouched and therefore preserves the gene-gene correlations in the dataset.
If you have fewer than seven samples in any phenotype (as in our case here), you need to use gene_set permutation. This has important implications for your data analysis that you need to be aware of: the significance of the enrichment score is assessed less stringently, so you must apply a stricter FDR cut-off when analyzing your results (to be explained).
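The rule of thumb above is easy to write down as code. Here is a small Python sketch that counts the samples per phenotype from the label line of a .cls file and returns the permutation type suggested by the seven-sample rule. The function name is hypothetical and the sample counts in the example are made up.

```python
from collections import Counter

MIN_SAMPLES_PER_PHENOTYPE = 7  # threshold quoted from the GSEA Tutorial


def choose_permutation_type(labels):
    """Suggest 'phenotype' or 'gene_set' from per-phenotype sample counts."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two phenotypes")
    if min(counts.values()) >= MIN_SAMPLES_PER_PHENOTYPE:
        return "phenotype"
    return "gene_set"


# Hypothetical designs: 3 vs 3 samples, and 7 vs 8 samples.
print(choose_permutation_type(["Treated"] * 3 + ["Control"] * 3))
print(choose_permutation_type(["Treated"] * 7 + ["Control"] * 8))
```

Note that the smallest group decides: seven treated samples against six controls still falls back to gene_set.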
Chip platform
If you have microarray expression data and have set "Collapse dataset to gene symbols" to true, this is where you select the microarray platform and chip type used, so that GSEA can collapse the annotations to gene symbols properly.
Basic fields
Basically, all you will ever have to change here is the analysis name. I named my analysis SteemSTEM. This is useful because the folder containing all your GSEA results will carry this name. You may also change the destination folder where your results are stored, but this is not mandatory. By default, this folder is C:\Users\Username\gsea_home\output\date, and you can keep it there or put it wherever you like.
Running the GSEA
If you got to this part - congratulations! You can finally perform your GSEA analysis, and hope for the best! ;)
In the lower right part of the screen, click the Run button. On the left, the status will appear in blue letters: Running.
When the analysis is finished, you should get the "Success" message.
Otherwise, you may get an "Error" message in red letters, meaning that you messed something up. In that case, this is what you do:
In my endless days of learning this, it took me a lot of trial and error to finally reach the point where I could run the analysis, and when I saw the "Success" message on the screen, it was a real success to me!!!
Hope you enjoyed this tutorial, or at least the funny part of it :)
The next post will be dedicated to deciphering GSEA results.
Until then, relax and keep steemSTEM! ;)
 Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545-15550.
 GSEA User Guide