class: center, middle, inverse, title-slide .title[ # Lecture 9: Gene set enrichments ] .subtitle[ ## BE_22 Bioinformatics SS 21 ] .author[ ### January Weiner ] .date[ ### 2024-06-18 ] --- <!-- class:empty-slide,myinverse background-image:url(images/arnolfini.jpg) --> ## Gene set enrichment analysis The problem: how do you interpret a list of 1000 results? ---
--- class:empty-slide,myinverse background-image:url(images/desert2.jpg) .mytop[ The number of grains of sand in this picture accurately represents the number of approaches for functional analysis of gene expression. ] --- ## Plan for today * Gene set enrichment analysis * 1st generation: hypergeometric test, Fisher's exact test * 2nd generation: other algorithms * 3rd generation: pathway topology analysis --- ## Gene ontology database✪ .pull-left[ GO – Gene Ontology * Genes are assigned to GO categories * GO categories are assigned to each other * GO can be represented as a *directed acyclic graph* ] .pull-right[  ] --- ## Gene ontology database✪ * Three main categories: biological process (BP), molecular function (MF), cellular component (CC) * Each category is a so-called "directed acyclic graph" (DAG) * Each gene can be assigned to multiple categories * Each category can have multiple "parents" and "children" (so it is not a tree!) --- ## KEGG pathways✪ .pull-left[ KEGG – Kyoto Encyclopaedia of Genes and Genomes KEGG pathways - a manually curated collection of ~ 300 pathways ] .pull-right[  Minoru Kanehisa (金久 實), the creator of KEGG ] --- class:empty-slide,myinverse background-image:url(images/hsa04110.gse16873.png) --- ## Other gene sets Different lists of genes can be obtained from various sources. * Coexpression data sets: groups of genes that have correlated expression profiles -- * Results from another differential gene expression analysis experiment: groups of genes significant in a comparison -- * Targets of a particular miRNA determined computationally --- ## MSigDB – a meta database of gene sets✪ .pull-left[  ] --- ## Group of genes, and then what?✪ First, define DEGs: differentially expressed genes (e.g. `abs(log2FoldChange) > 1`, `adj_p_val < 0.001`) Second, define which genes are in a given gene set (GS), and which are not. .pull-left[ **No significant enrichment:** | |DEGs |non-DEGs|Fraction| |----|-----|--------|--------| |in GS|5 | 95| 5%| |non in GS|50 |1000|5%| |Fraction|9%|9.5%|| ] -- .pull-right[ **Significant enrichment:** | |DEGs |non-DEGs|Fraction| |----|-----|--------|--------| |in GS|50 | 60| 45%| |non in GS|70 |1000|7%| |Fraction|42%|6%|| ] How can we test it? --- .pull-left[ **No significant enrichment:** | |DEGs |non-DEGs|Fraction| |----|-----|--------|--------| |in GS|5 | 95| 5%| |non in GS|50 |1000|5%| |Fraction|9%|9.5%|| ] -- .pull-right[ **Significant enrichment:** | |DEGs |non-DEGs|Fraction| |----|-----|--------|--------| |in GS|50 | 60| 45%| |non in GS|70 |1000|7%| |Fraction|42%|6%|| ] -- .pull-left[ Results of `\(\chi^2\)` test: Left: <pre class="r-output"><code>## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: matrix(c(5, 95, 50, 1000), byrow = TRUE, nrow = 2) ## X-squared = 2.0686e-30, df = 1, p-value = 1 </code></pre> Right: <pre class="r-output"><code>## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: matrix(c(50, 60, 70, 1000), byrow = TRUE, nrow = 2) ## X-squared = 161.1, df = 1, p-value < 2.2e-16 </code></pre> ] --- ## Disadvantages of the generation 1 approach * You have to define DEGs... but that depends on the number of samples * the larger the number of samples, the higher statistical power * the higher statistical power, the more DEGs you can detect * therefore, the results depend on an arbitrary threshold * What to do if you have no DEGs at all? --- ## Generation 2 approaches – "Functional Class Sorting" * do not depend on arbitrary thresholds --- ## GSEA algorithm✪  --- ## GSEA algorithm * Calculate the (pseudo) Kolmogorov-Smirnov statistic for each gene set * The statistic has a value, but no associated p-value * Test the statistic and obtain a p-value by a permutation test --- ## Permutation tests in statistics * Normally, for a given statistic (like the t-test statistic or chi-squared statistic), you can calculate a p-value from the known distribution of the statistic * If you don't know the distribution, you can simulate it by permuting the data * Advantages: no assumptions about the distribution of the statistic * Disadvantages: computationally expensive, and more importantly: you need a sufficient number of samples --- ## tmod algorithm * Arrange the genes according to p-value or another similar metrics * Test whether genes from a given gene set are denser on the beginning of the list --- class:empty-slide,myinverse background-image:url(images/img21.png) --- class:empty-slide,myinverse background-image:url(images/img22.png) --- class:empty-slide,myinverse background-image:url(images/img23.png) --- class:empty-slide,myinverse background-image:url(images/img24.png) --- class:empty-slide,myinverse background-image:url(images/img25.png) --- class:empty-slide,myinverse background-image:url(images/img26.png) --- class:empty-slide,myinverse background-image:url(images/img27.png) --- class:empty-slide,myinverse background-image:url(images/img28.png) --- ### Tmod statistics Since normalized ranking for a gene `\(i\)` `\(r'_{i}=\frac{r_i}{N} \sim U(0,1)\)` -- then `\(-\ln{r'_{i}} \sim Exp(1)\)` -- and `\(-2\cdot\ln{r'_i} \sim \chi^2_2\)` -- thus, if `\(G\)` is a set of genes, `\(-2\cdot\sum_{i \in G}\ln{r'_i} \sim \sum_{i \in G}\chi^2_2 \sim \chi^2_{2\cdot n(G)}\)` This gives a fast and reliable statistics with a known distribution which can be used to associate a p-value to any set of genes. ---  .footnote[*Zyla J, Marczyk M, Domaszewska T, Kaufmann SH, Polanska J, Weiner 3rd J. Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms. Bioinformatics. 2019 Dec 15;35(24):5146-54.*] ---  ---  .pull-left[Vaccine with an adjuvant] .pull-right[Vaccine without an adjuvant] .footnote[*Weiner, January, et al. "Characterization of potential biomarkers of reactogenicity of licensed antiviral vaccines: randomized controlled clinical trials conducted by the BIOVACSAFE consortium." Scientific reports 9.1 (2019): 1-14.*] --- ## Network analysis (generation 3) Genes / proteins may be linked by * co-expression * common pathways * literature (co-citation) * synteny (co-occurence in genomes) * common motifs --- ## STRING 