class: center, middle, inverse, title-slide .title[ # Lecture 8: Gene expression analysis ] .subtitle[ ## BE_22 Bioinformatics SS 21 ] .author[ ### January Weiner ] .date[ ### 2024-06-10 ] --- <!-- class:empty-slide,myinverse background-image:url(images/arnolfini.jpg) --> ## Plan for today * Gene expression analysis – as an in-depth example of high-throughput analyses --- ## Transcriptomic methods * QPCR: precise, low-throughput * Nanostring: precise, mid-throughput (~ 500-1000 genes) * Microarray: less exact, high-throughput, pre-defined genes * RNA-Seq: very flexible, less exact, high-throughput --- ## QC – quality control * Method-dependent quality measures * aligned reads * duplications * intron / exon binding * Y chromosome binding * adapter content * Bias? Clustering? --- ## (Demo) Example RNA-Seq QC Document derived using the `multiQC` package can be retrieved [here](multiqc.all_samples.all_mates.qc_report.html) --- ## The GSE156063 data set <table> <thead> <tr> <th style="text-align:left;"> Col </th> <th style="text-align:left;"> Class </th> <th style="text-align:right;"> NAs </th> <th style="text-align:right;"> unique </th> <th style="text-align:left;"> Summary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> label </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> filename </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> md5 </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> replicate </td> <td style="text-align:left;"> <int> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> title </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> geo_accession </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> age </td> <td style="text-align:left;"> <dbl> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:left;"> 20.0 [40.0 <50.5> 65.0] 89.0 </td> </tr> <tr> <td style="text-align:left;"> disease_state </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> no virus: 100, SC2: 93, other virus: 41 </td> </tr> <tr> <td style="text-align:left;"> gender </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> F: 124, M: 110 </td> </tr> <tr> <td style="text-align:left;"> group </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> no: 100, SC2: 93, other: 41 </td> </tr> <tr> <td style="text-align:left;"> sizeFactor </td> <td style="text-align:left;"> <dbl> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> 0.0754 [0.5049 <1.1083> 2.0840] 7.4839 </td> </tr> <tr> <td style="text-align:left;"> replaceable </td> <td style="text-align:left;"> <lgl> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TRUE: 234 FALSE: 0 </td> </tr> </tbody> </table> --- ## The GSE156063 data set <table> <thead> <tr> <th style="text-align:left;"> Col </th> <th style="text-align:left;"> Class </th> <th style="text-align:right;"> NAs </th> <th style="text-align:right;"> unique </th> <th style="text-align:left;"> Summary </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> label </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> filename </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> md5 </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> replicate </td> <td style="text-align:left;"> <int> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> title </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;"> geo_accession </td> <td style="text-align:left;"> <chr> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> All values unique </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> age </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> <dbl> </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 68 </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 20.0 [40.0 <50.5> 65.0] 89.0 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> disease_state </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> <chr> </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 3 </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> no virus: 100, SC2: 93, other virus: 41 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> gender </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> <chr> </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 2 </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> F: 124, M: 110 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> group </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> <chr> </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 0 </td> <td style="text-align:right;font-weight: bold;color: white !important;background-color: #D7261E !important;"> 3 </td> <td style="text-align:left;font-weight: bold;color: white !important;background-color: #D7261E !important;"> no: 100, SC2: 93, other: 41 </td> </tr> <tr> <td style="text-align:left;"> sizeFactor </td> <td style="text-align:left;"> <dbl> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 234 </td> <td style="text-align:left;"> 0.0754 [0.5049 <1.1083> 2.0840] 7.4839 </td> </tr> <tr> <td style="text-align:left;"> replaceable </td> <td style="text-align:left;"> <lgl> </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TRUE: 234 FALSE: 0 </td> </tr> </tbody> </table> --- How do the count data look like? <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> RR057e_00037 </th> <th style="text-align:right;"> RR057e_00039 </th> <th style="text-align:right;"> RR057e_00042 </th> <th style="text-align:right;"> RR057e_00047 </th> <th style="text-align:right;"> RR057e_00049 </th> <th style="text-align:right;"> RR057e_00051 </th> <th style="text-align:left;"> ... </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ENSG00000000003 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 488 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 143 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000000419 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 63 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000000457 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000000460 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 63 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000000938 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000000971 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> 236 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 73 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000001036 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 198 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 76 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000001084 </td> <td style="text-align:right;"> 155 </td> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 1799 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 472 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000001167 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 135 </td> <td style="text-align:right;"> 26 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 101 </td> <td style="text-align:left;"> ... </td> </tr> <tr> <td style="text-align:left;"> ENSG00000001460 </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 449 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 165 </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> --- # QC using PCA Dimension reduction with PCA allows us to see batch effects and also have a first preliminary look at the data. --- .pull-left[ ### Before PCA: <!-- --> Correlation between x and y = 0.66 ] .pull-right[ ### After PCA: <!-- --> Correlation between PC1 and PC2 = -0.00 ] --- .pull-left[ ### Before PCA: <!-- --> Correlation between x and y = 0.84 ] .pull-right[ ### After PCA: <!-- --> Correlation between PC1 and PC2 = 0.00 ] --- .pull-left[ ### Before PCA: <!-- --> ] .pull-right[ ### After PCA: <!-- --> ] ---
--- ## PCA "scree plot" <!-- --> --- ## Heatmaps of the most variable genes <!-- --> --- ## Differential gene expression analysis * Finding differences between groups * Models may vary --- ## GLMs – generalized linear models * OLS – ordinary least squares regression * Fit data (independent + dependent variables) to a model – a bit like in OLS * Differences to OLS * **link function** allows models in which variables have non-normal distributions * maximum likelihood estimation (rather than least squares) * Count data: negative binomial distribution of the dependent variables --- ## GLMs – generalized linear models For a given gene, we assume that its counts in a sample `\(i\)` have negative binomial distribution: `$$\text{counts} \sim f_{\text{binom}}(\mu_i, \alpha)$$` (Negative binomial distribution is a generalization of a Poisson distribution that you might know) Statistical test: **Null hypothesis `\(H_0\)`:** For each sample `\(i\)`, `\(\mu_i = \mu_0\)` **Alternative hypothesis `\(H_1\)`:** `$$\log{\mu_i} = \beta_0 + x_i \times \beta_G$$` Where `$$\beta_G = \left\{\begin{array}{ll}0 & x_i \text{is control} \\ 1 & x_i \text{is treatment}\end{array}\right.$$` This model is then tested against the `\(H_0\)` using a Wald test. --- ## Correction for multiple testing .pull-left[ Statistical testing is like a russian roulette: each time you test, you have a chance of drawing a False Positive (FP). If you draw many times, you are bound to have "significant" FP's. * Family-wise error correction (e.g. Bonferroni): guarantees that the chance of having 1 or more FP's in the results significant at `\(\alpha\)` is `\(\alpha\)`. * False-discovery rate (FDR; e.g. Benjamini-Hochberg): guarantees that the *fraction* of FP's in the results significant at `\(\alpha\)` is no more than `\(\alpha\)`. ] .pull-right[ **Remember:** any results not corrected for multiple testing are *not trustworthy!* ] --- ## R equation language R uses a particular representation of linear model equations: e.g. `$$y = \beta_0 + \beta_1 \times y$$` is represented as `y ~ x` Whereas `$$y = \beta_1 \times y$$` (no intercept, i.e. intercept `\(= 0\)`) becomes `y ~ 0 + x` --- <pre class="r-output"><code>## ENSEMBL SYMBOL ENTREZID REFSEQ ## 1 ENSG00000126709 IFI6 2537 NM_002038 ## 2 ENSG00000185745 IFIT1 3434 NM_001270927 ## 3 ENSG00000137959 IFI44L 10964 NM_006820 ## 4 ENSG00000169248 CXCL11 6373 NM_001302123 ## 5 ENSG00000134326 CMPK2 129607 NM_001256477 ## 6 ENSG00000187608 ISG15 9636 NM_005101 ## GENENAME baseMean ## 1 interferon alpha inducible protein 6 603.9613 ## 2 interferon induced protein with tetratricopeptide repeats 1 950.0070 ## 3 interferon induced protein 44 like 1197.3869 ## 4 C-X-C motif chemokine ligand 11 136.6495 ## 5 cytidine/uridine monophosphate kinase 2 338.7044 ## 6 ISG15 ubiquitin like modifier 755.3958 ## log2FoldChange lfcSE stat pvalue padj symbol entrez ## 1 3.405602 0.1994352 17.07623 2.230680e-65 3.564403e-61 IFI6 2537 ## 2 3.334884 0.2138449 15.59487 7.887986e-55 6.302107e-51 IFIT1 3434 ## 3 3.082872 0.2008791 15.34691 3.714854e-53 1.978655e-49 IFI44L 10964 ## 4 5.959395 0.3940944 15.12174 1.164190e-51 3.720517e-48 CXCL11 6373 ## 5 2.758852 0.1936366 14.24758 4.641192e-46 8.240179e-43 CMPK2 129607 ## 6 3.766232 0.2743033 13.73017 6.697215e-43 1.070148e-39 ISG15 9636 </code></pre> --- ## Heatmap of the DE genes <!-- --> --- ## Plotting individual genes .pull-left[ <!-- --> ] -- .pull-right[ <!-- --> ] --- ## What next? * What is our research question? * How do we get from genes to functions or pathways? * How do we validate our data? --- # Clustering * "unbiased" machine learning * group data points (samples, genes, ...) by similarity * applications in many areas of bioinformatics --- .pull-left[ ## Why cluster the genes? * Genes which are co-regulated have similar function `\(\rightarrow\)` identification of "transcriptional modules" * Analysis of the clusters may give us clues about what is happening in the biological system * Better visualization if genes are clustered ] .pull-right[ <img src="weiner_BE_22_lecture_08_files/figure-html/unnamed-chunk-19-1.png" width="60%" /> ] --- class:empty-slide,mywhite background-image:url(images/network_1.png) --- class:empty-slide,mywhite background-image:url(images/network_2.png) --- ### From correlations to distances `\(d_{i,j}\)` – Distance between genes `\(i\)` and `\(j\)` `\(\rho(x_i, x_j)\)` – correlation coefficient between expression of `\(i\)` `\(x_i\)` and `\(j\)` `\(x_j\)` Hard threshold: `$$d_{i,j} \equiv \left\{\begin{array}{ll}0 & |\rho(x_i, x_j)| > \tau\\1 & |\rho(x_i, x_j)| \leq \tau\end{array}\right.$$` Continuouos in `\((0, 1)\)` `$$d_{i,j} \equiv 1 - |\rho(x_i, x_j)|$$` --- class:empty-slide,mywhite background-image:url(images/correlation_modules.png) --- class:empty-slide,mywhite background-image:url(images/correlation_dependence.png) --- ## Many alternatives Other measures exist: * mutual information * different types of correlation (Spearman `\(\rho\)`, Kendall `\(\tau\)`) * distance correlation (Székely 2007) Bottom line: we need to get at distances --- class:empty-slide,myinverse background-image:url(images/desert2.jpg) .mytop[ The number of grains of sand in this picture accurately represents the number of clustering algorithms. ] -- .mytop[ <br/> <br/> <br/> <br/> <br/> Each tree in the picture corresponds to a method of clustering which is robust, reliable and provides automatically high quality clusters. ] --- class:empty-slide,myinverse background-image:url(images/desert2.jpg) .pull-left[ .transpblock[ **PAM** Partitioning Around Medoids. Similar to k-means; predefined number of clusters ] .transpblock[ **SOMs**, Self-organizing maps: Train a neural network to recognize clusters in the data (also SOTA: self-organizing trees) ] .transpblock[ **SVC** – support vector clustering; based on SVM’s (support vector machines) ] .transpblock[ **Fuzzy C-means clustering**: Each sample gets assigned a probability of belonging to each of the clusters. Algorithm similar to k-means. Also similar – “soft k-means” ] ] .pull-right[ .transpblock[ **Hierarchical clustering:** UPGMA (group-average linkage), Ward's method, neighbor joining, single linkage, complete linkage... ] .transpblock[ **Mclust** – model based clustering The data is fit to a statistical model consisting of K normal distributions. ] .transpblock[ **Density based clustering:** identify clusters by local density profiles ] ] --- ### Hierarchical clustering * Start with each element in its own cluster: `\(n_{\text{clust}} = N\)` * Identify two elements with the smallest distance * Recalculate distances using a *linkage* function E.g. complete linkage: `$$D(C_i, C_j) = \max_{e_i \in C_i, e_j \in C_j} D(e_i, e_j)$$` Basically: join the clusters with the smalles distance between the *furthest* elements of the clusters E.g. single linkage: `$$D(C_i, C_j) = \min_{e_i \in C_i, e_j \in C_j} D(e_i, e_j)$$` Basically: join the clusters with the smalles distance between the *closest* elements of the clusters --- class:empty-slide,mywhite background-image:url(images/hclust.png) --- ## k-means Naive k-means: * Assign the elements to random clusters * Repeat: * Calculate the cluster centroids (midpoints) * Assign the elements to the cluster with the closest centroid * Abort if the asignments no longer change .center[ <!-- --> ] --- .center[ <!-- --> ] --- ## Measuring clustering performance .pull-left[ * How many clusters? * Which clustering method? * What parameters? ] -- .pull-right[ Clustering performance measures: * Internal: information scientific * Stability measures: cross-validation, resampling (like bootstrapping in phylogenies!) * External: use external information * a priori information * biological information (how well the clusters correspond to underlying biology) ] --- class:empty-slide,myinverse background-image:url(images/desert2.jpg) -- .mytop[ The number of grains of sand in this picture accurately represents the number of available measures of clustering performance. ] --- ## The elbow method .pull-left[ There is a total amount of variance in the clusters. We can split it into *between cluster* variance and *within cluster* variance. Instead of variance, we use Sum of Squares (SS), because `\(Var(x)\equiv \frac{SS}{n-1}\)`. `$$SS_{\text{tot}} = SS_{\text{within}} + SS_{\text{between}}$$` * If all elements are in one large cluster, `\(SS_{\text{between}}= 0\)`. * If all elements are each in a separate cluster, `\(SS_{\text{within}}= 0\)`. ] .pull-right[  ] --- ## Silhouette plots .pull-left[ For each element, we plot the difference between distance to the center of the cluster where that element is and the distance to the closest *other* cluster center. ] .pull-right[ <!-- --> ] --- <!-- --> --- <!-- --> --- <!-- --> --- Another example: clustering of single cell data  --- Another example: clustering of single cell data  --- Another example: clustering of single cell data 