class: center, middle, inverse, title-slide .title[ # Practicals 04: Gentle introduction to programming and R ] .subtitle[ ## BE_22 Bioinformatics SS 21 ] .author[ ### January Weiner ] .date[ ### 2024-05-10 ] --- --- ## Questionnaire Please fill out the following questionnaire: [https://forms.gle/XDVMAZhQ9qPKBmYQ8](https://forms.gle/XDVMAZhQ9qPKBmYQ8) <!-- --> --- ## Introduction This session is going to be a little bit different. You will observe what I do on the screen rather then be left to do the exercises. **Aims:** * Give you a jump start * Give you good habits * Proceed along a helix --- ## Installing R * in Linux: pretty straightforward, install R (preferably >= 4.1) then R studio. * in Windows: - install R (>= 4.1) - install R studio (rstudio.com) - install R tools --- ## Example R session --- ## Workspaces Workspace is basically a folder which contains a few special files in which R stores project-specific data. * `Rhistory` (hidden file) – a text file containing all commands that you have issued * `Rdata` (hidden file) – a binary file containing your workspace (all variables created) * `<filename>.Rproj` – Rstudio R project file containing some rstudio-specific settings (text file) * Anything else should be save by you --- ## Exercise * Start R studio `\(\rightarrow\)` File `\(\rightarrow\)` New project `\(\rightarrow\)` New directory `\(\rightarrow\)` New project and create a new project. * Examine the Files pane (lower right); what does it indicate? * Try to open the ".Rproj" file in a text editor. * Go to File `\(\rightarrow\)` New File `\(\rightarrow\)` `\(\rightarrow\)` R Script to create a new R script. * In the new file, write a simple R statement, for example: ``` a <- 1:10 ``` * Press Ctrl/Cmd-Enter. * What happens? What do you see in the console? * What do you see under "Environment" on the top right? --- ## A few notes on R * Why programming? * Why R? * Alternatives: Python, matlab, other statistical languages * R vs matlab * "There is more than one way of doing it" (but one way will usually be optimal) --- ## R language basics (demo) * Assignment and variables: `a <- 2` or `myBigVar <- "test"` * Note: there are no "singletons", everything is a vector (but maybe of length one) * vectors and multiple assignment: `a <- c(1, 7, 9)` * operators: * `3 + 5` * `a * 7` * `5 / /` * `5 %% 7` * `5 %/% 7` * `5^7` * operators often are "vectorized", that is they work with vectors: * `c(1, 5, 7) * 8` * functions: `sum(c(1, 2, 3))` * basically everything is a function, even the operators --- ## Exercises * create variables: a string, a number using the `<-` operator * how to create the variables? * how to view the variables? * what does `1:5` do? * what happens when you add a number to a vector? (i.e. `c(3, 1, 4) + 5`) * what happens when you multiply a vector with a number? * what happens when you add two vectors? * What does the `length()` function do? --- ## Remember: language is communication * Your code will be seen by others * And this is a good thing! * Documentation *is* important * Reproducibility matters --- ## Literate programming .pull-left[  ] .pull-right[ I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.” Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. (Donald E. Knuth) ] --- ## Exercise * Go to `File` `\(\rightarrow\)` `New File` `\(\rightarrow\)` rmarkdown file * Experiment with the file: * Convert it to HTML * Convert it to PDF * Convert it to Word document * Try changing something: title, text, R code * Click on the little wheel (configure) buttons next to code chunks, see what they do * Create a presentation --- ## Documenting your code * Better a lousy documentation than none at all * Use spaces, empty lines, comments to structure your code * COMMENT, COMMENT, COMMENT * Document in plain text files and source code files --- ## Writing code Keep your code clean: * be consistent! * use meaningful variable and function names * don't use shorthands * refactorize * create distinct code chunks * split the code into meaningful scripts * use a formatting style and stick to it E.g. http://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html --- ## Simplify! * make your code as simple as possible * make your functions versatile and simple * use simple data types if possible * don't overdo it! --- ## You never know * what your code evolves into * when you will need to publish it * when someone will want to see it …so prepare in advance! --- ## R language basics (demo) Assignment and variables: `a <- 2` or `myBigVar <- "test"` Note: there are no "singletons" (single values), everything is a vector (but often of length one) .footnote[Assignment assigns a value (vector or another data type) to a "label"; just like in maths. This "label" is called a *variable*] --- ## R data structures The basic data structure in R is a *vector*. It is just a number of variables of the same type. Types of vector variables: * integer (..., -1, 0, 1, 2, ...) * numeric (floating point a.k.a. double) * character (string) There are other basic data structures (list, data frame) and many, many specialized *classes*. We will figure them out when we come across them. --- ## R operators * operators: * `3 + 5` * `a * 7` * `5 / 7` * `5 %% 7` * `5 %/% 7` * `5^7` * operators often are "vectorized", that is they work with vectors: * `c(1, 5, 7) * 8` --- ## R functions R functions work very much like mathematical functions. They take certain arguments and return a value. * functions: `sum(c(1, 2, 3))` * basically everything is a function, even the operators .footnote[In many other programming languages it is common to have so called *side effects*, that is, a function modifies something in the environment. This can be done in R, but you should not do it. R is a functional language!] --- ## Exercises * create variables: a string, a number using the `<-` operator * how to create the variables? * how to view the variables? * what does `1:5` do? * what happens when you add a number to a vector? (i.e. `c(3, 1, 4) + 5`) * what happens when you multiply a vector with a number? * what happens when you add two vectors? * What does the `length()` function do? --- ## Exercise: Working with markdown 1. Try out the following: creating **bold** (`**bold**`), *italics* (`*italics*`), monospace font (`\`function\``) 2. Make a footnote (not in Jupyter): Text with a footnote^[This is the footnote.]. 3. Download an image to your working directory (not in Jupyter!). Insert it into the rmarkdown document:  3. For Jupyter: simply drag & drop an image into a cell. (or use the HTML syntax, but the file must be in the same folder as the jupyter notebook file or in a subdirectory) 4. Create a bullet point and numbered list. What happens if you use different numbers or change the order of the numbers? 5. Create a link to Google scholar with [Google scholar](https://scholar.google.com) --- ## Exercise: Lists * Create some lists: ``` * one * two * three 1. one 2. two 3. three ``` --- ## Working with markdown 4. Create a simple table, like this: ``` Right Left Center Default ------- ------ ---------- ------- 12 12 12 12 123 123 123 123 1 1 1 1 ``` .footnote[In Jupyter Notebook, the markdown has a slightly different syntax. See [https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html)] --- ## Better tables with Rmarkdown Two colons mean that we access a function from the `knitr` package without loading it with `library`. ```r knitr::kable(iris) ``` --- ## Analysis of the iris data set (simple stats) Before you start doing *anything*, take a look at the iris data set (if you don't see it, run `data(iris)`). That is, run a few descriptive analyses. **Note:** Store *every* analysis in Rmarkdown, that is the point of the exercise. Add your comments! 1. Use the function `summary()` on the iris data set. 2. Create a boxplot: boxplot(Sepal.Width ~ Species, data=iris) --- ## Analysis of the iris data set (simple stats) Before we come to hypothesis testing, let us make a PCA (principal component analysis). We will use the build-in data set iris, which is a data frame. The first four columns contain the measurements, the fifth is the only covariate (species). ```r pca <- prcomp(iris[,1:4], scale.=TRUE) ``` --- ## Analysis of the iris data set (simple stats) 1. The pca object is a list. What elements does it contain? How can you check that? (try the `names` function) 2. The element `x` (`pca$x`) contains the principal components. How many principal components are there? (why?) 3. Try using the basic plot function `plot(x, y)` to plot first two components. 4. The `col` option of `plot` takes a character or numeric vector with colors (you can use `red`, or you can use `1`, both work). Can you create a column in the `iris` data frame containing colors that correspond to the species? --- ## Plotting with ggplot2 (very basic!) .pull-left[ ```r df <- cbind(iris, pca$x) library(ggplot2) ggplot(df, aes(x=PC1, y=PC2, color=Species)) + geom_point() ``` ] -- .pull-right[ <!-- --> ] --- .pull-left[ ```r df <- cbind(iris, pca$x) library(ggplot2) ggplot(df, aes(x=PC1, y=PC2, color=Species)) + geom_point() + ggthemes::theme_tufte() ``` <!-- --> ] .pull-right[ ```r df <- cbind(iris, pca$x) library(ggplot2) ggplot(df, aes(x=PC1, y=PC2, color=Species)) + geom_point() + ggthemes::theme_economist_white() ``` <!-- --> ] --- ## Very basic statistics First, a t.test to check whether the differences in Sepal Width between *I. versicolor* and *I. virginica* are statistically significant. ```r versicolor <- iris$Species == "versicolor" ver_sw <- iris$Sepal.Width[versicolor] virginica <- iris$Species == "virginica" vir_sw <- iris$Sepal.Width[virginica] t.test(ver_sw, vir_sw) ``` ``` ## ## Welch Two Sample t-test ## ## data: ver_sw and vir_sw ## t = -3.2058, df = 97.927, p-value = 0.001819 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.33028364 -0.07771636 ## sample estimates: ## mean of x mean of y ## 2.770 2.974 ``` --- ## t.test object 1. Save the output of t.test to a variable. (use the assignment operator) 2. Examine this variable. Is that a list? What elements does it have? Which elements correspond to the output? 3. Load the library `broom`. What happens when you do tidy(t.test(ver_sw, vir_sw)) --- ## Very basic statistics First, a t.test to check whether the differences in Sepal Width between *I. versicolor* and *I. virginica* are statistically significant. ```r versicolor <- iris$Species == "versicolor" ver_sw <- iris$Sepal.Width[versicolor] virginica <- iris$Species == "virginica" vir_sw <- iris$Sepal.Width[virginica] t.test(ver_sw, vir_sw) ``` --- ## Very basic statistics A pairwise t-test between each group ```r pairwise.t.test(iris$Sepal.Width, iris$Species) ``` --- ## Very basic statistics Next, a simple ANOVA to check whether Species is a significant covariate. ```r iris_anova <- aov(Sepal.Width ~ Species, data=iris) summary(iris_anova) ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## Species 2 11.35 5.672 49.16 <2e-16 *** ## Residuals 147 16.96 0.115 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## ANOVA post-hoc Tukey's test ```r TukeyHSD(iris_anova, "Species") ``` ``` ## Tukey multiple comparisons of means ## 95% family-wise confidence level ## ## Fit: aov(formula = Sepal.Width ~ Species, data = iris) ## ## $Species ## diff lwr upr p adj ## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000 ## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000 ## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802 ``` --- ## Adding bibliography (in Rmarkdown) Bibliography must be found at the end of the document. There are basically two ways of dealing with bibliography: 1. Manual: you simply type in the text (Weiner et al. 2022) and then place the corresponding bibliographic entry at the end of the document. This is easier in the very, very short run. 2. Use a bibliography manager. This is easy in Rmarkdown (it is also possible in Jupyter) --- ## Adding bibliography (in Rmarkdown) 1. Create a new text file called "bibliography.bib". You can do that in Rstudio for some OS (simply create a new .R file, but save it as "bibliography.bib"). In other OS (notably Windows), you might need to create an empty text file and then rename it to "bibliography.bib". 2. Open the bibliography.bib file in Rstudio or a text editor. Go to google scholar and search for a paper on an interesting topic (whatever comes to your mind. If you don't know what, look for the paper by Watson and Crick on DNA structure). 3. Click on the citation icon `"`, and then on "BibTex". Copy the contents to your newly created bibliography file and save it. Note what the identifier of the article is (something like `watson1953structure`). 4. In your markdown file, add the line `bibliography: bibliography.bib` to the header. 5. In the body of the markdown file, add somewhere the citation: `[@watson1953structure]`. Knit the document and look at the output. 6. What happens if you use `@watson1953structure` (without brackets), `[-@watson1953structure]` or `[@watson1953structure; page 1]`? 7. Go to [https://www.zotero.org/styles](https://www.zotero.org/styles) and search for the citation style of "Nature". Download it to your working directory. To the markdown header add the line `csl: nature.csl` (or whatever the file is called). Knit and look at the output. --- ## Adding bibliography (in Jupyter) This seems not to be very straightforward. If you do not to make manual references, please start with reading [this documentation](https://nbviewer.org/github/ipython/nbconvert-examples/blob/master/citations/Tutorial.ipynb) and [this answer](https://stackoverflow.com/questions/30061902/how-to-handle-citations-in-ipython-notebook) from StackOverflow. --- ## Homework: Rmarkdown analysis of the iris data set * Follow the instructions from the practicals to generate an Rmarkdown file with the analysis of the iris data set. * Add short comments for each step, indicating what the step is doing. * Make sure the Rmd file can be knitted. * Submit the Rmd file (only the Rmd file) as your homework. Alternatively, use Jupyter Notebook: * Create a new Jupyter Notebook * As above: repeat the steps from the practicals * Add short comments for each step, indicating what the step is doing. * Submit your ipynb file as your homework.