Practicals 04: Gentle introduction to programming and R

class: center, middle, inverse, title-slide

.title[
# Practicals 04: Gentle introduction to programming and R
]
.subtitle[
## BE_22 Bioinformatics SS 21
]
.author[
### January Weiner
]
.date[
### 2024-05-10
]

---

## Questionnaire

Please fill out the following questionnaire:

[https://forms.gle/XDVMAZhQ9qPKBmYQ8](https://forms.gle/XDVMAZhQ9qPKBmYQ8)

![](weiner_BE_22_practicals_04_exercises_files/figure-html/unnamed-chunk-2-1.png)

---

## Introduction

This session is going to be a little bit different. You will observe what I
do on the screen rather then be left to do the exercises.

**Aims:**

* Give you a jump start
 * Give you good habits
 * Proceed along a helix

---

## Installing R

* in Linux: pretty straightforward, install R (preferably >= 4.1) then R studio.
 * in Windows: 
   - install R (>= 4.1)
   - install R studio (rstudio.com)
   - install R tools
---

## Example R session

---

## Workspaces

Workspace is basically a folder which contains a few special files in
 which R stores project-specific data.

* `Rhistory` (hidden file) – a text file containing all commands that you
    have issued
  * `Rdata` (hidden file) – a binary file containing your workspace (all
    variables created)
  * `<filename>.Rproj` – Rstudio R project file containing some
    rstudio-specific settings (text file)
  * Anything else should be save by you

---

## Exercise

* Start R studio `$\rightarrow$` File `$\rightarrow$` New project
    `$\rightarrow$` New directory `$\rightarrow$` New project and create a new
    project.
  * Examine the Files pane (lower right); what does it indicate?
  * Try to open the ".Rproj" file in a text editor. 
  * Go to File `$\rightarrow$` New File `$\rightarrow$` `$\rightarrow$` R Script
    to create a new R script.
  * In the new file, write a simple R statement, for example:

```
a <- 1:10
```

* Press Ctrl/Cmd-Enter. 
   * What happens? What do you see in the console? 
   * What do you see under "Environment" on the top right?

---

## A few notes on R

* Why programming?
 * Why R?
 * Alternatives: Python, matlab, other statistical languages
 * R vs matlab
 * "There is more than one way of doing it" (but one way will usually be
   optimal)

---

## R language basics (demo)

* Assignment and variables: `a <- 2` or `myBigVar <- "test"`
  * Note: there are no "singletons", everything is a vector (but maybe of
    length one)
 * vectors and multiple assignment: `a <- c(1, 7, 9)`
 * operators: 
   * `3 + 5` 
   * `a * 7` 
   * `5 / /` 
   * `5 %% 7` 
   * `5 %/% 7` 
   * `5^7`
   * operators often are "vectorized", that is they work with vectors:
      * `c(1, 5, 7) * 8`
 * functions: `sum(c(1, 2, 3))`
   * basically everything is a function, even the operators

---

## Exercises

* create variables: a string, a number using the `<-` operator
   * how to create the variables?
   * how to view the variables?
 * what does `1:5` do?
 * what happens when you add a number to a vector? (i.e. `c(3, 1, 4) + 5`)
 * what happens when you multiply a vector with a number?
 * what happens when you add two vectors?

* What does the `length()` function do?

---

## Remember: language is communication

* Your code will be seen by others
 * And this is a good thing!
 * Documentation *is* important
 * Reproducibility matters

---

## Literate programming

.pull-left[
![Donald E. Knuth](images/knuth.jpg)
]

.pull-right[
I believe that the time is ripe for significantly better documentation
of programs, and that we can best achieve this by considering
programs to be works of literature. Hence, my title: “Literate
Programming.”

Let us change our traditional attitude to the construction of
programs: Instead of imagining that our main task is to instruct a
computer what to do, let us concentrate rather on explaining to
human beings what we want a computer to do.

(Donald E. Knuth)
]

---

## Exercise

* Go to `File` `$\rightarrow$` `New File` `$\rightarrow$` rmarkdown file
 * Experiment with the file:
   * Convert it to HTML
   * Convert it to PDF
   * Convert it to Word document
   * Try changing something: title, text, R code
   * Click on the little wheel (configure) buttons next to code chunks, see
     what they do
 * Create a presentation

---

## Documenting your code

* Better a lousy documentation than none at all
 * Use spaces, empty lines, comments to structure your code
 * COMMENT, COMMENT, COMMENT
 * Document in plain text files and source code files

---

## Writing code

Keep your code clean:

* be consistent!
 * use meaningful variable and function names
 * don't use shorthands
 * refactorize
 * create distinct code chunks
 * split the code into meaningful scripts
 * use a formatting style and stick to it

E.g.
http://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html

---

## Simplify!

* make your code as simple as possible
 * make your functions versatile and simple
 * use simple data types if possible
 * don't overdo it!

---

## You never know

* what your code evolves into
 * when you will need to publish it
 * when someone will want to see it

…so prepare in advance!

---

## R language basics (demo)

Assignment and variables: `a <- 2` or `myBigVar <- "test"`

Note: there are no "singletons" (single values), everything is a vector
 (but often of length one)

.footnote[Assignment assigns a value (vector or another data type) to a
 "label"; just like in maths. This "label" is called a *variable*]

---

## R data structures

The basic data structure in R is a *vector*. It is just a number of variables of
the same type.

Types of vector variables:

* integer (..., -1, 0, 1, 2, ...)
 * numeric (floating point a.k.a. double)
 * character (string)

There are other basic data structures (list, data frame) and many, many
specialized *classes*. We will figure them out when we come across them.

---

## R operators

* operators: 
   * `3 + 5` 
   * `a * 7` 
   * `5 / 7` 
   * `5 %% 7` 
   * `5 %/% 7` 
   * `5^7`
   * operators often are "vectorized", that is they work with vectors:
      * `c(1, 5, 7) * 8`

---

## R functions

R functions work very much like mathematical functions. They take certain
arguments and return a value.

* functions: `sum(c(1, 2, 3))`
   * basically everything is a function, even the operators

.footnote[In many other programming languages it is common to have so
called *side effects*, that is, a function modifies something in the
environment. This can be done in R, but you should not do it. R is a
functional language!]

---

## Exercises

* What does the `length()` function do?

---

## Exercise: Working with markdown

1. Try out the following: creating **bold** (`**bold**`), *italics*
    (`*italics*`), monospace font (`\`function\``)

2. Make a footnote (not in Jupyter):

Text with a footnote^[This is the footnote.].
 
 3. Download an image to your working directory (not in Jupyter!). Insert it into the
    rmarkdown document:

![Image caption](image_file_name.jpg)

3. For Jupyter: simply drag & drop an image into a cell. (or use the HTML
    syntax, but the file must be in the same folder as the jupyter notebook
    file or in a subdirectory)

4. Create a bullet point and numbered list. What happens if you use
     different numbers or change the order of the numbers?

5. Create a link to Google scholar with

[Google scholar](https://scholar.google.com)

---

## Exercise: Lists

* Create some lists:

```
 * one
 * two 
 * three

1. one
 2. two
 3. three
```

---

## Working with markdown

4. Create a simple table, like this:

```
  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1
```

.footnote[In Jupyter Notebook, the markdown has a slightly different syntax. See [https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html)]

---

## Better tables with Rmarkdown

Two colons mean that we access a function from the `knitr` package without
loading it with `library`.

```r
knitr::kable(iris)
```

---

## Analysis of the iris data set (simple stats)

Before you start doing *anything*, take a look at the iris data set (if you
don't see it, run `data(iris)`). That is, run a few descriptive analyses.

**Note:** Store *every* analysis in Rmarkdown, that is the point of the
exercise. Add your comments!

1. Use the function `summary()` on the iris data set.

2. Create a boxplot:

boxplot(Sepal.Width ~ Species, data=iris)

---

## Analysis of the iris data set (simple stats)

Before we come to hypothesis testing, let us make a PCA (principal
component analysis). We will use the build-in data set iris, which is a
data frame. The first four columns contain the measurements, the fifth is
the only covariate (species).

```r
pca <- prcomp(iris[,1:4], scale.=TRUE)
```

---

## Analysis of the iris data set (simple stats)

1. The pca object is a list. What elements does it contain? How can you
    check that? (try the `names` function)
 2. The element `x` (`pca$x`) contains the principal components. How many
    principal components are there? (why?)
 3. Try using the basic plot function `plot(x, y)` to plot first two
    components.
 4. The `col` option of `plot` takes a character or numeric vector with
    colors (you can use `red`, or you can use `1`, both work). Can you
    create a column in the `iris` data frame containing colors that
    correspond to the species?

---

## Plotting with ggplot2 (very basic!)

.pull-left[

```r
df <- cbind(iris, pca$x)
library(ggplot2)
ggplot(df, aes(x=PC1, y=PC2, color=Species)) + 
  geom_point()
```
]

.pull-right[

![](weiner_BE_22_practicals_04_exercises_files/figure-html/pcaplot-1.png)

]

---

.pull-left[

```r
df <- cbind(iris, pca$x)
library(ggplot2)
ggplot(df, aes(x=PC1, y=PC2, color=Species)) + 
  geom_point() + ggthemes::theme_tufte()
```

![](weiner_BE_22_practicals_04_exercises_files/figure-html/pcaplot2-1.png)
]

.pull-right[

```r
df <- cbind(iris, pca$x)
library(ggplot2)
ggplot(df, aes(x=PC1, y=PC2, color=Species)) + 
  geom_point() + ggthemes::theme_economist_white()
```

![](weiner_BE_22_practicals_04_exercises_files/figure-html/pcaplot3-1.png)
]

---

## Very basic statistics

First, a t.test to check whether the differences in Sepal Width between
*I. versicolor* and *I. virginica* are statistically significant.

```r
versicolor <- iris$Species == "versicolor"
ver_sw <- iris$Sepal.Width[versicolor]
virginica <- iris$Species == "virginica"
vir_sw <- iris$Sepal.Width[virginica]

t.test(ver_sw, vir_sw)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  ver_sw and vir_sw
## t = -3.2058, df = 97.927, p-value = 0.001819
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.33028364 -0.07771636
## sample estimates:
## mean of x mean of y 
##     2.770     2.974
```

---

## t.test object

1. Save the output of t.test to a variable. (use the assignment operator)

2. Examine this variable. Is that a list? What elements does it have?
    Which elements correspond to the output?

3. Load the library `broom`. What happens when you do

tidy(t.test(ver_sw, vir_sw))

---

## Very basic statistics

First, a t.test to check whether the differences in Sepal Width between
*I. versicolor* and *I. virginica* are statistically significant.

```r
versicolor <- iris$Species == "versicolor"
ver_sw <- iris$Sepal.Width[versicolor]
virginica <- iris$Species == "virginica"
vir_sw <- iris$Sepal.Width[virginica]

t.test(ver_sw, vir_sw)
```

---

## Very basic statistics

A pairwise t-test between each group

```r
pairwise.t.test(iris$Sepal.Width, iris$Species)
```

---

## Very basic statistics

Next, a simple ANOVA to check whether Species is a significant covariate.

```r
iris_anova <- aov(Sepal.Width ~ Species, data=iris)
summary(iris_anova)
```

```
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  11.35   5.672   49.16 <2e-16 ***
## Residuals   147  16.96   0.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## ANOVA post-hoc Tukey's test

```r
TukeyHSD(iris_anova, "Species")
```

```
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
## 
## $Species
##                        diff         lwr        upr     p adj
## versicolor-setosa    -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa     -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor  0.204  0.04314472  0.3648553 0.0087802
```

---

## Adding bibliography (in Rmarkdown)

Bibliography must be found at the end of the document. There are basically
two ways of dealing with bibliography:

1. Manual: you simply type in the text (Weiner et al. 2022) and then place
    the corresponding bibliographic entry at the end of the document. This
    is easier in the very, very short run.

2. Use a bibliography manager. This is easy in Rmarkdown (it is also
    possible in Jupyter)

---

## Adding bibliography (in Rmarkdown)

1. Create a new text file called "bibliography.bib". You can do that in
    Rstudio for some OS (simply create a new .R file, but save it as
    "bibliography.bib"). In other OS (notably Windows), you might need to
    create an empty text file and then rename it to "bibliography.bib".
 2. Open the bibliography.bib file in Rstudio or a text editor. Go to
    google scholar and search for a paper on an interesting topic (whatever
    comes to your mind. If you don't know what, look for the paper by
    Watson and Crick on DNA structure).
 3. Click on the citation icon `"`, and then on "BibTex". Copy the contents
    to your newly created bibliography file and save it. Note what the
    identifier of the article is (something like `watson1953structure`).
 4. In your markdown file, add the line `bibliography: bibliography.bib` to
    the header. 
 5. In the body of the markdown file, add somewhere the citation:
    `[@watson1953structure]`. Knit the document and look at the output.
 6. What happens if you use `@watson1953structure` (without brackets),
    `[-@watson1953structure]` or `[@watson1953structure; page 1]`?
 7. Go to [https://www.zotero.org/styles](https://www.zotero.org/styles) and search
    for the citation style of "Nature".
    Download it to your working directory. To the markdown header add the
    line `csl: nature.csl` (or whatever the file is called). Knit and look
    at the output.

---

## Adding bibliography (in Jupyter)

This seems not to be very straightforward. If you do not to make manual
references, please start with reading 
[this documentation](https://nbviewer.org/github/ipython/nbconvert-examples/blob/master/citations/Tutorial.ipynb)
and [this answer](https://stackoverflow.com/questions/30061902/how-to-handle-citations-in-ipython-notebook) from StackOverflow.

---

## Homework: Rmarkdown analysis of the iris data set

* Follow the instructions from the practicals to generate an Rmarkdown
   file with the analysis of the iris data set. 
 * Add short comments for each step, indicating what the step is doing.
 * Make sure the Rmd file can be knitted.
 * Submit the Rmd file (only the Rmd file) as your homework.

Alternatively, use Jupyter Notebook:

* Create a new Jupyter Notebook
 * As above: repeat the steps from the practicals
 * Add short comments for each step, indicating what the step is doing.
 * Submit your ipynb file as your homework.