Today's goal is to reinforce the previous practicals with R, answer questions, and check your understanding. We will also do some basic example statistics.

1 Self-check

  • What is the difference between an R script and an Rmarkdown document?
  • What is a workspace? What is a session? What is a project?
  • Can you create an R script? An Rmarkdown document?
  • Can you convert an Rmarkdown document to an HTML document? A Word document? A PDF document?

Exercise 5.1

Note: store everything you do in Rmarkdown; that is the point of the exercise. Add your comments! A minimal sketch of the finished document is shown after the list.

  • Create a new Rmarkdown document.
  • Remove the contents, but keep the header.
  • Add a title and a subtitle.
  • Insert a code chunk.
  • In the code chunk, create a variable x with value 5.
  • Insert a second code chunk.
  • In the second code chunk, print the value of x using the print() function.
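
For orientation, here is a minimal sketch of what the finished .Rmd source might look like (the title and subtitle are placeholders):

---
title: "My first document"
subtitle: "Practical 5"
output: html_document
---

```{r}
x <- 5
```

```{r}
print(x)
```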

2 The iris data set

The famous iris data set is included in R. It contains measurements of sepal and petal length and width for three species of iris flowers. The data set was collected by the botanist Edgar Anderson and has since become a standard data set in statistics and machine learning because of its simplicity and availability. It was made famous by the statistician Ronald Fisher, who used it in 1936 to illustrate linear discriminant analysis.

We load the data set as follows:

data(iris)

It should now appear in the “Environment” tab on the right-hand side of the RStudio window. You can click on the name of the data set to see its contents or use the View() function.

Exercise 5.2

  • Find out what the petals and sepals of a flower are.
  • How many and what variables do we have in the iris data set? (Hint: use the names() function)
  • How do you access a column of these variables? (Hint: use the $ operator)
  • How do you access the first 10 rows of the iris data set? (Hint: use the head() function)
  • What is the type of these variables? (Hint: use the class() or typeof() function)
  • How many observations do we have in the iris data set? (Hint: use the nrow() or dim() functions)
  • Use the summary() function to get a summary of the data set.
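
To get you started, here are a few of the hinted functions in action (a sketch; inspect the output yourself):

# variable names and dimensions of the data set
names(iris)
dim(iris)

# first 10 rows and the type of a single column
head(iris, 10)
class(iris$Sepal.Length)

# overall summary of all variables
summary(iris)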

3 Very basic statistics

First, a simple t-test to check whether Sepal.Width differs significantly between the setosa and versicolor species.

# select only values from the Sepal.Width column where the Species is setosa
setosa_sw     <- iris$Sepal.Width[ iris$Species == "setosa" ]

# same, but for versicolor and using the "subset" function
versicolor_sw <- subset(iris$Sepal.Width, iris$Species == "versicolor")

t.test(setosa_sw, versicolor_sw)
## 
##  Welch Two Sample t-test
## 
## data:  setosa_sw and versicolor_sw
## t = 9.455, df = 94.698, p-value = 2.484e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5198348 0.7961652
## sample estimates:
## mean of x mean of y 
##     3.428     2.770

Exercise 5.3

  • What is the null hypothesis of the t-test?
  • What is the alternative hypothesis of the t-test?
  • Repeat the t-test for Petal.Width, this time comparing versicolor and virginica (one possible solution is sketched below).
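
One possible solution for the last item, following the same pattern as above:

# select Petal.Width values for the two species
versicolor_pw <- iris$Petal.Width[ iris$Species == "versicolor" ]
virginica_pw  <- subset(iris$Petal.Width, iris$Species == "virginica")

t.test(versicolor_pw, virginica_pw)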

We can also run t-tests between all pairs of groups at once:

pairwise.t.test(iris$Sepal.Width, iris$Species)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris$Sepal.Width and iris$Species 
## 
##            setosa  versicolor
## versicolor < 2e-16 -         
## virginica  9.1e-10 0.0031    
## 
## P value adjustment method: holm

Warning: this is not a paired t-test; a paired t-test is something else entirely (it applies when the data come in pairs!).
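
For illustration, a paired t-test applies when each value in one group is linked to a specific value in the other, for example measurements taken on the same plants before and after a treatment. A sketch with made-up numbers:

# hypothetical paired measurements on the same five plants
before <- c(5.1, 4.9, 4.7, 5.0, 5.4)
after  <- c(5.3, 5.0, 4.9, 5.3, 5.5)
t.test(before, after, paired=TRUE)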

3.1 Analysis of Variance (ANOVA)

Like a t-test, but for more than two groups (and, in more complex designs, more than one factor).

boxplot(Sepal.Width ~ Species, data=iris)

Next, a simple ANOVA to check whether Species is a significant covariate.

iris_anova <- aov(Sepal.Width ~ Species, data=iris)
summary(iris_anova)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  11.35   5.672   49.16 <2e-16 ***
## Residuals   147  16.96   0.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also use the broom package to make a slightly nicer table with results:

library(broom)
tidy(iris_anova)
## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 Species       2  11.3  5.67       49.2  4.49e-17
## 2 Residuals   147  17.0  0.115      NA   NA

3.2 ANOVA post-hoc Tukey’s test

Tukey’s test is a post-hoc test used to compare all pairs of groups after an ANOVA. It is similar in spirit to pairwise t-tests, but with a built-in correction for multiple testing.

TukeyHSD(iris_anova, "Species")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
## 
## $Species
##                        diff         lwr        upr     p adj
## versicolor-setosa    -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa     -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor  0.204  0.04314472  0.3648553 0.0087802

3.3 Correlation analysis

cor.test(iris$Sepal.Width, iris$Sepal.Length)
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Width and iris$Sepal.Length
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

The cor.test() function runs a test for correlation. Different tests are available, but the default is the Pearson correlation test. A common alternative is the Spearman correlation test, which is non-parametric. The Spearman test is more robust to outliers and non-normality, but it is less powerful than the Pearson test when the data are normally distributed.

cor.test(iris$Sepal.Width, iris$Sepal.Length, method="spearman")
## Warning in cor.test.default(iris$Sepal.Width, iris$Sepal.Length, method = "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  iris$Sepal.Width and iris$Sepal.Length
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1667777

4 Visualisation in R

We will start with very simple commands to visualize the data. R has (basically) two main graphics systems: the simple base graphics and the more advanced ggplot2 system. We begin with base graphics and move on to ggplot2 later. There are reasons to know both: base graphics are very simple and easy to use, while ggplot2 is more powerful and allows for more complex visualizations.

plot(iris$Sepal.Length, iris$Sepal.Width)

Now we color the points according to the species of the flower.

The following works only because iris$Species is a factor. Factors look like strings, but underneath they are integers. In the plot() function, these integers correspond to colors: 1 is black, 2 is red, and so on. The pch parameter sets the shape of the points.
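
You can check this yourself: levels() shows the levels of the factor, and as.integer() reveals the underlying integer codes.

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
as.integer(head(iris$Species))
## [1] 1 1 1 1 1 1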

plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=19)

However, at this point it is much more convenient to use ggplot2:

library(ggplot2)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()

The first part sets up the plotting object. Then, we “add” geoms, that is, geometrical representations of the data.
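
Geoms can be combined. For example (a sketch), we can overlay a per-species linear trend line on top of the points:

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_point() +
  geom_smooth(method="lm")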

Here is a trick to change the default theme:

theme_set(theme_minimal())
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()

4.1 Principal component analysis (PCA)

PCA is a method to reduce the dimensionality of the data (a “dimensionality reduction method”). It is very useful for visualizing the data, for finding out which variables contribute most to its variability, and for spotting outliers.

pca <- prcomp(iris[,1:4])
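
Note that prcomp() does not scale the variables to unit variance by default. Here all four measurements are in centimeters, so this is defensible; with variables on different scales you would typically standardize them (pca_scaled is just a name chosen here):

pca_scaled <- prcomp(iris[,1:4], scale.=TRUE)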

The result of the PCA is a list. We can see the contents of the list with the str() function:

str(pca)
## List of 5
##  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
##  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : logi FALSE
##  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"

We can access the objects in the list with the $ operator:

pca$center
pca$scale
pca$rotation
pca$x

The pca$x object is what really interests us. It contains the coordinates of the data in the PCA space. We can plot the first two principal components as follows:

plot(pca$x[,1],pca$x[,2])

However, in order to use ggplot2, we must convert it to a data frame:

pca_df <- as.data.frame(pca$x)
## add the species column
pca_df$Species <- iris$Species
ggplot(pca_df, aes(x=PC1, y=PC2, color=Species)) + geom_point()

Exercise 5.5

Create a data frame which contains both the iris data (the first four columns) and the PCA coordinates (the first two columns of pca$x), as well as the species column. Make plots which combine a principal component (e.g. PC1) and one of the variables. Do you see any correlation between the variables and the principal components?
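
One way to start (a sketch; combined is an arbitrary name):

# combine the original measurements with the first two principal components
combined <- cbind(iris[,1:4], pca$x[,1:2])
combined$Species <- iris$Species

ggplot(combined, aes(x=PC1, y=Petal.Length, color=Species)) + geom_point()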

5 Linear regression analysis

5.1 Simple regression

The basic regression equation:

\[y = \beta_0 + \beta_1 x + \epsilon\]

Here is how we do it in R. We use an example data set from R called “state.x77”. It contains various statistics about the US states from 1977.

library(datasets)
## convert a matrix to a data frame
df <- as.data.frame(state.x77)
ggplot(df, aes(x=Income, y=Illiteracy)) + geom_point()

Now we run a linear regression:

lm_fit <- lm(Illiteracy ~ Income, data=df)
summary(lm_fit)
## 
## Call:
## lm(formula = Illiteracy ~ Income, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79927 -0.46481 -0.09793  0.34011  1.24378 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0932014  0.5765787   5.365  2.3e-06 ***
## Income      -0.0004336  0.0001288  -3.367  0.00151 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5539 on 48 degrees of freedom
## Multiple R-squared:  0.191,  Adjusted R-squared:  0.1742 
## F-statistic: 11.34 on 1 and 48 DF,  p-value: 0.001505
tidy(lm_fit)
## # A tibble: 2 × 5
##   term         estimate std.error statistic    p.value
##   <chr>           <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  3.09      0.577         5.36 0.00000230
## 2 Income      -0.000434  0.000129     -3.37 0.00151

We can also plot the regression line:

ggplot(df, aes(x=Income, y=Illiteracy)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

5.2 Multiple regression

Compared to many other statistics programs, building complex regression models is much simpler in R.

df <- as.data.frame(state.x77)

# clean up the column names. You can also use the function
# `clean_names` from the package `janitor` to do it automatically.
colnames(df) <- c("pop", "income", "illiteracy",
                  "life_exp", "murder", "hs_grad", "frost", "area")
# alternative: df <- clean_names(df)
lm_fit <- lm(life_exp ~ income + illiteracy + murder, data=df)
summary(lm_fit)
## 
## Call:
## lm(formula = life_exp ~ income + illiteracy + murder, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49981 -0.61786 -0.01735  0.48470  2.17123 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.1638497  1.0990196  64.752  < 2e-16 ***
## income       0.0003811  0.0002173   1.754   0.0862 .  
## illiteracy   0.0368692  0.2997928   0.123   0.9027    
## murder      -0.2736317  0.0457495  -5.981 3.09e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8347 on 46 degrees of freedom
## Multiple R-squared:  0.6371, Adjusted R-squared:  0.6134 
## F-statistic: 26.92 on 3 and 46 DF,  p-value: 3.336e-10

5.3 ANOVA with linear regression

ANOVA is practically equivalent to a simple regression model in which x is categorical.

iris_anova <- aov(Sepal.Width ~ Species, data=iris)
summary(iris_anova)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  11.35   5.672   49.16 <2e-16 ***
## Residuals   147  16.96   0.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
iris_lm <- lm(Sepal.Width ~ Species, data=iris)

summary(iris_lm)
## 
## Call:
## lm(formula = Sepal.Width ~ Species, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.128 -0.228  0.026  0.226  0.972 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.42800    0.04804  71.359  < 2e-16 ***
## Speciesversicolor -0.65800    0.06794  -9.685  < 2e-16 ***
## Speciesvirginica  -0.45400    0.06794  -6.683 4.54e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3397 on 147 degrees of freedom
## Multiple R-squared:  0.4008, Adjusted R-squared:  0.3926 
## F-statistic: 49.16 on 2 and 147 DF,  p-value: < 2.2e-16
anova(iris_lm)
## Analysis of Variance Table
## 
## Response: Sepal.Width
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
## Residuals 147 16.962  0.1154                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary(iris_lm) output is not exactly what we want: rather than testing whether there are any differences between the groups, it tests whether each individual group differs from the first group (setosa). To get the same result as with aov, we need to apply the anova() function to the regression model. This function is quite general and can be used in many situations (see the sketch below).
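
For example, anova() can compare two nested regression models, testing whether the additional covariate improves the fit. A sketch using the df from the multiple regression section:

fit1 <- lm(life_exp ~ murder, data=df)
fit2 <- lm(life_exp ~ murder + income, data=df)
anova(fit1, fit2)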

6 Adding bibliography (in Rmarkdown)

The bibliography goes at the end of the document. There are basically two ways of dealing with it:

  1. Manual: you simply type the citation into the text, e.g. “(Weiner et al. 2022)”, and place the corresponding bibliographic entry at the end of the document. This is easier in the very, very short run.

  2. Use a bibliography manager. This is easy in Rmarkdown (and also possible in Jupyter).

6.1 Using a bibliography manager

Example: Altschul paper (Altschul et al. 1997)

  1. Create a new text file called “bibliography.bib”. On some operating systems you can do that in RStudio (simply create a new .R file, but save it as “bibliography.bib”). On others (notably Windows), you might need to create an empty text file and then rename it to “bibliography.bib”.
  2. Open the bibliography.bib file in RStudio or a text editor. Go to Google Scholar and search for a paper on an interesting topic (whatever comes to mind; if you can't think of anything, look for the paper by Watson and Crick on the structure of DNA).
  3. Click on the citation icon (the quotation mark), then on “BibTeX”. Copy the contents to your newly created bibliography file and save it. Note the identifier of the article (something like watson1953structure).
  4. In your markdown file, add the line bibliography: bibliography.bib to the header.
  5. In the body of the markdown file, add somewhere the citation: [@watson1953structure]. Knit the document and look at the output.
  6. What happens if you use @watson1953structure (without brackets), [-@watson1953structure] or [@watson1953structure, p. 1]?
  7. Go to https://www.zotero.org/styles and search for the citation style of “Nature”. Download it to your working directory. To the markdown header add the line csl: nature.csl (or whatever the file is called). Knit and look at the output.
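
For orientation, the two pieces might look roughly like this. The BibTeX entry is whatever Google Scholar gives you; the exact key and fields may differ:

@article{watson1953structure,
  title={Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid},
  author={Watson, James D and Crick, Francis HC},
  journal={Nature},
  volume={171},
  pages={737--738},
  year={1953}
}

And the corresponding Rmarkdown header:

---
title: "My document"
output: html_document
bibliography: bibliography.bib
csl: nature.csl
---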
Altschul, Stephen F, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. 1997. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25 (17): 3389–3402.