x <- 5
print(x)
## [1] 5
Today's goal is to reinforce the previous practicals with R, answer questions, and check your understanding. We will also do some basic example statistics.
Exercise 5.1.
Note: Store everything you do in Rmarkdown, that is the point of the exercise. Add your comments!
Create a variable x with value 5. Print x using the print() function.

The famous iris data set is included in R. It contains measurements of the sepal and petal length and width of three species of iris flowers. The data set was collected by the botanist Edgar Anderson, and it has since become a standard data set in statistics and machine learning because of its simplicity and availability. It was made famous by the statistician Ronald Fisher, who used it to illustrate linear discriminant analysis in 1936.
We load the data set as follows:
data(iris)
It should now appear on the right-hand side of the RStudio window in the “Environment” tab. You can click on the name of the data set to see its contents or use the View() function.
Exercise 5.2
Explore the iris data set:

- look at the column names (the names() function)
- select a single column (the $ operator)
- view the first few rows (the head() function)
- check what type of object it is (the class() or typeof() function)
- check how many rows and columns it has (the nrow() or dim() functions)
- use the summary() function to get a summary of the data set.

If you get stuck, a sketch of possible commands follows below.
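One possible way of exploring the data set (all plain base R commands):

names(iris)          # column names
head(iris)           # first six rows
iris$Sepal.Length    # a single column, selected with $
class(iris)          # type of the object ("data.frame")
nrow(iris)           # number of rows
dim(iris)            # number of rows and columns
summary(iris)        # per-column summary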
First, a simple t-test to check whether the Sepal.Width of the setosa species differs significantly from that of the versicolor species.
# select only values from the Sepal.Width column where the Species is setosa
setosa_sw <- iris$Sepal.Width[ iris$Species == "setosa" ]
# same, but for versicolor and using the "subset" function
versicolor_sw <- subset(iris$Sepal.Width, iris$Species == "versicolor")
t.test(setosa_sw, versicolor_sw)
##
## Welch Two Sample t-test
##
## data: setosa_sw and versicolor_sw
## t = 9.455, df = 94.698, p-value = 2.484e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5198348 0.7961652
## sample estimates:
## mean of x mean of y
## 3.428 2.770
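As a side note, the same test can be written more compactly with the formula interface (a sketch; iris_sub is a name we introduce here, and droplevels() removes the unused virginica level so that only two groups remain):

iris_sub <- subset(iris, Species %in% c("setosa", "versicolor"))
t.test(Sepal.Width ~ Species, data = droplevels(iris_sub))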
Exercise 5.3
A pairwise t-test between each group:
pairwise.t.test(iris$Sepal.Width, iris$Species)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Sepal.Width and iris$Species
##
## setosa versicolor
## versicolor < 2e-16 -
## virginica 9.1e-10 0.0031
##
## P value adjustment method: holm
Warning: this is not a paired t-test; a paired t-test is something different (used when the data come in pairs!).
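For illustration only, with made-up before/after values measured on the same five subjects, a paired t-test would look like this:

before <- c(5.1, 4.9, 5.4, 5.0, 5.3)  # made-up first measurement
after  <- c(5.3, 5.0, 5.6, 5.1, 5.6)  # made-up second measurement, same subjects
t.test(before, after, paired = TRUE)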
ANOVA is like a t-test, but for multiple groups and multiple variables. First, let us look at the groups with a boxplot:
boxplot(Sepal.Width ~ Species, data=iris)
Next, a simple ANOVA to check whether Species is a significant covariate.
iris_anova <- aov(Sepal.Width ~ Species, data=iris)
summary(iris_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also use the broom package to make a slightly nicer table with the results:
library(broom)
tidy(iris_anova)
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Species 2 11.3 5.67 49.2 4.49e-17
## 2 Residuals 147 17.0 0.115 NA NA
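The broom package also provides glance() for a one-row, model-level summary (shown here as an additional option, assuming your broom version supports aov objects):

glance(iris_anova)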
Tukey's test is a post-hoc test that can be used to compare all pairs of groups. It is essentially a pairwise t-test with a correction for multiple testing.
TukeyHSD(iris_anova, "Species")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
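The Tukey confidence intervals can also be plotted directly with the base plot() function:

plot(TukeyHSD(iris_anova, "Species"))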
Next, we test for a correlation between sepal width and sepal length:
cor.test(iris$Sepal.Width, iris$Sepal.Length)
##
## Pearson's product-moment correlation
##
## data: iris$Sepal.Width and iris$Sepal.Length
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
The cor.test function runs a test for correlation. Different tests are available, but the default is the Pearson correlation test. A common alternative is the Spearman correlation test, which is a non-parametric test. The Spearman correlation test is more robust to outliers and non-normality, but it is less powerful than the Pearson correlation test when the data are normally distributed.
cor.test(iris$Sepal.Width, iris$Sepal.Length, method="spearman")
## Warning in cor.test.default(iris$Sepal.Width, iris$Sepal.Length, method = "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: iris$Sepal.Width and iris$Sepal.Length
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1667777
We will start with very simple commands to visualize the data. R has (basically) two main graphics systems: the simple base graphics and the more advanced ggplot2 system. We will start with the base graphics and switch to ggplot2 later. There are reasons to know both systems: base graphics are simple and quick to use, while ggplot2 is more powerful and allows for more complex visualizations.
plot(iris$Sepal.Length, iris$Sepal.Width)
Now we color the points according to the species of the flower. The following works only because iris$Species is a factor. Factors look like strings, but underneath they are integers. The integers in the plot function correspond to colors: 1 is black, 2 is red, etc. The pch parameter sets the shape of the points.
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=19)
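Base graphics do not add a legend automatically; we can add one manually after the plot() call (a sketch: levels() returns the species in the same order as the underlying integer color codes 1:3):

# must be run directly after the plot() call above
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)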
However, at this point it is much more convenient to use ggplot2:
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()
The first part sets up the plotting object. Then, we “add” geoms, that is, geometrical representations of the data.
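For example, nothing stops us from adding a second geom, here a per-species regression line (a sketch using geom_smooth(); the lines are split by species because color=Species is part of the aesthetics):

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_point() +
  geom_smooth(method="lm")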
Here is a trick to change the default theme:
theme_set(theme_minimal())
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()
PCA is a method to reduce the dimensionality of the data (a “dimensionality reduction method”). It is very useful for visualizing the data, for finding out which variables are most important, and for spotting outliers.
pca <- prcomp(iris[,1:4])
The result of the PCA is a list. We can see the contents of the list with the str() function:
str(pca)
## List of 5
## $ sdev : num [1:4] 2.056 0.493 0.28 0.154
## $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ scale : logi FALSE
## $ x : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
We can access the objects in the list with the $ operator:
pca$center
pca$scale
pca$rotation
pca$x
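In addition, summary(pca) shows how much of the total variance each principal component explains:

summary(pca)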
The pca$x object is what really interests us. It contains the coordinates of the data in the PCA space. We can plot the first two principal components as follows:
plot(pca$x[,1],pca$x[,2])
However, in order to use ggplot2, we must convert it to a data frame:
pca_df <- as.data.frame(pca$x)
## add the species column
pca_df$Species <- iris$Species
ggplot(pca_df, aes(x=PC1, y=PC2, color=Species)) + geom_point()
Exercise 5.5
Create a data frame which contains both the iris data (the first four columns) and the PCA coordinates (the first two columns of pca$x), as well as the species column. Make plots which combine a principal component (e.g. PC1) and one of the variables. Do you see any correlation between the variables and the principal components?
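If you get stuck, here is one possible solution sketch (iris_pca is a name we introduce here; any of the four measurement columns can be plotted against PC1):

# combine the measurements, the first two PCs and the species column
iris_pca <- cbind(iris[, 1:4], pca$x[, 1:2])
iris_pca$Species <- iris$Species
ggplot(iris_pca, aes(x=PC1, y=Petal.Length, color=Species)) + geom_point()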
The basic regression equation:
\[y = \beta_0 + \beta_1 x + \epsilon\]
Here is how we do it in R. We use an example data set from R called “state.x77”. It contains various statistics about the US states from 1977.
library(datasets)
## convert a matrix to a data frame
df <- as.data.frame(state.x77)
ggplot(df, aes(x=Income, y=Illiteracy)) + geom_point()
Now we run a linear regression:
lm_fit <- lm(Illiteracy ~ Income, data=df)
summary(lm_fit)
##
## Call:
## lm(formula = Illiteracy ~ Income, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79927 -0.46481 -0.09793 0.34011 1.24378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0932014 0.5765787 5.365 2.3e-06 ***
## Income -0.0004336 0.0001288 -3.367 0.00151 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5539 on 48 degrees of freedom
## Multiple R-squared: 0.191, Adjusted R-squared: 0.1742
## F-statistic: 11.34 on 1 and 48 DF, p-value: 0.001505
tidy(lm_fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.09 0.577 5.36 0.00000230
## 2 Income -0.000434 0.000129 -3.37 0.00151
We can also plot the regression line:
ggplot(df, aes(x=Income, y=Illiteracy)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
Unlike in many other statistics programs, building complex regression models in R is very simple.
df <- as.data.frame(state.x77)
# clean up the column names. You can also use the function
# `clean_names` from the package `janitor` to do it automatically.
colnames(df) <- c("pop", "income", "illiteracy",
"life_exp", "murder", "hs_grad", "frost", "area")
# alternative: df <- clean_names(df)
lm_fit <- lm(life_exp ~ income + illiteracy + murder, data=df)
summary(lm_fit)
##
## Call:
## lm(formula = life_exp ~ income + illiteracy + murder, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.49981 -0.61786 -0.01735 0.48470 2.17123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.1638497 1.0990196 64.752 < 2e-16 ***
## income 0.0003811 0.0002173 1.754 0.0862 .
## illiteracy 0.0368692 0.2997928 0.123 0.9027
## murder -0.2736317 0.0457495 -5.981 3.09e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8347 on 46 degrees of freedom
## Multiple R-squared: 0.6371, Adjusted R-squared: 0.6134
## F-statistic: 26.92 on 3 and 46 DF, p-value: 3.336e-10
ANOVA is practically equivalent to a simple regression model in which x is categorical.
iris_anova <- aov(Sepal.Width ~ Species, data=iris)
summary(iris_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
iris_lm <- lm(Sepal.Width ~ Species, data=iris)
summary(iris_lm)
##
## Call:
## lm(formula = Sepal.Width ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.128 -0.228 0.026 0.226 0.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42800 0.04804 71.359 < 2e-16 ***
## Speciesversicolor -0.65800 0.06794 -9.685 < 2e-16 ***
## Speciesvirginica -0.45400 0.06794 -6.683 4.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3397 on 147 degrees of freedom
## Multiple R-squared: 0.4008, Adjusted R-squared: 0.3926
## F-statistic: 49.16 on 2 and 147 DF, p-value: < 2.2e-16
anova(iris_lm)
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.345 5.6725 49.16 < 2.2e-16 ***
## Residuals 147 16.962 0.1154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary(iris_lm) output is not exactly what we want: rather than showing whether there are differences between the groups, it checks whether the individual groups differ from the first group (setosa). We need to use the function anova on the regression model to actually get the same results as with aov. This function is pretty general and we can apply it in many cases.
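For example, anova() can compare two nested regression models to test whether additional covariates improve the fit (a sketch reusing the state.x77 data frame df with the cleaned column names from above):

fit_small <- lm(life_exp ~ murder, data = df)
fit_full  <- lm(life_exp ~ murder + income + illiteracy, data = df)
anova(fit_small, fit_full)  # F-test: does the larger model fit better?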
The bibliography must be placed at the end of the document. There are basically two ways of dealing with a bibliography:

- Manual: you simply type the citation in the text (Weiner et al. 2022) and then place the corresponding bibliographic entry at the end of the document. This is easier in the very, very short run.
- Use a bibliography manager. This is easy in Rmarkdown (it is also possible in Jupyter).
Example: Altschul paper (Altschul et al. 1997)

- Create a new, empty bibliography file (e.g. bibliography.bib).
- Look up the article in Google Scholar, click on the citation symbol (") and then on “BibTeX”.
- Copy the contents to your newly created bibliography file and save it (an example entry is shown after this list). Note what the identifier of the article is (something like watson1953structure).
- Add bibliography: bibliography.bib to the header.
- Cite the article in the text with [@watson1953structure]. Knit the document and look at the output.
- What happens if you write @watson1953structure (without brackets), [-@watson1953structure] or [@watson1953structure; page 1]?
- Download a citation style file and add csl: nature.csl (or whatever the file is called) to the header. Knit and look at the output.
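For reference, the BibTeX entry copied from Google Scholar looks roughly like this (the exact fields may vary):

@article{watson1953structure,
  title={Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid},
  author={Watson, James D and Crick, Francis HC},
  journal={Nature},
  volume={171},
  number={4356},
  pages={737--738},
  year={1953},
  publisher={Nature Publishing Group}
}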