Do this before class

Read R4DS chapters 3.1-3.6, 5.1-5.5

Complete assignments Data wrangling and Data visualization (first two chapters of Introduction to the Tidyverse) at DataCamp.

Class activities

Start by creating a new R-project “Classroom” that you will use for your class activities. Some activities will require data or scripts from the repo Class_files, we therefore recommend that you clone this into a subfolder of your Classroom directory. Now create a new R Markdown document Class1.Rmd where you will do your work for this class.

Systembolaget’s assortment

Systembolaget’s assortment of beverages from 2019-10-30 is available in the file Class_files/systembolaget2019-10-30.csv. It is downloaded from Systembolaget’s public API and saved in csv-format by the script Class_files/Systembolaget.R. Load its contents by

# Define date when scraping took place - can then be easily changed.
date_systembolaget_scrape <- "2019-10-30"
library(tidyverse)
file_name <- paste0("systembolaget",date_systembolaget_scrape,".csv")
Sortiment_hela <- read_csv(file.path("Class_files", file_name))

Exercises (training of arrange, filter, mutate, select, %>%)

  • The variable Alkoholhalt (alcohol by volume) has been classified as character by read_delim, since it contains a percent sign. Convert it to numeric using mutate by first removing the percent sign (e.g. with gsub) and then transform with as.numeric.

  • A few wines are labelled as Röda - lägre alkoholhalt and Vita - lägre alkoholhalt instead of Rött vin (red wine) respektive Vitt vin (white wine) in the Varugrupp (group of products) column. Merge these wines into Rött vin and Vitt vin, respectively, e.g. by using mutate and ifelse.

  • What beverage has the highest PrisPerLiter? Display the answer (the Namn of the beverage) as dynamically coded in the text body of your .Rmd-document.

  • Create a new data frame Sortiment_ord with the regular product range (where SortimentText equals Ordinarie sortiment). Make a table (with kable from the knitr-library) of the 10 most expensive (PrisPerLiter) beverages from this range. Use select to select suitable columns for the table.

  • if you have not already done so, write the code from the previous excercise using a sequence of pipes (%>%).

Exercises (training of ggplot, geom_point, geom_line, facet_wrap)

For the regular product range in Sortiment_ord

  • Plot PrisPerLiter against Alkoholhalt, color the points by Varugrupp and consider using a log-scale for PrisPerLiter.
  • Plot PrisPerLiter (possibly on a log-scale) against Varugrupp. Consider coord_flip to improve readability.
  • For the groups of products c("Vitt vin", "Rött vin", "Rosévin", "Mousserande vin") of vintage (Argang) 2010-2019, plot PrisPerLiter against Argang. Try both using a facet for each group and coloring by group in the same facet.

Further excercises

  • Use your imagination and keep exploring the data.

Film events

The Stockholm international film festival takes place early November each year. In Class_files/Film_events_2018-11-07.csv you will find their event schedule for the 2018 edition scraped on 2018-11-07. The data were obtained by a query using the Stockholm Film Festival API for developers.

Exercises (training of arrange, filter, mutate, select, %>%)

  • What films are already sold out (for all screenings)?
  • What venue screens the most number of (unique) films?
  • Plot the proportion of sold out events for each day of the festival.

Olympic winter medals

The file Class_files/Winter_medals2019-10-30.csv contains the number of medals per country and olympic year at the winter olympics since 1980 together with the total population of the country. The data set is scraped from Wikipedia using the script Class_files/Winter_medals.R which contains more information, in particular on countries that has been split or joined during the period.

Load the file using

winter_medals <- read_csv("Class_files/Winter_medals2018-09-26.csv")

Exercises (training of arrange, filter, mutate, select, %>%)

  • Create a variable column medals_per_mill, the number of medals per million inhabitants.
  • Print a table of the 10 most successful countries, by medals_per_mill, during the 2018 Winter olympics.

Exercises (training of ggplot, geom_point, geom_line, facet_wrap)

  • Plot the total number of medals against year for Sweden, Norway and Finland in the same figure and separate the countries with a suitable aesthetic (see ?geom_point for a list of aesthetics geom_point understands).
  • Plot the number of Swedish gold, silver and bronze medals against year (bonus: color the points in gold/silver/bronze).
  • As previous excercise, but with a “facet” for each of Sweden, Norway and Finland.

Gapminder

Use ggplot to recreate (static versions) of some figures from Hans Rosling’s talks. Data is available in the package gapminder.