Summary

The final part of the course involves an individual project, taking shape as a data blog post. This should illustrate an issue/problem using an unique data set collected by yourself and at the same time illustrate the use of tools taught in this course. Deadline for the project is 2020-01-15 at 18:00. Hand-in occurs by raising an issue with the title “Project ready for grading” in your PR_<github_username> repository. Shortly after the deadline, we will clone all project repos with such an issue to a local installation. This version will count as your hand-in version. On 2020-01-17 the projects will be presented orally (5 minutes presentation) - presence is compulsory for the session you are presenting in, so please reserve that day in your calendar.

Inspiration

The below blog posts could be viewed as inspiration or to give a rough idea of the amount of work expected in the projects.

Data sources: During the course, you were introduced to a lot of possible data sources. Additional public web based data sources could, e.g., be the Stockholm Open Data Portal or the Öppna Data site.

Details

The project work has the following elements:

  1. find data out in the wild - this can be an open-access SQL database, API data1, web scraping data, personal surveillance data (e.g., running watch, log-files), data collected as part of a hobby, … The raw data should not be of sensitive nature (data protection!) and should be accessible and uploadable to Github without violating any copyright or access rights.
  2. determine a good story you can tell based on the data, e.g., a specific hypothesis you want to investigate, a cool visualization of numbers, a data journalism type of story, an educative #rstats post, something which might interest your fellow students. Your post can be about a serious matter, but it can also be a not so serious matter. However, make it clear before writing who is your intended readership (general public, fellow B.Sc. students, R users, ornotologists, …)
  3. read the data, wrangle the data, visualize the data, make simple statistical summaries and interpret the results in accordance with the selected story you chose in b.
  4. Write a story worthwhile reading for the selected target audience - it can be written in Swedish or English. The story needs to be written as a reproducible R-Markdown document with HTML backend (see technical details) and should not be excessively long. As a rough guideline: Between 1000-1500 words in the text, no more than 7 figures. No more than 7 visible code chunks (if you decide to have any at all). Note: You can use the wordcountaddin for RStudio to count the words in your report.
  5. Create a 5 minute presentation about your work to present to your fellow students on 2020-01-17. The main aim of your presentation is to convince your fellow students that they should read your blog post. A template for this will be made available within the next days.

The biggest challenge of the project will be to be realistic about what you can achieve within the given deadline. Once you have an estimate of how much that could be, take 50% of that and you are still likely to be busy. Make sure you have a working project early on and then scale up iteratively, so you’re always ready. Start early.

Technical details

For every student will get create a private PR_<username> GitHub repository as part of the mt5013-ht19 organisation, which only you and the teacher’s of the course have read/write access to (similar setup as your HW_<username> repo) You should make sub-directories Data, R, Report and Presentation and store all your files there. At the project deadline we will pull all repos, which have an issue “Project work submitted”.

At submission, your repo should at least contain the following files:

PR_jensjensen/Report/report.Rmd
PR_jensjensen/Report/report.html
PR_jensjensen/Presentation/presentation.Rmd
PR_jensjensen/Presentation/presentation.html

where jensjensen is to be substituted with your GitHub user name. Note: It’s important to pick exactly the filenames as above, since we will extract report and presentation automatically from your repository. Furthermore, ensure that any support files like data files, graphics, etc. which are needed to compile the .Rmd documents. One exception are data aquired by using private API-keys. To this end, make a R/query_data.R script, which imports the API key stored somewhere outside the git, does all the work and finally stores the data using save in the Data directory. Your R Markdown report should then access the data by using load. See HW_data/swedavia_api.R folder for an example.

The data remain private as part of the PW_<username> repository, but your project report HTML-file will be made accessible to the teachers and students of the course.

In contrast to the homework exercises we use HTML as backend, see e.g. Section 3.1 of the R Markdown: The definite guide for options to customize. In order to ensure portability of the HTML reports please use the following as part of your YAML header:

----
title: The snappy title of your project
author: Jens Jensen <jens.jensen@student.su.se>
date: 2020-01-15
output:
  html_document:
    self_contained: true
    toc: true
    toc_depth: 2
---

Additional note: It’s possible that we succeed in creating a project repo template, which already contains some of the necessary files and directories.

Final note: For the presentation the data wrangling steps do not need to be repeated. One can either import all data generated by report.Rmd using load or use some other type of caching.

Grading

The project will be graded based on the following five dimensions, which have equal weight:

  1. Technical difficulty of the project (e.g. how hard to get the data imported, how much time needed to wrangle, how advanced are some of the methods to get statistical summaries, how original is the project idea)
  2. Coding style and reproducibility of the Rmd file. This includes an assessment of, whether problems were solved tidyverse style.
  3. Quality of the visualisations and their interpretation
  4. Readability of the project report (is the story concise, are the aims clear, is the readership happy, spelling & grammar). In particular: Less is sometimes more.
  5. Quality of the presentation (slides, snore factor, staying in time, …)

Lycyka till!


  1. If you want to use SCB data extractable by their web interface, please us the pxweb package for this instead.