Day 9: Text data: Regular expressions

During class

A csv-reader

Combine readLines and str_split to a simple .csv-reader function with header my_csv_reader <- function(file) that reads a .csv-file and returns a matrix of strings with the contents of a file.

HTML tables

The string table contains a simple table in HTML,

table <- "<table>
  <tr>
    <th>Förnamn</th>
    <th>Efternamn</th> 
    <th>Ålder</th>
  </tr>
  <tr>
    <td>Kalle</td>
    <td>Karlsson</td> 
    <td>25</td>
  </tr>
  <tr>
    <td>Lisa</td>
    <td>Larsson</td> 
    <td>17</td>
  </tr>
</table>"

that a web-browser renders as

Förnamn	Efternamn	Ålder
Kalle	Karlsson	25
Lisa	Larsson	17

Use regex to extract the vector

from table.

Motions of the Riksdag

Data from this exercise were obtained from the Open data of the Swedish Riksdag and contains a list of motions proposed by members of the Riksdag during 2014-2017 (scroll a bit down to find the topic “Motion - Motioner från riksdagens ledamöter.”). Read it by

motions <- read_csv("Class_files/mot-2014-2017.csv", 
                     col_names = c("hangar_id", "dok_id", "rm", "beteckning", 
                                   "typ", "subtyp", "doktyp", "dokumentnamn",  "organ", 
                                   "mottagare", "nummer", "datum", "systemdatum", 
                                   "titel", "subtitel", "status"))

There are two sources of information on the political party (S, V, Mp, M, L, C, Kd or Sd) behind the motion, in dokumentnamn and in subtitel. Use e.g. str_extract with suitable regex to extract party from both sources.

Note that a motion may be proposed by a group of members, possibly of different parties. Resolve or ignore this as you like.

Plot the monthly number of motions colored acording to party.

Sentiment of the people of Hemsö

Strindberg´s Hemsöborna can be downloaded from Project Gutenberg (in Swedish) with

text <- readLines("http://www.gutenberg.org/cache/epub/30078/pg30078.txt", encoding = "UTF-8") %>% .[93:5129]

Convert the text to a data.frame with the variable word containing all words of the text in lower case and with any punctuation marks removed. Add the variables nr = 1:n() and chapter = cumsum(word == "kapitlet"). When analysing text, so called stop-words are usually removed. A list of Swedish stop-words can be downloaded by

stopwords <- read_table("https://raw.githubusercontent.com/stopwords-iso/stopwords-sv/master/stopwords-sv.txt",
                       col_names = "word")

Remove the stop-words from the text with anti_join.

Sentiment analysis is a way of quantifying positive/negative emotions in a text, you can find specialised course on DataCamp if you are interested. This is generally done by a sentiment lexicon that contains a list of words quantified as positive or negative, a Swedish lexicon can be found at Språkdatabanken and downloaded by

sentiment <- read_csv("https://svn.spraakdata.gu.se/sb-arkiv/pub/lmf/sentimentlex/sentimentlex.csv")

Join the lexicon with the text with an inner_join and try to visualise how the sentiment of the text changes as a function of chapter or nr.

Note: The statistical analysis of text has become rather popular, e.g., in marketing or sociology, and is sometime also known as NLP (Natural languae processing). More information about how to do this in R can be found in, e.g., in the book Text Mining with R by Silge and Robinson. For those interested this could also be a nice topic for your project work in the course. Here is one blog post about Donald Trump’s tweets, which back in 2016 made it to the news.