8 RSS Feeds

This documentation shows you how to set up a web scraper that uses an RSS feed to gather data. We recommend using RSS feeds whenever they are available, because of the clean structure they provide. With one working script you can build web scrapers for many other RSS feeds from other pages. The only things you will have to change are the URL and perhaps a few keywords regarding the output of the feed; the overall structure stays the same, as all RSS feeds use a similar code structure.
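As an illustration, that shared structure is a generic RSS 2.0 skeleton (the URLs below are placeholders); the fields title, link, description, pubDate and category inside each item are exactly the ones we extract later:

```xml
<rss version="2.0">
  <channel>
    <title>Feed title</title>
    <link>https://example.org/</link>
    <item>
      <title>Article title</title>
      <link>https://example.org/article-1</link>
      <description>Short summary of the article</description>
      <pubDate>Tue, 05 May 2020 10:30:00 +0200</pubDate>
      <category>Medienmitteilung</category>
    </item>
    <!-- further <item> elements -->
  </channel>
</rss>
```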

One thing we want to mention before we document the code: just because you can crawl the web doesn’t mean you always should. There may be terms and conditions that explicitly forbid it, and be aware that website operators have methods to detect and block scrapers. Second, be kind to the hosting server and try to minimize the load you put on it: a web scraper without timeouts can put quite a load on a server. A good guideline is to build your scraper so that it clicks or loads an element at most once per second. For RSS scrapers this is easy to achieve, as you only load the feed once in a while, depending on how frequently new items are published.
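The once-per-second guideline can be enforced with a small wrapper around whatever request function you use (a base-R sketch; `polite_call` is a hypothetical helper name, not part of any package):

```r
# Run f(...) and then sleep long enough that at least `min_gap`
# seconds pass before the next call can start.
polite_call <- function(f, ..., min_gap = 1) {
  t0 <- Sys.time()
  res <- f(...)
  elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  if (elapsed < min_gap) Sys.sleep(min_gap - elapsed)
  res
}
```

Wrapping a page load such as `read_html` in `polite_call` ensures the scraper never hits the server more than once per second, even when a page loads quickly.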

8.1 Presettings

# Cleans workspace environment of R:
rm(list=ls(all=T))

# Makes sure strings are never transformed to factor variables:
options(stringsAsFactors = F)

8.2 Libraries

To build a working web scraper you normally need a few libraries. Packages like RCurl, httr, XML and rvest contain the tools to access webpages and copy data from them; other packages are used to parse and save the data in your desired format. With these packages installed you should have no problem scraping data from pages that offer an RSS feed.

if (!require("pacman", quietly=T)) install.packages("pacman")
pacman::p_load(httr, RCurl, rvest, dplyr, data.table, readr, tidyr, stringr, 
               jsonlite, rjson, XML)

8.3 Data Collection

8.3.1 Step 1)

Indicate the directory you want to save the RSS feed in and specify the URL of the feed in question. In this example we scrape the RSS feed of the Swiss Federal Council. Make sure to set your system locale to en_US.UTF-8, as most RSS feeds use English month and day names in their timestamps. This will come in handy later when parsing the publication dates.

# Set directory
setwd("~/Data/RSS/Akteure/admin_ch_Medienmitteilungen")

# Save URL of the rss feed as a variable
urlrss <- c(paste0("https://www.newsd.admin.ch/newsd/feeds/rss?lang=de",
                   "&org-nr=1&topic=&keyword=&offer-nr=&catalogueElement=",
                   "&kind=M,R&start_date="))

# Set some options for Curl and Systemlanguage 
myOpts <- curlOptions(connecttimeout = 120)
Sys.setlocale("LC_TIME", "en_US.UTF-8")
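The locale matters because RSS publication dates use English month and day names in the RFC 822 style. A quick check (the date string below is a made-up example):

```r
# RFC-822-style pubDate as found in RSS feeds
d <- "Tue, 05 May 2020 10:30:00 +0200"
Sys.setlocale("LC_TIME", "en_US.UTF-8")  # may fail on systems without this locale
as.Date(d, format = "%a, %d %b %Y")      # "2020-05-05"
```

In a non-English locale the same `as.Date` call would return NA, because "Tue" and "May" could not be matched.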

8.3.2 Step 2)

Now you can start with the actual scraping. In this example we build code that gets the RSS items from the last two weeks and checks which ones we already have in our folder. Hence, if this code is executed every week, it downloads only the items from the previous week and stops as soon as it reaches the last item it scraped the week before. The first step is to initialize the vectors that will gather the data from each RSS item. Then we ask the system for today's date, subtract 14 days, and store the result in a variable.

#Vectors
Titel <- c()
DatumTime <- c()
Link  <- c()
Beschreibung <- c()
Kategorie <- c()

#Date from two weeks ago 
pasttwoweek <- Sys.Date()-14

8.3.3 Step 3)

This is the part where we actually copy the data of the RSS feed from the webpage. We do this with the httr package and its GET command, which asks the server for all RSS items from the last two weeks. In the same pipeline we tell R to read the XML structure at that URL. The result is saved as a variable which we then parse with the XML parser.

doc <- httr::GET(paste0(urlrss, pasttwoweek, "&end_date=")) %>% 
  xml2::read_xml()

Sys.sleep(1)
# convert document to XML tree in R
doc <- xmlParse(doc)

8.3.4 Step 4)

Now we can extract the information from the XML file. First, we extract the title of each article and process it into a clean format. Then we extract the publishing time, the category name of the article, and its link. Finally, we extract the short description of the article and format it as well. Additionally, we create variables containing the name of the publisher, its short form, and the source URL. We append all this information to the vectors we created before.

## find the names of the item nodes
# unique(xpathSApply(doc,'//item/*',xmlName, full=TRUE))
## Extract some information from each node in the rss feed
aTitel<- xpathSApply(doc,'//item/title',xmlValue)
aTitel <- gsub("\r?\n|\r|\t", " ", aTitel)
aTitel <- trimws(aTitel, which = c("both"))
aTitel<- as.character(aTitel)
aDatumTime <- xpathSApply(doc,'//item/pubDate',xmlValue)
aKategorie <- xpathSApply(doc,'//item/category',xmlValue)
aLink <- xpathSApply(doc,'//item/link',xmlValue)
aLink <- as.character(aLink)
aBeschreibung <- xpathSApply(doc,'//item/description',xmlValue)
aBeschreibung <- gsub("\r?\n|\r|\t", " ", aBeschreibung)
aBeschreibung <- trimws(aBeschreibung, which = c("both"))
aBeschreibung <- as.character(aBeschreibung)
Akteur <- "Schweizerische Eidgenossenschaft"
Kürzel <- "admin.ch"
Quelle <- 
  "https://www.admin.ch/gov/de/start/dokumentation/medienmitteilungen.html"


Titel <- c(Titel, aTitel)
DatumTime <- c(DatumTime, aDatumTime)
Datum <- as.Date(DatumTime, "%a, %d %b %Y")
Link  <- c(Link, aLink)
Beschreibung <- c(Beschreibung, aBeschreibung)
Kategorie <- c(Kategorie, aKategorie)
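The gsub/trimws cleanup applied to aTitel and aBeschreibung above can be tried in isolation (the input strings are made up):

```r
# Replace line breaks and tabs with spaces, then trim both ends
raw   <- c("  Medienmitteilung\ndes Bundesrates\t", "\tZweiter Titel ")
clean <- trimws(gsub("\r?\n|\r|\t", " ", raw), which = "both")
clean  # "Medienmitteilung des Bundesrates" "Zweiter Titel"
```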

8.3.5 Step 5)

In this example we can access the full articles by following the links in the RSS feed. If this is not possible due to restricted access, you can skip this part for your own work. First, we load a file that records the last RSS item we downloaded. If this script has never been run before, the if-else clause creates that file (and a counter file) for you.

# Only execute once
oldnews <- NULL
if(file.exists("admin_rss_latest.csv")){
  oldnews <- read_csv("admin_rss_latest.csv")
} else {
    Datum <- c("1900-01-01", "1900-01-01")
    Datum <- as.Date(Datum, "%Y-%m-%d")

    Titel <- c("TextTextText", "TextTextText")
    Akteur <- c("Schweizerische Eidgenossenschaft", 
                "Schweizerische Eidgenossenschaft")
    Kürzel <- c("admin.ch", "admin.ch")
    data <- data.frame(Datum,  Kürzel, Akteur, Titel)
    fwrite(data, "admin_rss_latest.csv")
    old <- 0
    old <- as.data.frame(old)
    fwrite(old, "Counter_admin_rss.csv")
    oldnews <- read_csv("admin_rss_latest.csv")
}

old <- fread("Counter_admin_rss.csv")

Then we create a new vector for the text we will copy and set a pause counter for a load minimizer used later. We copy the number of files we already have into a variable called old. At the same time we extract the title of the last item we got from the RSS feed. Finally, we take the length of the link vector to determine the maximum number of iterations for our loop.

tmptex <- c()
pause <- 1
j <- as.numeric(old[[1]][1])
numerator <- as.numeric(old[[1]][1])
limit <- as.character(oldnews[1,4])
tmp3 <- data.frame()
Lange <- c(1:length(Link))

Finally, we run the loop that accesses each link's page, copies the text from it, and combines all vectors into a one-row data frame with all objects as columns, which is written to disk as a JSON file. The key here is that the loop stops as soon as it recognizes that the item it is currently processing is the last item written during the previous execution of this script. Be aware that this part can vary quite a bit depending on the processing needed to obtain nicely formatted text.

for(i in Lange){
  tryCatch({
  lin <- Link[i]
  pg <- read_html(lin)
  Tex <- pg %>%                                #Get Text from Page 
    html_nodes(xpath = '//p | //strong') %>% html_text()
  Tex <- gsub("\r?\n|\r|\t", " ", Tex)         #Removes all \n's
  Tex <- trimws(Tex, which = c("both"))        #Remove ws at beginning/end
  if(length(Tex) >= 10){                       
    #Remove unnecessary parts of text for two different types of text files
    #available on the site and decide which type it is.
  Tex <- Tex[1:(length(Tex)-3)]
  }
  Tex <- Tex[2:(length(Tex)-5)]
  len <- 1:(length(Tex)-1)
  remall <- c()
  for(l in len){
    k <- l+1
  if(Tex[l]==Tex[k]){
      rem <- l
      remall <- c(remall, rem )
    }
  }
  if(length(remall) > 0){ Tex <- Tex[-remall] }
  #Change multiple whitespaces to one whitespace 
  Tex <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", Tex, perl = T) 
  #Collapse text parts into one string
  Tex <- paste(Tex, sep=" ", collapse=" ")                     
  #Drop the first 19 characters of the text as they contain
  #the date and other information we store elsewhere
  Tex <- substring(Tex, 20)                                    
  Text <- Tex

  tmp <- data.frame(Datum[i], Kürzel, Akteur, Titel[i], Beschreibung[i], 
                    Text, Link[i], Quelle)
  colnames(tmp) <- c("Datum", "Kürzel", "Akteur", "Titel", "Beschreibung", 
                     "Text", "Link", "Quelle")
  tmp2 <-data.frame(Datum[i], Kürzel, Akteur, Titel[i]) 
  colnames(tmp2) <- c("Datum", "Kürzel", "Akteur", "Titel")
  tmp3 <-rbind(tmp3, tmp2)
  check <- Titel[i]
  if(check == limit){
    break  
    # Here the loop checks whether the file is still a new one or already 
    # the last file from the last execution 
  }
  j <- j + 1
  mytime <- Datum[i]
  myfile <- file.path(getwd(), paste0("admin_ch_medienmitteilungen_", 
                                      mytime, "_ID_", j, ".json"))
  write_json(tmp, path = myfile)
  
  pause <- pause + 1
  if(pause %% 1000 == 0) {Sys.sleep(125)}
  Sys.sleep(1)
  
  }, error=function(e){cat("ERROR: ", conditionMessage(e), 
                           " Error Code on page: ", i, "\n")})
}
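The inner for-loop that drops consecutive duplicate paragraphs can also be written as a single vectorized comparison, which additionally avoids the edge case where no duplicates are found (a base-R sketch with made-up input):

```r
Tex <- c("Lead", "Lead", "Body", "Body", "End")
# keep an element if it differs from its predecessor; the first is always kept
keep <- c(TRUE, Tex[-1] != Tex[-length(Tex)])
Tex  <- Tex[keep]
Tex  # "Lead" "Body" "End"
```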

if(j > numerator) {
  j <- as.data.frame(j)
  fwrite(j, "Counter_admin_rss.csv")
  fwrite(tmp3,"admin_rss_latest.csv")
}
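Since the script is designed to be executed periodically, it can be scheduled with cron (a sketch; the Rscript path, script location and schedule are assumptions you will need to adapt):

```
# m h dom mon dow  command  -- run the scraper every Monday at 06:00
0 6 * * 1  /usr/local/bin/Rscript /home/user/scrape_admin_rss.R >> /home/user/scrape_admin_rss.log 2>&1
```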