6 Twitter Streams

This overview shows an example script that streams tweets matching a set of keywords on a daily basis via crontab. The script collects the stream for the specified keywords and saves a new data frame at regular intervals so that no single file becomes too large to handle.

6.1 Presettings

# Cleans workspace environment of R:
rm(list=ls(all=T))

# Makes sure strings are never transformed to factor variables:
options(stringsAsFactors = F)

6.2 Libraries

if (!require("pacman", quietly=T)) install.packages("pacman")
pacman::p_load(rtweet, jsonlite, rjson, httr, RCurl, data.table, readr, dplyr)

6.3 Connect with Twitter App

The next lines of code build a connection with your Twitter app. If you have not created this token file yet, please do so as explained in the rtweet package documentation; otherwise this script will not work. A minimal sketch of creating such a file follows below.

# Loads the authentication key (token) for the Twitter API, which is stored
# in a local .rds file:
token <- readRDS("~/Path/Of/Your/File/twitter_token.rds")
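
If no token file exists yet, it can be created once from the credentials of your Twitter developer app and saved for later reuse. The sketch below assumes an rtweet version that still provides create_token(); the app name, keys, and file path are placeholders.

# One-time setup (sketch): create a token from your app credentials.
# All values below are placeholders from your Twitter developer app.
token <- rtweet::create_token(
  app             = "your_app_name",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET")

# Save the token so it can later be loaded with readRDS() as above:
saveRDS(token, "~/Path/Of/Your/File/twitter_token.rds")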

6.4 Data Collection

6.4.1 Step 1)

Define the keyword(s) you want to follow.

keywords <- c("#bundesrat")
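
Several terms can be tracked in the same stream by adding them to this vector; the lower-case variant below is only an illustration.

# Example (sketch): track more than one term in the same stream
keywords <- c("#bundesrat", "bundesrat")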

6.4.2 Step 2)

Set up the time frame of the stream, i.e., how long to stream before a data frame is saved, so that the data files stay small. This version collects data for 4 hours straight before starting a new data frame; repeating this 6 times within the script thus covers a full day of Tweets.

streamtime <- 4*60*60 # Stream for 4 hours before starting a new data frame
icount <- 6           # Number of 4-hour segments per day (6 x 4 hours = 24 hours)

6.4.3 Step 3)

Prepare the local directory to save the data frames.

# where is your directory?
mainDir <- "~/Data/Twitter/Streams/Bundesrat"

# Get today's date:
date <- Sys.Date()

# Create the directory for today's data if it does not already exist:
subDir <- paste0("bundesrat_", date)
if (!dir.exists(file.path(mainDir, subDir))) {
  dir.create(file.path(mainDir, subDir))
}
Sys.chmod(file.path(mainDir, subDir), mode = "777", use_umask = FALSE)

6.4.4 Step 4)

With the token loaded, the stream time defined, and a place to save the tweets prepared, data collection can be started.

# Stream tweets into one large JSON file per 4-hour segment of the day:
setwd(file.path(mainDir, subDir)) # set working directory to today's subdirectory

for(i in 1:icount){
  part <- i 
  cat("Starting stream segment ", i, " of ", icount, "! Day is: ", as.character(date), 
      " and part is: ", part, "\n")
  stream_tweets(q = keywords, timeout = streamtime, 
                parse = FALSE, verbose = FALSE,
                file_name = paste0("Bundesrat_Stream_",date, "_part_",part, 
                                   ".json"), 
                token = token)
  cat("One Quarter Day has been Streamed...\n")
  Sys.chmod(paste0("Bundesrat_Stream_",date, "_part_",part, ".json"), 
            mode = "777", use_umask = FALSE)
  Sys.sleep(1)
}

6.4.5 Step 5)

In a last step, combine all JSON files into a single .rds file.

# Get a list of all JSON files with the tweets collected today:
path <- getwd()
filenames <- list.files(path, pattern = "\\.json$", full.names = T)

tweetsdf <- data.frame()

for(j in filenames){
  # Parse one JSON stream file and append it to the combined data frame:
  tmp <- parse_stream(j)
  tweetsdf <- rbind(tweetsdf, as.data.frame(tmp))
}

saveRDS(tweetsdf, paste0("bundesrat_tweets_", date,".rds"))
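
As a quick sanity check, the combined file can be read back in and inspected; this assumes the working directory is still the subdirectory created above.

# Quick check (sketch): load the combined file and count the collected tweets
tweets <- readRDS(paste0("bundesrat_tweets_", date, ".rds"))
nrow(tweets) # number of tweets collected on this day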

Similar to the user data collection, you can use crontab to schedule this stream collection to run automatically, e.g., once per day.
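
For example, assuming the script above is saved as a standalone R file and Rscript is available on the machine, a crontab entry along the following lines would start the collection every day at midnight. The file paths are placeholders and need to be adapted to your system.

# Edit the crontab with: crontab -e
# Run the stream script every day at 00:00 and write any output to a log file:
0 0 * * * Rscript /path/to/stream_bundesrat.R >> /path/to/stream_bundesrat.log 2>&1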