5 Twitter Users

This overview presents an example script that scrapes tweets from a Twitter account on a daily basis via crontab. The script checks whether the collected tweets are newer than the last tweet written to the database before adding the ones not yet stored. This makes it possible to look for new tweets on a regular basis without being logged in the entire time. The advantage of this method is that we do not risk losing data through crashes.

5.1 Preliminary Settings

# Cleans workspace environment of R:
rm(list=ls(all=T))

# Makes sure strings are never transformed to factor variables:
options(stringsAsFactors = F)

5.2 Libraries

if (!require("pacman", quietly=T)) install.packages("pacman")
pacman::p_load(rtweet, jsonlite, rjson, httr, RCurl, data.table, readr, dplyr, plyr)

5.3 Connect with Twitter App

The next lines of code establish a connection with your Twitter app. If you have not set up the token file yet, please do so as explained in the rtweet package documentation; otherwise this script will not work.

# Loads the authentication key (token) for the Twitter API, which is stored
# in an RDS file:
token <- readRDS("~/Path/Of/Your/File/twitter_token.rds")
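If the token file does not exist yet, it can be created once with rtweet and saved to the same path. This is only a sketch for rtweet versions before 1.0; the app name and the four credential strings below are placeholders for the values from your own Twitter developer account:

```r
# One-time setup: creates the API token and stores it on disk.
# All credentials below are placeholders for your own app's values:
library(rtweet)
twitter_token <- create_token(
  app             = "your_app_name",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET")
saveRDS(twitter_token, "~/Path/Of/Your/File/twitter_token.rds")
```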

5.4 Data Collection

Now we can set up our data collection process. First, we need to get the timelines of the actor or actors we are interested in. Be aware that there is a limit of 3200 tweets per account and a rate limit of about 15'000 tweets per hour that you can load from your Twitter app (rate limiting). After we have stored the timeline in the object ttl, we check whether there is already a lookup file with the newest tweet collected so far. If there is no lookup file, we create an empty one. Then we load all new tweets into a new data frame called df. Tweets which are older than the last tweet already in our database are not written into this data frame; hence, we will only write new tweets into our database. This particular step is done in the following loop, which writes each tweet from the timeline into a separate JSON file in a folder of our choice. As a last step we save the highest tweet id in our lookup file and store it in the folder as well for later updates of our database.

5.4.1 Step 1)

Get tweets from one or more accounts. Here, for example, we get as many tweets from SP Schweiz (the Social Democratic Party of Switzerland) as possible and store them in the data frame ttl. As soon as you have run this script once, decrease the number of tweets the script should fetch, as there is no need to load old tweets we have already collected.

# Gets tweets from timeline of the users with rtweet through the API:
# Set n = 3200 for the first time. Later n = 100 is enough if the script is 
# executed on a daily basis:
ttl <- get_timelines(c("SPSchweiz"), n = 3200)

5.4.2 Step 2)

Try to load the file which contains the highest tweet id we already got from the accounts. This returns the highest id we have from SPSchweiz if there are already tweets in the database.

# Loads the last tweet written to disk (lasttweet.rds) if the script has 
# already run at least once. Otherwise the id is set to "0", which is always
# older than the oldest status (tweet) of any user. (We store the status id 
# as a character string because Twitter ids are 64-bit integers and exceed 
# 2^53, the largest integer R's numeric type can represent exactly.)
lookup <- NULL
if(file.exists("lasttweet.rds")){lookup <- readRDS("lasttweet.rds")}
lastid <- lookup$Status_id
if(is.null(lastid)){lastid <- "0"}

5.4.3 Step 3)

Here we extract all tweets which are newer than the newest tweet already in the database from the accounts we are currently updating. In this case we would extract all tweets which are newer than our newest tweet from the SPSchweiz.

# Gets new tweets from ttl by checking whether each status_id is newer than 
# the newest id we already have. Because the ids are character strings, a 
# plain ">" would compare them lexicographically; comparing the length first 
# restores numeric order for ids of different lengths:
df <- data.frame()
for(j in 1:nrow(ttl)){
  id <- ttl[j, ][["status_id"]]
  if(nchar(id) > nchar(lastid) ||
     (nchar(id) == nchar(lastid) && id > lastid)){
    df <- rbind(df, ttl[j, ])
  } else {break}
}
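Note that comparing status ids as plain character strings uses lexicographic order, which misorders ids of different lengths. A small illustration with a hypothetical helper `id_newer` that compares the length first (ids carry no leading zeros, so the longer id is always the numerically larger one):

```r
# Lexicographic comparison misorders ids of different lengths:
"100" > "99"   # FALSE, although 100 is the larger number

# Comparing the length first restores numeric order, falling back to
# lexicographic comparison only for ids of equal length:
id_newer <- function(id, lastid){
  nchar(id) > nchar(lastid) || (nchar(id) == nchar(lastid) && id > lastid)
}
id_newer("100", "99")     # TRUE
id_newer("1146", "1150")  # FALSE
```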

5.4.4 Step 4)

Now it is time to write each new tweet into our folder for later preprocessing and uploading into the database. Before we run the loop doing this, we check whether there is at least one new tweet; otherwise we terminate the process directly before running anything more. If there are new tweets, they are written into a folder as JSON files with new column names. As soon as the latest tweet is written, its id overwrites the latest tweet id in our lookup file.

# If no new Tweets print "No new Tweets Today" else write the new Tweets to 
# json-files:
if(is.data.frame(df) && nrow(df)==0){
  print("No new Tweets Today")
} else {
  # Orders the data by creation time, oldest first:
  df <- df[order(df$created_at),]
  # Renames the variables:
  colnames(df) <- c("User_id", "Status_id", "Datum", "Screen_name", "Text", 
                    "Source", "Display_text_width", "Reply_to_status_id",
                    "Reply_to_user_id", "Reply_to_screen_name", 
                    "Is_quote", "Is_retweet", "Favorite_count", 
                    "Retweet_count", "Hashtags", "Symbols", "Urls_url", 
                    "Urls_t.co", "Urls_expanded_url", "Media_url",
                    "Media_t.co", "Media_expanded_url", "Media_type", 
                    "Ext_media_url", "Ext_media_t.co",
                    "Ext_media_expanded_url", "Ext_media_type", 
                    "Mentions_user_id", "Mentions_screen_name", 
                    "Lang", "Quoted_status_id", "Quoted_text", 
                    "Quoted_created_at", "Quoted_source", 
                    "Quoted_favorite_count", "Quoted_retweet_count", 
                    "Quoted_user_id", "Quoted_screen_name",
                    "Quoted_name","Quoted_followers_count",
                    "Quoted_friends_count","Quoted_statuses_count", 
                    "Quoted_location", "Quoted_description", 
                    "Quoted_verified", "Retweet_status_id", 
                    "Retweet_text", "Retweet_created_at", 
                    "Retweet_source", "Retweet_favorite_count",
                    "Retweet_retweet_count", "Retweet_user_id", 
                    "Retweet_screen_name", "Retweet_name",
                    "Retweet_followers_count", "Retweet_friends_count",
                    "Retweet_statuses_count", 
                    "Retweet_location", "Retweet_description", 
                    "Retweet_verified", "Place_url", "Place_name", 
                    "Place_full_name", "Place_type", "Country",          
                    "Country_code", "Geo_coords", "Coords_coords", 
                    "Bbox_coords", "Status_url", "Name", "Location", 
                    "Description", "Url", "Protected", "Followers_count",
                    "Friends_count", "Listed_count", "Statuses_count",
                    "Favourites_count", "Account_created_at", "Verified", 
                    "Profile_url", "Profile_expanded_url", "Account_lang",         
                    "Profile_banner_url", "Profile_background_url", 
                    "Profile_image_url")
  
  # Add Columns with actor name and shortname as well as source:
  df$Akteur <- "Sozialdemokratische Partei der Schweiz"
  df$Kürzel <- "SPS"
  df$Quelle <- "https://www.twitter.com"
  
  # Modify data:
  # Converts Status_id to character for safety reasons:
  df$Status_id <- as.character(df$Status_id)
  
  # Converts Retweets-creation-time to character:
  df$Retweet_created_at <- as.character(df$Retweet_created_at)
  
  # Converts Quoteds-creation-time to character:
  df$Quoted_created_at <- as.character(df$Quoted_created_at)
  
  # Converts NAs to empty strings, since Elastic doesn't like NAs:
  df[is.na(df)] <- ''
  
  # Since elastic will not accept empty date fields for any type of time 
  # fields we give them a special time-stamp:
  df$Retweet_created_at[df$Retweet_created_at == ''] <- "1900-01-01"
  df$Quoted_created_at[df$Quoted_created_at == ''] <- "1900-01-01"
  
  # Saves each Tweet (row) in a JSON file:
  for(i in 1:nrow(df)){
    tmp <- df[i,]
    mytime <- tmp$Datum
    mytime <- as.POSIXlt(mytime)
    Name <- tmp$Kürzel
    Text <- tmp$Text
    tmp$Datum <- as.POSIXlt(tmp$Datum)
    tmp$Text <- gsub("\r?\n|\r", " ", Text) # Replaces \n and \r with a space
    myfile <- file.path(getwd(), paste0(Name,"_Twitter_", mytime, ".json"))
    write_json(tmp, path = myfile, na = 'string', auto_unbox = TRUE)
    }
  
  # Saves last Tweet (newest Tweet) in lasttweet.rds for next run of script:
  newesttweet <- df[nrow(df),]
  saveRDS(newesttweet, file = "lasttweet.rds")
}
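The NA and date handling above can be checked on a toy data frame; the two columns here are just stand-ins for the full set of variables:

```r
# Toy data frame with missing values, mimicking two of the columns above:
toy <- data.frame(Text = c("hello", NA),
                  Retweet_created_at = c("2020-01-01", NA),
                  stringsAsFactors = FALSE)

# NAs become empty strings, since Elastic doesn't accept NAs:
toy[is.na(toy)] <- ''

# Empty date fields receive the placeholder time-stamp:
toy$Retweet_created_at[toy$Retweet_created_at == ''] <- "1900-01-01"

toy$Text[2]                # ""
toy$Retweet_created_at[2]  # "1900-01-01"
```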

Now you have your first data-collecting Twitter script. If you want it to run on a daily or weekly basis, you can schedule the script on your Linux server with crontab, which is fairly easy (see: https://help.ubuntu.com/community/CronHowto).
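For example, a crontab entry that runs the script every morning at 06:00 could look like the line below; the directory and script name are placeholders for your own setup. Changing into the script's directory first makes sure the JSON files and lasttweet.rds end up next to the script:

```shell
# m  h  dom mon dow  command
0 6 * * * cd /home/user/twitter && /usr/bin/Rscript collect_tweets.R >> cron.log 2>&1
```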

Also, if you are interested in more than one user, you can use a list of Twitter handles to run the procedure for multiple accounts instead of just one.
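As a sketch, assuming each account gets its own lookup file so the last-seen ids do not interfere, the update for several handles could look like this (the handle vector and file names are placeholders):

```r
# Hypothetical vector of Twitter handles to update:
handles <- c("SPSchweiz", "AnotherParty", "ThirdAccount")

for(h in handles){
  # Fetches the recent timeline of this account:
  ttl <- get_timelines(h, n = 100)
  # Each account keeps its own lookup file with the newest id seen so far:
  lookupfile <- paste0("lasttweet_", h, ".rds")
  lastid <- "0"
  if(file.exists(lookupfile)){lastid <- readRDS(lookupfile)$Status_id}
  # ... filter ttl against lastid and write the JSON files exactly as in
  # Steps 3) and 4), then save the newest tweet for this account:
  # saveRDS(newesttweet, file = lookupfile)
}
```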