Using R to learn about your tweeting habits – Part One

29 Jan 2017

Ever wondered what kind of tweeter you are? What do you tweet about the most? Do you tweet more on weekdays like me, or are you more of a weekend tweeter? Are you hashtag-prone, peppering your tweets with hashtags, or do you sometimes forget to hashtag altogether? I am the former type, but I am trying to kick the habit! It’s not easy, given that Twitter lets you hashtag anything and everything. This post describes how you can get your own Twitter archive and analyze it from different angles.

Let’s get that Twitter Data

After finishing my course on Data Analytics using R, I wanted to learn more about the R APIs available for text manipulation, text analysis and plotting. Since I am a regular tweeter and my tweeting data is readily available, I thought, why not use R to find out more about my tweeting habits? Twitter lets you easily access your tweets in a nicely formatted CSV file that you can open in Excel. All you have to do is browse to your Twitter Profile > Settings page and look for the “Your Twitter Data” option on the bottom left. Once you click on that, the next page should have the option for downloading your entire Tweet history.

This leads to the option as seen below, of getting Twitter to email you your archive. My screenshot shows a “Resend email” option because I just requested an archive. You should receive an email within a few minutes, if not seconds, with your data in a zip format.

 

Review the Data

By default, the data contains all the columns in the image below. You have the option of removing any columns you think you won’t need for your analysis, but I kept them all, just in case. This is what my Twitter data looked like after I did some minor formatting. I had about 800 tweets in my data set.

 

Load your Twitter data into RStudio

Use a simple read command in R to load the data.
setwd('C:/Documents/RStudio Working Directory') # Set the working directory that contains your archive folder. Yours will be different
# stringsAsFactors = FALSE turns off automatic conversion of character strings to factors
mytweets <- read.csv('Salman Twitter Archive/tweets.csv', stringsAsFactors = FALSE)

Use the summary function to check the type of each column.
summary(mytweets)
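If you want a more direct view of the column types than summary gives, base R's class and str functions are handy. Here is a minimal sketch on a tiny made-up data frame; the sample_tweets name and its values are placeholders for illustration, not my actual archive.

```r
# Hypothetical stand-in for the archive, with a few of the columns used in this post
sample_tweets <- data.frame(
  tweet_id = c(1, 2, 3),
  timestamp = c("2016-06-01 14:05:00 +0000",
                "2016-06-04 09:30:00 +0000",
                "2016-07-10 18:45:00 +0000"),
  text = c("Learning #rstats", "Good morning", "More #rstats #dataviz"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
column_types <- sapply(sample_tweets, class)
print(column_types)

# str() prints a compact overview: one line per column with its type and first values
str(sample_tweets)
```

On your own data, replace sample_tweets with mytweets. The timestamp column should show up as character at this stage.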

You will notice that the timestamp column is of type character. To do any time-based analysis, we need the timestamp column to be a date-time type, which plays well with the plotting functions. Let’s fix this using the lubridate package, which provides tools that make it easier to parse and manipulate dates. To me, it makes the most sense to convert the timestamp into a Year Month Day format, so here we go.

library(lubridate)
mytweets$timestamp <- ymd_hms(mytweets$timestamp)

You can run the summary command again to confirm that the timestamp column is now a date-time type.
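To see what the conversion does, here is a small sketch on a single made-up timestamp. I am assuming the archive stores timestamps in the "YYYY-MM-DD HH:MM:SS +0000" form; check a few rows of your own file, since yours may differ.

```r
library(lubridate)

# Hypothetical timestamp string in the format assumed above
raw_timestamp <- "2016-06-01 14:05:00 +0000"

# ymd_hms() parses year-month-day hour-minute-second strings into POSIXct date-times
parsed <- ymd_hms(raw_timestamp)

print(class(parsed))  # a date-time class rather than "character"

# Once parsed, individual components are easy to pull out
print(year(parsed))
print(month(parsed, label = TRUE))
print(wday(parsed, label = TRUE))
```

Any value that lubridate cannot parse becomes NA with a warning, so sum(is.na(mytweets$timestamp)) is a quick sanity check after the conversion.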

To learn more about lubridate and other packages, you can check out the R-bloggers website.

 

Plot some graphs!

Install the ggplot2 package if you haven’t already, then load the ggplot2 library.

library(ggplot2)

Tweets over time

The code below plots your tweets over time. The more tweets in a time period, the further that bar's color shifts from red toward blue.

ggplot(data = mytweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) + # color each bar by its tweet count
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "red", high = "blue") # shade depends on tweet frequency: low = red, high = blue

Use “fill = ..count..” with geom_histogram if you want the color of each bar to reflect its height. In ggplot2 3.4.0 and later, this dot-dot notation is deprecated in favor of “fill = after_stat(count)”.
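If you are on a recent ggplot2 and want to avoid the deprecation warning, here is a sketch of the same histogram written with after_stat(count). It runs on randomly generated timestamps, so the sample_tweets data frame, the seed and the bins value are all my own placeholders rather than anything from my archive.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 random timestamps spread over roughly ten months
set.seed(42)
start_time <- as.POSIXct("2016-04-01 00:00:00", tz = "UTC")
sample_tweets <- data.frame(
  timestamp = start_time + runif(200, min = 0, max = 300 * 24 * 3600)
)

# after_stat(count) refers to the per-bin count computed by the histogram stat
tweets_over_time <- ggplot(sample_tweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = after_stat(count)), bins = 30) +
  scale_fill_gradient(low = "red", high = "blue") +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets")

print(tweets_over_time)
```

Setting bins explicitly also silences the message ggplot2 prints when it falls back to its default of 30 bins. Try a few values to see which one shows your tweeting rhythm best.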

[Image: tweets over time]

Just glancing at the plot quickly, I can tell what I was up to over the last year. I opened my Twitter account in April last year, right after my data management course, which introduced me to ways of capturing and analyzing social media data. The highs in the summertime reflect longer days, which meant I was more active and probably had more time to tweet. I think productivity is directly proportional to the weather. At least for me, summertime is when I am most productive. The radio silence in late August and early September makes sense because I was vacationing in Karachi/Istanbul and completely disconnected myself from social media. The frequency of my tweets slowed down in November, as I was likely reeling from the unexpected results of the US elections, or maybe it was all those R assignments I was working on. Either way, I recovered and started strongly in January.

Busiest tweeting day of the week

I applied lubridate’s wday function to the timestamp column to group the tweets by day of the week, then plotted the result.

ggplot(data = mytweets, aes(x = wday(timestamp, label = TRUE))) + # label = TRUE gives day names instead of numbers
  geom_bar(fill = "springgreen4", width = 0.4) +
  ggtitle("Tweets by Weekday") +
  xlab("Day of Week") + ylab("Number of tweets")

[Image: tweets by weekday]

Clearly I am not an avid weekend tweeter. Maybe you are!
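If you would like the numbers behind the bars, here is a rough sketch that counts tweets per weekday and works out the weekend share. The four timestamps are made up for illustration, and I am relying on lubridate's default of numbering Sunday as 1 and Saturday as 7.

```r
library(lubridate)

# Hypothetical timestamps: one Monday, two on a Tuesday, one Saturday
sample_times <- ymd_hms(c("2016-06-06 10:00:00",
                          "2016-06-07 11:00:00",
                          "2016-06-07 15:30:00",
                          "2016-06-11 09:00:00"))

# Count tweets per day of the week; label = TRUE returns day names as an ordered factor
weekday_counts <- table(wday(sample_times, label = TRUE))
print(weekday_counts)

# With the default week start, wday() returns 1 for Sunday and 7 for Saturday
is_weekend <- wday(sample_times) %in% c(1, 7)
weekend_share <- mean(is_weekend)
print(weekend_share)
```

On your own data, swap sample_times for mytweets$timestamp after the ymd_hms conversion.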

Tweets vs Retweets

Before looking at the results of this analysis, I would have guessed that I retweet more than I tweet. But the data tells me otherwise. A reminder to always double-check your gut instincts with data-driven analysis.

ggplot(data = mytweets, aes(x = factor(!is.na(retweeted_status_id)))) + # TRUE when the tweet is a retweet
  geom_bar(fill = "coral2") +
  xlab("") + ylab("Number of tweets") +
  ggtitle("Retweeted or Not") +
  scale_x_discrete(labels = c("Not Retweeted", "Retweeted"))

[Image: retweeted or not]
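The same split can be checked as plain numbers. This is a minimal sketch on a made-up data frame, assuming (as the plot code does) that retweeted_status_id is NA for original tweets and filled in for retweets. The ID value below is invented.

```r
# Hypothetical stand-in data: one retweet among four tweets
sample_tweets <- data.frame(
  text = c("My own thought", "RT @someone: a great link", "Another original", "Good morning"),
  retweeted_status_id = c(NA, 7.1e17, NA, NA),
  stringsAsFactors = FALSE
)

# A tweet counts as a retweet when it carries the ID of the tweet it repeats
is_retweet <- !is.na(sample_tweets$retweeted_status_id)

# Tabulate with readable labels; fixing the levels keeps both categories even if one is empty
retweet_counts <- table(factor(is_retweet,
                               levels = c(FALSE, TRUE),
                               labels = c("Original", "Retweet")))
print(retweet_counts)

# Same split as proportions
print(prop.table(retweet_counts))
```

If read.csv loaded the ID column as character in your file, empty strings would not be NA, so you would need nzchar(mytweets$retweeted_status_id) instead of the is.na test.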

Hashtag or Not to Hashtag – That is the Question

My gut tells me that I hashtag a lot. But, to be sure, I am still going to look at the data, as any data scientist would. This time, my gut feeling was spot on. The analysis tells me that I use hashtags almost three times as often as I don’t! In fact, this understates how much I hashtag, because the approach I used only checks whether a tweet contains at least one hashtag and returns true. Many of my tweets have multiple hashtags.

ggplot(data = mytweets, aes(x = factor(grepl("#", text)))) + # TRUE when the tweet contains at least one hashtag
  geom_bar(fill = "turquoise4", width = 0.3) +
  xlab("") + ylab("Number of Tweets") +
  ggtitle("Tweets with and without hashtags") +
  scale_x_discrete(labels = c("Without hashtags", "With hashtag"))

[Image: hashtag or not]
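Since grepl only answers yes or no per tweet, one way to capture tweets with multiple hashtags is to count the matches instead. Below is a rough sketch using base R's gregexpr and regmatches on made-up tweets. The pattern "#\\w+" is my own simplification of what counts as a hashtag, not Twitter's official rule.

```r
# Hypothetical tweets for illustration
sample_text <- c("Learning #rstats today",
                 "No tags here",
                 "#rstats #dataviz #ggplot2 all day",
                 "Coffee first")

# Find every run of "#" followed by word characters in each tweet
hashtag_matches <- regmatches(sample_text, gregexpr("#\\w+", sample_text))

# Number of hashtags in each tweet (0 when there is no match)
hashtags_per_tweet <- lengths(hashtag_matches)
print(hashtags_per_tweet)

# Frequency of each hashtag across all tweets, most used first
hashtag_frequency <- sort(table(unlist(hashtag_matches)), decreasing = TRUE)
print(hashtag_frequency)
```

On your own data, pass mytweets$text in place of sample_text. mean(hashtags_per_tweet) then gives the average number of hashtags per tweet.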

Recap

So there you have it, folks. This post showed how to get your Twitter archive, load it into RStudio and analyze the data using R in search of tweeting patterns. In the next post, I will look into creating a word corpus and use it to build word clouds from my tweets! Stay tuned!

p.s. Follow me on Twitter or LinkedIn if you want to stay up to date on my posts!
