Ever wondered what kind of tweeter you are? What do you tweet about the most? Do you tweet more on weekdays like me, or are you more of a weekend tweeter? Are you hashtag prone, peppering your tweets with hashtags, or do you sometimes forget to hashtag entirely? I am the former type, but I am trying to kick the habit! It’s not easy, given that Twitter lets you hashtag anything and everything. This post will describe how you can get your own Twitter archive and analyze it from different angles.
Let’s get that Twitter Data
After finishing my course on Data Analytics using R, I wanted to learn more about the R packages available for text manipulation, text analysis and plotting. Since I am a regular tweeter, and my tweeting data is readily available, I thought: why not use R to find out more about my tweeting habits? Twitter lets you easily access your tweets in a nicely formatted CSV file. All you have to do is browse to your Twitter Profile > Settings page and look for the “Your Twitter Data” option on the bottom left. Once you click on that, the next page should have the option for downloading your entire tweet history.
This leads to the option, seen below, of getting Twitter to email you your archive. My screenshot shows a “Resend email” option because I had just requested an archive. You should receive an email within a few minutes, if not seconds, with your data as a ZIP file.
Review the Data
By default, the data contains all the columns in the image below. You have the option of removing any columns you think you won’t need for your analysis, but I kept them all, just in case. This is what my Twitter data looked like after I did some minor formatting. I had about 800 tweets in my data set.
Load your Twitter data into RStudio
Use a simple read command in R to load the data.
setwd('C:/Documents/RStudio Working Directory') # Set the working directory where tweets.csv resides. Yours may be different.
mytweets <- read.csv("Salman Twitter Archive/tweets.csv", stringsAsFactors = FALSE) # "stringsAsFactors = FALSE" turns off automatic conversion of character strings to factors.
Use the summary function to check the type of each column.
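For example (assuming your data frame is named mytweets, as above):

```r
# Check each column; the timestamp column will show up as type "character"
summary(mytweets)
str(mytweets)  # str() gives a more compact view of the class of each column
```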
You will notice that the timestamp column is of type character. To do any time-based analysis, we need the timestamp column to be a date/time type, which will play well with the plotting functions. Let’s fix this using the lubridate package, which provides tools that make it easier to parse and manipulate dates. To me, it makes most sense to convert the date into a year-month-day format, so here we go.
library(lubridate) # install.packages("lubridate") first if you don't have it
mytweets$timestamp <- ymd_hms(mytweets$timestamp)
You can run the summary command again to confirm that the timestamp column is now a date-time type.
To learn more about lubridate and other packages, you can check out R-bloggers website.
Plot some graphs!
Install the ggplot2 package if you haven’t already, and load the ggplot2 library.
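A quick sketch of that setup step:

```r
install.packages("ggplot2")  # only needed the first time
library(ggplot2)             # load the library for each session
```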
Tweets over time
The code below will plot your tweets over time. The more tweets there are in a time period, the darker the color shade.
ggplot(data = mytweets, aes(x = timestamp)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") +
scale_fill_gradient(low = "red", high = "blue") # decides shade based on frequency of tweets. Low frequency = red shade.
Use "fill = ..count.." with geom_histogram if you want the intensity of the histogram to reflect its height.
Just glancing at the plot quickly, I can tell what I was up to over the last year. I opened my Twitter account in April last year, right after my data management course, which introduced me to ways of capturing and analyzing social media data. The highs in the summer time reflect longer days, which meant I was more active and probably had more time to tweet. I think productivity is directly proportional to the weather; at least for me, summer is when I am most productive. The radio silence in late August and early September made sense because I was vacationing in Karachi/Istanbul and completely disconnected myself from social media. The frequency of my tweets slowed down in November, as I was likely reeling from the unexpected results of the US elections, or maybe it was all those R assignments I was working on. Either way, I recovered and started strongly in January.
Busiest tweeting day of the week
I applied lubridate's wday function to the timestamp column to categorize the data by weekday, and then plotted the graph below.
ggplot(data = mytweets, aes(x = wday(timestamp, label = TRUE))) +
geom_bar(fill = "springgreen4", width = 0.4) +
ggtitle("Tweets by Weekday") +
xlab("Day of Week") + ylab("Number of tweets")
Clearly I am not an avid weekend tweeter. Maybe you are!
Tweets vs Retweets
Before looking at the results of this analysis, I would have guessed that I retweet more than I tweet. But the data tells me otherwise. A reminder to always double-check your gut instincts with data-driven analysis.
ggplot(data = mytweets, aes(factor(!is.na(retweeted_status_id)))) + # TRUE when the tweet is a retweet
geom_bar(fill = "coral2") +
xlab("") + ylab("Number of tweets") +
ggtitle("Retweeted or Not") +
scale_x_discrete(labels = c("Not Retweeted", "Retweeted"))
Hashtag or Not to Hashtag – That is the Question
My gut tells me that I hashtag a lot. But, to be sure, I am still going to look at the data, as any data scientist would. This time, my gut feeling was spot on. The analysis tells me that I use hashtags almost 3 times more often than I don’t! In fact, I suspect my hashtag usage is even higher than this suggests, because the approach I used only checks whether a tweet contains at least one hashtag, and many of my tweets have multiple hashtags.
ggplot(data = mytweets, aes(factor(grepl("#", text)))) + # looks for a hashtag in a tweet
geom_bar(fill = "turquoise4", width = 0.3) +
xlab("") + ylab("Number of Tweets") +
ggtitle("Tweets with and without hashtags") +
scale_x_discrete(labels = c("Without hashtags", "With hashtags"))
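If you wanted to count multiple hashtags per tweet rather than just checking for one, a rough sketch (not part of the analysis above) could use base R's gregexpr:

```r
# Count the hashtags in each tweet. gregexpr returns the match
# positions for each string, or -1 when there is no match.
matches <- gregexpr("#\\w+", mytweets$text)
hashtag_counts <- sapply(matches, function(m) if (m[1] == -1) 0 else length(m))
table(hashtag_counts)  # how many tweets have 0, 1, 2, ... hashtags
```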
So there you have it, folks. This post showed how to get your Twitter archive from Twitter, load it into RStudio and analyze the data using R in search of tweeting patterns. In the next post, I will look into creating a word corpus and use it to build word clouds from my tweets! Stay tuned!