A Guide to Mining and Analysing Tweets with R (2024)

Simple Steps to Writing an Insightful Twitter Analytics Report

Twitter provides us with vast amounts of user-generated language data, a dream for anyone wanting to conduct textual analysis. More than that, tweets allow us to gain insights into online public behaviour. As such, analysing Twitter has become a crucial source of information for brands and agencies.

Several factors have given Twitter considerable advantages over other social media platforms for analysis. First, the limited character count of tweets provides us with a relatively homogeneous corpus. Second, the millions of tweets published every day give access to large data samples. Third, tweets are publicly available and easily retrievable via APIs.

Nonetheless, extracting these insights still requires some programming knowledge. This is why, most often, brands and agencies rely on easy-to-use analytics tools such as SproutSocial and Talkwalker, which provide these insights at a cost in just one click.

In this article, I help you break down these barriers with a simple guide on how to extract and analyse tweets with the programming language R.

Here are 3 reasons why you might choose to do so:

  • R is free, i.e. you will be able to produce a Twitter Analytics Report at no cost and learn how to code at the same time!
  • R gives you near-endless options for analysis. You can therefore tailor the analysis to exactly what you wish to examine instead of relying on a one-size-fits-all report.
  • R allows you to analyse any Twitter account you want even if you don’t have the log-in details. This is a huge advantage compared to many analytics tools that require you to have the log-in details in order to analyse the information in the first place.

Convinced? Let’s get started, then!

To get started, you first need access to the Twitter API. This will allow you to retrieve the tweets; without it, you cannot do anything. Getting access is easy. First make sure you have a Twitter account, otherwise create one. Then, apply for a developer account via the following website: https://developer.twitter.com/en/apply-for-access.html. You’ll need to fill in an application form, which includes explaining a little bit more about what you wish to analyse.

Once your application has been accepted by Twitter (which doesn’t take too long), you’ll receive the following credentials, which you need to keep safe:

  • Consumer key
  • Consumer Secret
  • Access Token
  • Access Secret

Once you have the information above, start R and download the package “rtweet”, which I will use to extract the tweets.

install.packages("rtweet")
library(rtweet)

Then, set up the authentication to connect to Twitter. You do this by entering the name of your app, consumer key and consumer secret, all of which you received when applying for the Twitter API. You will be redirected to a Twitter page and asked to authorise the app. Once this is done, you can return to R and start the analysis of your tweets!

# Replace each "****" with your own credentials
twitter_token <- create_token(
  app = "****",
  consumer_key = "****",
  consumer_secret = "****",
  set_renv = TRUE)
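If you would rather not type your credentials directly into a script, one option is to keep them in environment variables and read them at run time. The sketch below is only an illustration in base R: the helper read_twitter_credentials and the variable names TWITTER_APP, TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET are my own assumptions, not something rtweet requires.

```r
# Hypothetical helper: reads credentials from environment variables.
# The variable names (prefix "TWITTER_") are an assumption for this sketch.
read_twitter_credentials <- function(prefix = "TWITTER_") {
  keys <- c("APP", "CONSUMER_KEY", "CONSUMER_SECRET")
  values <- Sys.getenv(paste0(prefix, keys), unset = NA)

  # Fail early with a readable message if anything is missing or empty.
  missing <- keys[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ",
         paste0(prefix, missing, collapse = ", "))
  }

  # Return a named list: app, consumer_key, consumer_secret.
  setNames(as.list(unname(values)), tolower(keys))
}

# Possible usage with rtweet (not run here, it needs real credentials):
# creds <- read_twitter_credentials()
# twitter_token <- create_token(
#   app = creds$app,
#   consumer_key = creds$consumer_key,
#   consumer_secret = creds$consumer_secret,
#   set_renv = TRUE)
```

The advantage of this approach is that the script can be shared or committed without exposing the keys.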

Searching for tweets

Depending on the analysis you wish to perform, you may want to search for tweets that contain a specific word or hashtag. Note that you can only extract tweets from the past 6 to 9 days, so keep this in mind for your analysis.

To do this, simply use the search_tweets function followed by a few specifications: the number of tweets to extract (n), whether or not to include retweets and the language of the tweets. As an example, see the line of code below.

climate <- search_tweets("climate", n = 1000, include_rts = FALSE, lang = "en")

Search for a specific user account

Alternatively, you may want to analyse a specific user account. In this case, use the get_timeline function followed by the Twitter handle and the number of tweets you wish to extract. Note that here you can only extract the last 3200 tweets.

In this example, I chose to extract the tweets of Bill Gates. The advantage here is that Bill Gates’ account has 3169 tweets overall, which is under the 3200 threshold.

Gates_tweets <- get_timeline("@BillGates", n = 3200)

In this part, I show you 8 key insights you should include in every Twitter Analytics Report. To do this, let’s delve into the Twitter account of Bill Gates a bit more!

1. SHOW WHAT WORKS BEST AND WHAT DOESN’T

The first part of any report should deliver clear information as to what worked best and what didn’t. Finding out the best and least performing tweets gives a quick and clear overall picture of the account.

In order to do this, you first need to distinguish between organic tweets, retweets and replies. The following lines of code show you how to remove the retweets and replies from your sample to keep only the organic tweets. Content-wise, these are the ones you want to analyse!

# Remove retweets
Gates_tweets_organic <- Gates_tweets[Gates_tweets$is_retweet==FALSE, ]
# Remove replies
Gates_tweets_organic <- subset(Gates_tweets_organic, is.na(Gates_tweets_organic$reply_to_status_id))

Then, you’ll want to analyse engagement by looking at the variables favorite_count (i.e. the number of likes) or retweet_count (i.e. the number of retweets). Simply arrange them in descending order (with a minus “-” before the variable) to find the tweet with the highest number of likes or retweets, or in ascending order (without the minus) to find the one with the lowest engagement.

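library(dplyr)
# Most liked tweet (column 5 is assumed to hold the tweet text)
Gates_tweets_organic <- Gates_tweets_organic %>% arrange(-favorite_count)
Gates_tweets_organic[1,5]
# Most retweeted tweet
Gates_tweets_organic <- Gates_tweets_organic %>% arrange(-retweet_count)
Gates_tweets_organic[1,5]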
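If you want to see the ranking logic in isolation, here is a minimal sketch in base R on a toy data frame. The values are invented and toy_tweets simply stands in for Gates_tweets_organic; selecting the text column by name also avoids relying on its position.

```r
# Toy data standing in for Gates_tweets_organic (values are invented).
toy_tweets <- data.frame(
  text = c("tweet A", "tweet B", "tweet C", "tweet D"),
  favorite_count = c(120, 45, 980, 3),
  retweet_count = c(30, 60, 210, 1),
  stringsAsFactors = FALSE
)

# Rank by likes, most liked first.
ranked <- toy_tweets[order(-toy_tweets$favorite_count), ]

best  <- head(ranked, 1)  # most liked tweet
worst <- tail(ranked, 1)  # least liked tweet

best$text
worst$text
```

Swapping favorite_count for retweet_count gives the same ranking by retweets.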

2. SHOW THE RATIO OF REPLIES/RETWEETS/ORGANIC TWEETS

Analysing the ratio of replies, retweets and organic tweets can tell you a great deal about the type of account you’re analysing. No one likes a Twitter account that exclusively retweets for instance, without any individual content. Finding a good ratio of replies, retweets and organic tweets is therefore a key metric to monitor if one wishes to improve the performance of his or her account.

As a first step, make sure to create three different data sets. As you’ve already created a dataset containing only the organic tweets in the previous steps, simply now create a dataset containing only the retweets and one containing only the replies.

# Keeping only the retweets
Gates_retweets <- Gates_tweets[Gates_tweets$is_retweet==TRUE,]
# Keeping only the replies
Gates_replies <- subset(Gates_tweets, !is.na(Gates_tweets$reply_to_status_id))

Then, create a separate data frame containing the number of organic tweets, retweets, and replies. These numbers are easy to find: they are the number of observations for your three respective datasets.

# Creating a data frame
data <- data.frame(
category=c("Organic", "Retweets", "Replies"),
count=c(2856, 192, 120)
)
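Rather than typing the counts by hand, you can let R compute them with nrow(). The sketch below uses a small invented timeline with the same two columns used above (is_retweet and reply_to_status_id); the names toy_timeline and counts are my own.

```r
# Toy timeline (invented values): NA in reply_to_status_id means "not a reply".
toy_timeline <- data.frame(
  is_retweet = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  reply_to_status_id = c(NA, "101", NA, NA, NA, "102"),
  stringsAsFactors = FALSE
)

is_reply <- !is.na(toy_timeline$reply_to_status_id)

organic  <- toy_timeline[!toy_timeline$is_retweet & !is_reply, ]
retweets <- toy_timeline[toy_timeline$is_retweet, ]
replies  <- toy_timeline[is_reply, ]

# The counts are simply the number of rows in each subset.
counts <- data.frame(
  category = c("Organic", "Retweets", "Replies"),
  count = c(nrow(organic), nrow(retweets), nrow(replies))
)
counts
```

This way the data frame stays correct when you re-run the extraction at a later date.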

Once you’ve done that, you can start preparing your data frame for a donut chart as shown below. This includes adding columns that calculate the ratios and percentages and some visualisation tweaks such as specifying the legend and rounding your data.

# Adding columns 
data$fraction = data$count / sum(data$count)
data$percentage = data$count / sum(data$count) * 100
data$ymax = cumsum(data$fraction)
data$ymin = c(0, head(data$ymax, n=-1))
# Rounding the data to two decimal points
data <- round_df(data, 2)
# Specify what the legend should say
Type_of_Tweet <- paste(data$category, data$percentage, "%")
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Type_of_Tweet)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")
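To make the arithmetic behind the donut chart more concrete, here is a short worked sketch in base R using the three counts quoted above. Each segment starts (ymin) where the previous one ends (ymax), and the last segment ends at 1. The object names are my own.

```r
# Counts as given in the text above.
counts <- c(Organic = 2856, Retweets = 192, Replies = 120)

fraction <- counts / sum(counts)     # share of each category
percentage <- round(fraction * 100, 2)

ymax <- cumsum(fraction)             # where each segment ends
ymin <- c(0, head(ymax, n = -1))     # where each segment starts

segments <- data.frame(
  category = names(counts),
  percentage = unname(percentage),
  ymin = unname(ymin),
  ymax = unname(ymax)
)
segments
```

With these counts the shares come out at roughly 90.15 %, 6.06 % and 3.79 %.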

3. SHOW WHEN THE TWEETS ARE PUBLISHED

Thanks to the date and time extracted with each tweet, it is easy to see when Bill Gates tweets most. This gives us an overview of the account’s activity and is a useful metric to compare against the best and worst performing tweets.

In this example, I analyse the frequency of tweets by year. Note that you can also do so by month by simply changing “year” to “month” in the following line of code. Alternatively, you can also analyse the publishing behaviour by hour with the R packages hms and scales.

colnames(Gates_tweets)[colnames(Gates_tweets)=="screen_name"] <- "Twitter_Account"
ts_plot(dplyr::group_by(Gates_tweets, Twitter_Account), "year") +
ggplot2::theme_minimal() +
ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
ggplot2::labs(
x = NULL, y = NULL,
title = "Frequency of Tweets from Bill Gates",
subtitle = "Tweet counts aggregated by year",
caption = "\nSource: Data collected from Twitter's REST API via rtweet"
)
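For the analysis by hour, a rough sketch in base R (instead of the hms and scales packages mentioned above) could look like this. The timestamps are invented and stand in for the created_at column; I assume here that they are expressed in UTC.

```r
# Invented timestamps standing in for the created_at column (assumed UTC).
created_at <- as.POSIXct(
  c("2020-01-03 09:15:00", "2020-01-03 09:50:00",
    "2020-02-11 14:05:00", "2020-03-20 21:30:00",
    "2020-03-21 14:45:00"),
  tz = "UTC"
)

# Extract the hour of day (0 to 23) from each timestamp.
hour_of_day <- as.integer(format(created_at, "%H", tz = "UTC"))

# Count tweets per hour, keeping hours with zero tweets.
tweets_per_hour <- table(factor(hour_of_day, levels = 0:23))

# First hour with the highest count (ties resolve to the earliest hour).
busiest_hour <- as.integer(names(which.max(tweets_per_hour)))
busiest_hour
```

Keep the time zone in mind when interpreting the result: an hour in UTC is not necessarily the local time of the person tweeting.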

4. SHOW FROM WHERE THE TWEETS ARE PUBLISHED

Analysing the platform from which tweets are published is another cool insight to have. One of the reasons is that we can, to a certain extent, deduce whether or not Bill Gates is the one tweeting. As a result, this helps us define the personality of the tweets.

In this step, you’re interested in the source variable collected by the rtweet package. The following lines of code show you how to aggregate this data by type of source and count the frequency of tweets for each type. Note that I have only kept the sources from which more than 11 tweets were published, to simplify the visualisation.

Gates_app <- Gates_tweets %>% 
select(source) %>%
group_by(source) %>%
summarize(count=n())
Gates_app <- subset(Gates_app, count > 11)
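A possible variation, if you would rather not lose the rare sources entirely, is to group them under a single “Other” label. This is only a sketch in base R with invented source labels and a hypothetical threshold of 2; it is not what the code above does, which drops them.

```r
# Invented source labels standing in for Gates_tweets$source.
tweet_source <- c(rep("Twitter Web Client", 6), rep("Sprinklr", 4),
                  rep("Hootsuite", 3), "Twitter for iPhone", "Mobile Web")

source_counts <- table(tweet_source)
min_tweets <- 2  # hypothetical threshold; the text uses 11 on the real data

# Relabel rare sources as "Other" instead of dropping them.
rare <- names(source_counts)[source_counts <= min_tweets]
lumped <- ifelse(tweet_source %in% rare, "Other", tweet_source)

lumped_counts <- sort(table(lumped), decreasing = TRUE)
lumped_counts
```

The total number of tweets is preserved, so the percentages in the chart still refer to the whole sample.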

Once this is done, the process is similar to the donut chart created previously!

data <- data.frame(
category=Gates_app$source,
count=Gates_app$count
)
data$fraction = data$count / sum(data$count)
data$percentage = data$count / sum(data$count) * 100
data$ymax = cumsum(data$fraction)
data$ymin = c(0, head(data$ymax, n=-1))
data <- round_df(data, 2)
Source <- paste(data$category, data$percentage, "%")
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=Source)) +
geom_rect() +
coord_polar(theta="y") + # Try to remove that to understand how the chart is built initially
xlim(c(2, 4)) +
theme_void() +
theme(legend.position = "right")

Note that most of the tweets from Bill Gates originate from Twitter Web Client, Sprinklr and Hootsuite — an indication that Bill Gates is most likely not the one tweeting himself!

5. SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

A Twitter Analytics Report should of course include an analysis of the content of the tweets and this includes finding out which words are used most.

Because you’re analysing textual data, make sure to clean it first and strip out any characters that you don’t want to show in your analysis, such as hyperlinks, @ mentions or punctuation. The lines of code below provide you with basic cleaning steps for tweets.

Gates_tweets_organic$text <- gsub("https\\S*", "", Gates_tweets_organic$text)
Gates_tweets_organic$text <- gsub("@\\S*", "", Gates_tweets_organic$text)
# Match the HTML entity "&amp;" rather than "amp", which would also hit words such as "campaign"
Gates_tweets_organic$text <- gsub("&amp;", "", Gates_tweets_organic$text)
Gates_tweets_organic$text <- gsub("[\r\n]", "", Gates_tweets_organic$text)
Gates_tweets_organic$text <- gsub("[[:punct:]]", "", Gates_tweets_organic$text)
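If you plan to reuse these cleaning steps, it can help to wrap them in a function and try it on a single string first. The sketch below is a variation on the steps above, not a copy: clean_tweet_text is a name I made up, line breaks are replaced by a space (so that words on either side are not glued together), and repeated spaces are collapsed at the end.

```r
# Hypothetical helper bundling the cleaning steps into one function.
clean_tweet_text <- function(text) {
  text <- gsub("https?://\\S+", "", text)        # hyperlinks
  text <- gsub("@\\w+", "", text)                # @ mentions
  text <- gsub("&amp;", "", text, fixed = TRUE)  # HTML-escaped ampersand
  text <- gsub("[\r\n]", " ", text)              # line breaks become spaces
  text <- gsub("[[:punct:]]", "", text)          # punctuation
  text <- gsub("\\s+", " ", text)                # collapse repeated whitespace
  trimws(text)
}

# Invented example tweet.
raw <- "Great chat with @WHO on vaccines &amp; health!\nRead more: https://example.org/post"
clean_tweet_text(raw)
```

Testing on one or two invented tweets like this makes it easier to spot a pattern that removes more than intended.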

As a second step, make sure to remove stop words from the text. This is important for your analysis of the most frequent words, as you don’t want the most commonly used words such as “to” or “and” to appear: these don’t carry much meaning for your analysis.

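library(tidytext)  # provides unnest_tokens() and the stop_words data set
# Split the text into one word per row
tweets <- Gates_tweets_organic %>%
select(text) %>%
unnest_tokens(word, text)
# Drop every word that appears in the stop word list
tweets <- tweets %>%
anti_join(stop_words, by = "word")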
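To see what tokenising and removing stop words actually does, here is a tiny sketch in base R. The stop word list is hand-picked and far shorter than the one shipped with tidytext, and the two sentences are invented.

```r
# Tiny hand-picked stop word list, for illustration only.
toy_stop_words <- c("the", "to", "and", "of", "a", "is", "in")

# Invented example texts.
texts <- c("The future of energy is clean",
           "A plan to end malaria and polio")

# Lower-case, split on anything that is not a letter, drop empty strings.
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Keep only the words that are not in the stop word list.
content_words <- tokens[!tokens %in% toy_stop_words]
content_words
```

Only the words that carry meaning (future, energy, clean and so on) remain.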

You can then plot the most frequent words found in the tweets by following the simple steps below.

tweets %>% # gives you a bar chart of the most frequent words found in the tweets
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in the tweets of Bill Gates",
subtitle = "Stop words removed from the list")
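The counting step itself can also be sketched in base R with table() and sort(). The words below are invented. One thing worth knowing when comparing with the dplyr version: top_n() keeps ties, so it may return more rows than the number you ask for, whereas head() always cuts at exactly that number.

```r
# Invented words standing in for the cleaned tokens.
words <- c("health", "energy", "health", "climate",
           "energy", "health", "polio")

# Count each word and sort from most to least frequent.
word_counts <- sort(table(words), decreasing = TRUE)

# Keep the two most frequent words.
top_words <- head(word_counts, 2)
top_words

# Optional quick plot with base graphics (uncomment to draw):
# barplot(rev(top_words), horiz = TRUE, las = 1, xlab = "Count")
```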

6. SHOW THE MOST FREQUENTLY USED HASHTAGS

You can do the same analysis with the hashtags. In this case, you’ll want to use the hashtags variable from the rtweet package. A nice way to visualise these is using a word cloud as shown below.

Gates_tweets_organic$hashtags <- as.character(Gates_tweets_organic$hashtags)
Gates_tweets_organic$hashtags <- gsub("c\\(", "", Gates_tweets_organic$hashtags)
set.seed(1234)
wordcloud(Gates_tweets_organic$hashtags, min.freq=5, scale=c(3.5, .5), random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
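An alternative to converting the hashtags to text is to flatten them and count them explicitly. The sketch below assumes that the hashtags column is a list with one element per tweet (and NA when a tweet has no hashtag), which is how it appears to be stored given the cleaning step above. The data is invented.

```r
# Invented list standing in for Gates_tweets_organic$hashtags:
# one element per tweet, NA when the tweet has no hashtag.
hashtags <- list(c("COVID19", "vaccines"), NA, "climate",
                 c("climate", "energy"), NA, "COVID19", "climate")

# Flatten, drop the missing values and ignore differences in case.
all_tags <- unlist(hashtags)
all_tags <- tolower(all_tags[!is.na(all_tags)])

tag_counts <- sort(table(all_tags), decreasing = TRUE)
tag_counts

# The counts can then be passed to wordcloud() as words and frequencies
# (not run here, it needs the wordcloud package):
# wordcloud(names(tag_counts), as.vector(tag_counts), min.freq = 1)
```

Having the counts in a table also lets you report the exact numbers next to the word cloud.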

7. SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

Retweeting extensively from one account is usually not what someone looks for in a Twitter account. A helpful insight is therefore to monitor and understand from which accounts most retweets originate. The variable you’ll want to analyse here is retweet_screen_name and the process to visualise it is similar to the one described previously using word clouds.

set.seed(1234)
wordcloud(Gates_retweets$retweet_screen_name, min.freq=3, scale=c(2, .5), random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))

8. PERFORM A SENTIMENT ANALYSIS OF THE TWEETS

Finally, you may want to add a sentiment analysis at the end of your Twitter Analytics Report. This is easy to do with the package “syuzhet” and allows you to further deepen your analysis by grasping the tone of the tweets. No one likes a Twitter account that only spreads angry or sad tweets. Capturing the tone of your tweets and how they balance out is a good indication of your account’s performance.

library(syuzhet)
# Converting tweets to ASCII to tackle strange characters
tweets <- iconv(tweets, from="UTF-8", to="ASCII", sub="")
# removing retweets, in case needed
tweets <-gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweets)
# removing mentions, in case needed
tweets <-gsub("@\\w+","",tweets)
ew_sentiment<-get_nrc_sentiment((tweets))
sentimentscores<-data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment"=rownames(sentimentscores),sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data=sentimentscores,aes(x=sentiment,y=Score))+
geom_bar(aes(fill=sentiment),stat = "identity")+
xlab("Sentiments")+ylab("Scores")+
ggtitle("Total sentiment based on scores")+
theme_minimal()+
# Applied after theme_minimal(), which would otherwise bring the legend back
theme(legend.position="none")
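If you are curious about what a lexicon-based sentiment score means, the idea can be sketched in a few lines of base R: each word is looked up in a dictionary that maps words to emotions, and the matches are counted. The lexicon below is a toy one I invented for illustration; it is not the NRC lexicon used by syuzhet, which is far larger, and the texts are invented too.

```r
# Toy lexicon, invented for illustration; not the NRC lexicon.
toy_lexicon <- data.frame(
  word = c("happy", "great", "hope", "sad", "angry", "fear"),
  sentiment = c("joy", "joy", "anticipation", "sadness", "anger", "fear"),
  stringsAsFactors = FALSE
)

# Invented example texts.
texts <- c("great news and hope for the future",
           "sad and angry about the delay",
           "happy to share great results")

# Split into lower-case words and look each one up in the lexicon.
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
matched <- toy_lexicon$sentiment[match(tokens, toy_lexicon$word)]

# Count how often each sentiment was matched.
sentiment_scores <- table(matched[!is.na(matched)])
sentiment_scores
```

Words that are not in the lexicon simply do not contribute, which is worth remembering when interpreting low scores.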

In this article, I aimed to show how to extract and analyse tweets using the free-to-use programming software R. I hope you found this guide helpful to build your own Twitter Analytics Report that includes:

  • Showing which tweets worked best and which didn’t
  • The ratio of organic tweets/replies/retweets, the time of tweet publication and the platforms from which tweets are published. These are all insights regarding the tweeting behaviour.
  • The most frequent words used in the tweets, hashtags, from which accounts most retweets originate and a sentiment analysis capturing the tone of the tweets. These are all insights on the content of the tweets.

I regularly write articles about Data Science and Natural Language Processing. Follow me on Twitter or Medium to check out more articles like these or simply to keep updated about the next ones!



Article information

Author: Duncan Muller
