Lesson 5. Analyze The Sentiment of Tweets From Twitter Data and Tweepy in Python


Learning Objectives

After completing this tutorial, you will be able to:

  • Explain how text data can be analyzed to identify sentiments (i.e. attitudes) toward a particular subject.
  • Analyze sentiments in tweets.

What You Need

You will need a computer with internet access to complete this lesson.

Sentiment Analysis

Sentiment analysis is a method of identifying attitudes in text data about a subject of interest. It is scored using polarity values that range from 1 to -1. Values closer to 1 indicate more positivity, while values closer to -1 indicate more negativity.

In this lesson, you will apply sentiment analysis to Twitter data using the Python package textblob. You will calculate a polarity value for each tweet on a given subject and then plot these values in a histogram to identify the overall sentiment toward the subject of interest.

Begin by reviewing how to search for and clean tweets that you will use to analyze sentiments in Twitter data.

test - force trigger rebuild

import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import collections

import tweepy as tw
import nltk
from nltk.corpus import stopwords
import re
import networkx
from textblob import TextBlob

import warnings
warnings.filterwarnings("ignore")

sns.set(font_scale=1.5)
sns.set_style("whitegrid")

Remember to define your keys:

consumer_key= 'yourkeyhere'
consumer_secret= 'yourkeyhere'
access_token= 'yourkeyhere'
access_token_secret= 'yourkeyhere'
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

Using what you have learned in the previous lessons, grab and clean up 1000 recent tweets. For this analysis, you only need to remove URLs from the tweets.

def remove_url(txt):
    """Replace URLs found in a text string with nothing 
    (i.e. it will remove the URL from the string).

    Parameters
    ----------
    txt : string
        A text string that you want to parse and remove urls.

    Returns
    -------
    The same txt string with url's removed.
    """

    return " ".join(re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())
# Create a custom search term and define the number of tweets
search_term = "#climate+change -filter:retweets"

tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2018-11-01').items(1000)

# Remove URLs
tweets_no_urls = [remove_url(tweet.text) for tweet in tweets]

Analyze Sentiments in Tweets

You can use the Python package textblob to calculate the polarity values of individual tweets on climate change.

Begin by creating textblob objects, which assigns polarity values to the tweets. You can identify the polarity value using the attribute .polarity of texblob object.

# Create textblob objects of the tweets
sentiment_objects = [TextBlob(tweet) for tweet in tweets_no_urls]

sentiment_objects[0].polarity, sentiment_objects[0]
(0.0,
 TextBlob("What does the virus do for the outlook and fond hope of climate change alarmists who want to get everyone travellin"))

You can apply list comprehension to create a list of the polarity values and text for each tweet, and then create a Pandas Dataframe from the list.

# Create list of polarity valuesx and tweet text
sentiment_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in sentiment_objects]

sentiment_values[0]
[0.0,
 'What does the virus do for the outlook and fond hope of climate change alarmists who want to get everyone travellin']
# Create dataframe containing the polarity value and tweet text
sentiment_df = pd.DataFrame(sentiment_values, columns=["polarity", "tweet"])

sentiment_df.head()
polaritytweet
00.0What does the virus do for the outlook and fon...
10.6In addition to COVID19 the world also faces an...
20.0Heres the state of our climate the news is no ...
30.0BBC News Climate change Warming clips the nigh...
40.2This is one to read now and follow up on Clima...

These polarity values can be plotted in a histogram, which can help to highlight in the overall sentiment (i.e. more positivity or negativity) toward the subject.

fig, ax = plt.subplots(figsize=(8, 6))

# Plot histogram of the polarity values
sentiment_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1],
             ax=ax,
             color="purple")

plt.title("Sentiments from Tweets on Climate Change")
plt.show()
This plot displays a histogram of polarity values for tweets on climate change.
This plot displays a histogram of polarity values for tweets on climate change.

To get a better visual of the polarit values, it can be helpful to remove the polarity values equal to zero and create a break in the histogram at zero.

# Remove polarity values equal to zero
sentiment_df = sentiment_df[sentiment_df.polarity != 0]
fig, ax = plt.subplots(figsize=(8, 6))

# Plot histogram with break at zero
sentiment_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1],
             ax=ax,
             color="purple")

plt.title("Sentiments from Tweets on Climate Change")
plt.show()
This plot displays a revised histogram of polarity values for tweets on climate change. For this histogram, polarity values equal to zero have been removed, and a break has been added at zero, to better highlight the distribution of polarity values.
This plot displays a revised histogram of polarity values for tweets on climate change. For this histogram, polarity values equal to zero have been removed, and a break has been added at zero, to better highlight the distribution of polarity values.

What does the histogram of the polarity values tell you about sentiments in the tweets gathered from the search “#climate+change -filter:retweets”? Are they more positive or negative?

Next, explore a new topic, the 2018 Camp Fire in California.

Begin by searching for the tweets and combining the cleaning of the data (i.e. removing URLs) with the creation of the textblob objects.

search_term = "#CampFire -filter:retweets"

tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2018-09-23').items(1000)

# Remove URLs and create textblob object for each tweet
all_tweets_no_urls = [TextBlob(remove_url(tweet.text)) for tweet in tweets]

all_tweets_no_urls[:5]
[TextBlob("No this isnt a dream nor an April Fools joke This is what I entirely thought of the W branding in general"),
 TextBlob("FREE checklist 19 ways to connect with young people during this time from SearchInstitute"),
 TextBlob("Its really tough to think we have to do something this hard yet again But we will campfire Paradisestrong"),
 TextBlob("SalesmanCalendar Sample Vintage 1956 Boys at Campfire FREE USA SHIP"),
 TextBlob("For the first day of NorthlandScouts CampAtHome initiative weve got a virtual campfire by Anna from the 93rd N")]

Then, you can create the Pandas Dataframe of the polarity values and plot the histogram for the Camp Fire tweets, just like you did for the climate change data.

# Calculate polarity of tweets
wild_sent_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in all_tweets_no_urls]

# Create dataframe containing polarity values and tweet text
wild_sent_df = pd.DataFrame(wild_sent_values, columns=["polarity", "tweet"])
wild_sent_df = wild_sent_df[wild_sent_df.polarity != 0]

wild_sent_df.head()
polaritytweet
00.025000No this isnt a dream nor an April Fools joke T...
10.250000FREE checklist 19 ways to connect with young p...
2-0.340278Its really tough to think we have to do someth...
30.400000SalesmanCalendar Sample Vintage 1956 Boys at C...
40.250000For the first day of NorthlandScouts CampAtHom...
fig, ax = plt.subplots(figsize=(8, 6))

wild_sent_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1],
        ax=ax, color="purple")

plt.title("Sentiments from Tweets on the Camp Fire")
plt.show()
This plot displays a histogram of polarity values for tweets on the Camp Fire in California. For this histogram, polarity values equal to zero have been removed and a break has been added at zero, to better highlight the distribution of polarity values.
This plot displays a histogram of polarity values for tweets on the Camp Fire in California. For this histogram, polarity values equal to zero have been removed and a break has been added at zero, to better highlight the distribution of polarity values.

Based on this histogram, would you say that the sentiments from the Camp Fire tweets are more positive or negative?

Leave a Comment