
Lesson 5. Analyze Sentiments Using Twitter Data and Tweepy in Python


Learning Objectives

After completing this tutorial, you will be able to:

  • Explain how text data can be analyzed to identify sentiments (i.e. attitudes) toward a particular subject.
  • Analyze sentiments in tweets.

What You Need

You will need a computer with internet access to complete this lesson.

Sentiment Analysis

Sentiment analysis is a method of identifying attitudes in text data about a subject of interest. It is scored using polarity values that range from 1 to -1. Values closer to 1 indicate more positivity, while values closer to -1 indicate more negativity.

In this lesson, you will apply sentiment analysis to Twitter data using the Python package textblob. You will calculate a polarity value for each tweet on a given subject and then plot these values in a histogram to identify the overall sentiment toward the subject of interest.
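To make the idea of a polarity score concrete, here is a minimal sketch of lexicon-based scoring in plain Python. It is not the algorithm that textblob uses. The word list, the weights, and the function name `toy_polarity` are all made up for illustration. It only shows how averaging per-word weights yields a value between -1 and 1.

```python
# Toy lexicon: hypothetical words and weights, chosen only for illustration
TOY_LEXICON = {
    "good": 0.7,
    "great": 0.8,
    "happy": 0.8,
    "bad": -0.7,
    "terrible": -1.0,
    "sad": -0.5,
}


def toy_polarity(text):
    """Average the weights of known words; return 0.0 if none are found."""
    words = text.lower().split()
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


examples = [
    "a great and happy day",   # only positive words are matched
    "a terrible and sad day",  # only negative words are matched
    "a day in january",        # no words from the lexicon
]

for sentence in examples:
    print(sentence, "->", round(toy_polarity(sentence), 2))
```

Because every weight lies between -1 and 1, the average does too. Text with no known words scores 0, which is one reason many tweets can end up with a polarity of exactly zero.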

Begin by reviewing how to search for and clean tweets that you will use to analyze sentiments in Twitter data.

import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import collections

import tweepy as tw
import nltk
from nltk.corpus import stopwords
import re
import networkx
from textblob import TextBlob

import warnings
warnings.filterwarnings("ignore")

sns.set(font_scale=1.5)
sns.set_style("whitegrid")

Remember to define your keys:

consumer_key= 'yourkeyhere'
consumer_secret= 'yourkeyhere'
access_token= 'yourkeyhere'
access_token_secret= 'yourkeyhere'
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

Using what you have learned in the previous lessons, grab and clean up 1000 recent tweets. For this analysis, you only need to remove URLs from the tweets.

def remove_url(txt):
    """Replace URLs found in a text string with nothing 
    (i.e. it will remove the URL from the string).

    Parameters
    ----------
    txt : string
        A text string that you want to parse and remove urls.

    Returns
    -------
    The same txt string with url's removed.
    """

    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

# Create a custom search term and define the number of tweets
search_term = "#climate+change -filter:retweets"

# Note: in Tweepy 4.0 and later, api.search is named api.search_tweets
tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2018-11-01').items(1000)

# Remove URLs
tweets_no_urls = [remove_url(tweet.text) for tweet in tweets]
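It can help to try `remove_url` on a small string before applying it to real tweets. The sketch below repeats the function from above so that it runs on its own. The sample strings are made up for illustration. Note that the regular expression removes every character that is not a letter, digit, space, or tab, in addition to URLs, so punctuation, hashtag symbols, and line breaks also disappear.

```python
import re


def remove_url(txt):
    """Remove URLs (and any non-alphanumeric characters) from a string."""
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())


# Hypothetical tweet text, made up for illustration
sample = "Check this out: https://example.com/page #climate!"
cleaned = remove_url(sample)
print(cleaned)

# A line break is not a letter, digit, space, or tab, so it is removed
# and the words on either side of it are joined together
multiline = "dry hands\nravaged our land"
joined = remove_url(multiline)
print(joined)
```

The second case is a likely explanation for fused words such as "handsravaged" in cleaned tweets: the original tweet probably had a line break between the two words.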

Analyze Sentiments in Tweets

You can use the Python package textblob to calculate the polarity values of individual tweets on climate change.

Begin by creating textblob objects, which assign polarity values to the tweets. You can access the polarity value using the .polarity attribute of a textblob object.

# Create textblob objects of the tweets
sentiment_objects = [TextBlob(tweet) for tweet in tweets_no_urls]

sentiment_objects[0].polarity, sentiment_objects[0]
(-0.06666666666666665,
 TextBlob("Januarys dry handsravaged our landturning timberto ashsinew tosmokelivelihood nogoodonly words some holl"))

You can apply list comprehension to create a list of the polarity values and text for each tweet, and then create a Pandas Dataframe from the list.

# Create list of polarity values and tweet text
sentiment_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in sentiment_objects]

sentiment_values[0]
[-0.06666666666666665,
 'Januarys dry handsravaged our landturning timberto ashsinew tosmokelivelihood nogoodonly words some holl']

# Create dataframe containing the polarity value and tweet text
sentiment_df = pd.DataFrame(sentiment_values, columns=["polarity", "tweet"])

sentiment_df.head()
   polarity                                              tweet
0 -0.066667  Januarys dry handsravaged our landturning timb...
1  0.050000  Quiet Australians speak up on climate change a...
2  0.100000  BBC News Greta Thunberg seeks Africa climate c...
3  0.100000  Climate change calls for action not adaptation...
4  0.000000  science Links to videos about CLIMATE CHANGE a...

These polarity values can be plotted in a histogram, which can help to highlight the overall sentiment (i.e. more positivity or negativity) toward the subject.

fig, ax = plt.subplots(figsize=(8, 6))

# Plot histogram of the polarity values
sentiment_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1],
             ax=ax,
             color="purple")

plt.title("Sentiments from Tweets on Climate Change")
plt.show()
This plot displays a histogram of polarity values for tweets on climate change.

To get a better visual of the polarity values, it can be helpful to remove the polarity values equal to zero and create a break in the histogram at zero.

# Remove polarity values equal to zero
sentiment_df = sentiment_df[sentiment_df.polarity != 0]

fig, ax = plt.subplots(figsize=(8, 6))

# Plot histogram with break at zero
sentiment_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1],
             ax=ax,
             color="purple")

plt.title("Sentiments from Tweets on Climate Change")
plt.show()
This plot displays a revised histogram of polarity values for tweets on climate change. For this histogram, polarity values equal to zero have been removed, and a break has been added at zero, to better highlight the distribution of polarity values.
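The choice of bin edges controls what a histogram can reveal. The sketch below counts values per bin in plain Python for the two sets of edges used in this lesson. The polarity values are made up for illustration, and `count_per_bin` is a hypothetical helper, not part of pandas or matplotlib. It assumes the convention NumPy uses for histograms: each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
from bisect import bisect_right


def count_per_bin(values, edges):
    """Count values per bin: [left, right) for all bins, [left, right] for the last."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if value < edges[0] or value > edges[-1]:
            continue  # skip values outside the range covered by the bins
        # Index of the bin whose left edge is the largest edge <= value;
        # clamp so that a value equal to the last edge falls in the last bin
        index = min(bisect_right(edges, value) - 1, len(counts) - 1)
        counts[index] += 1
    return counts


# Made-up polarity values for illustration
polarities = [-0.6, -0.1, 0.0, 0.0, 0.1, 0.2, 0.6]

# One wide center bin from -0.25 to 0.25: slightly negative, neutral,
# and slightly positive values are all counted together
wide_center = [-1, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1]
counts_wide = count_per_bin(polarities, wide_center)
print(counts_wide)

# Zeros removed and an edge added at 0.0: negative and positive
# values now fall on opposite sides of the break
nonzero = [value for value in polarities if value != 0]
break_at_zero = [-1, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1]
counts_break = count_per_bin(nonzero, break_at_zero)
print(counts_break)
```

With the wide center bin, five of the seven values land in a single bar. With the break at zero, the same nonzero values split into one slightly negative and two slightly positive counts.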

What does the histogram of the polarity values tell you about sentiments in the tweets gathered from the search “#climate+change -filter:retweets”? Are they more positive or negative?

Next, explore a new topic, the 2018 Camp Fire in California.

Begin by searching for the tweets and combining the cleaning of the data (i.e. removing URLs) with the creation of the textblob objects.

search_term = "#CampFire -filter:retweets"

tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2018-09-23').items(1000)

# Remove URLs and create textblob object for each tweet
all_tweets_no_urls = [TextBlob(remove_url(tweet.text)) for tweet in tweets]

all_tweets_no_urls[:5]
[TextBlob("Grab a tripod hanging cooking pot before your next camping trip Go to Yazoos Outdoor World to get your camping ki"),
 TextBlob("Colorful Campfire Desert Sands Vintage Trailer RV Park"),
 TextBlob("Wishing everyone a happy Friday evening with good friends juniorretreat campfire smores roastingmarshmellows"),
 TextBlob("Nice campfire tonightcampfire campfires"),
 TextBlob("Solo Overnight With My NEW Emergency Kit and Campfire BBQ Beans and Weenies Check it Out Now at Corporals Corner o")]

Then, you can create the Pandas Dataframe of the polarity values and plot the histogram for the Camp Fire tweets, just like you did for the climate change data.

# Calculate polarity of tweets
wild_sent_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in all_tweets_no_urls]

# Create dataframe containing polarity values and tweet text
wild_sent_df = pd.DataFrame(wild_sent_values, columns=["polarity", "tweet"])
wild_sent_df = wild_sent_df[wild_sent_df.polarity != 0]

wild_sent_df.head()
   polarity                                              tweet
1  0.300000  Colorful Campfire Desert Sands Vintage Trailer...
2  0.750000  Wishing everyone a happy Friday evening with g...
3  0.600000  Nice campfire tonightcampfire campfires
4  0.136364  Solo Overnight With My NEW Emergency Kit and C...
5  0.633333  Gorgeous sunset and lovely fire Life is good s...

fig, ax = plt.subplots(figsize=(8, 6))

wild_sent_df.hist(bins=[-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1],
        ax=ax, color="purple")

plt.title("Sentiments from Tweets on the Camp Fire")
plt.show()
This plot displays a histogram of polarity values for tweets on the Camp Fire in California. For this histogram, polarity values equal to zero have been removed and a break has been added at zero, to better highlight the distribution of polarity values.

Based on this histogram, would you say that the sentiments from the Camp Fire tweets are more positive or negative?
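A few summary numbers can back up what you read off a histogram. The sketch below is one possible way to do this in plain Python. The function name `summarize_polarity` and the sample values are hypothetical. With the data from this lesson, you could pass in a list such as `wild_sent_df["polarity"].tolist()` instead.

```python
def summarize_polarity(values):
    """Return counts of positive, negative, and neutral values, plus the mean."""
    positive = sum(1 for value in values if value > 0)
    negative = sum(1 for value in values if value < 0)
    neutral = len(values) - positive - negative
    mean = sum(values) / len(values) if values else 0.0
    return {
        "positive": positive,
        "negative": negative,
        "neutral": neutral,
        "mean": mean,
    }


# Made-up polarity values for illustration
polarities = [-0.6, -0.1, 0.0, 0.0, 0.1, 0.2, 0.6]

summary = summarize_polarity(polarities)
print(summary)
```

More positive than negative values, together with a mean above zero, would suggest that the tweets lean positive overall. Keep in mind that a search term can match unrelated tweets, so it is worth reading a sample of the text before drawing conclusions.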
