Lesson 2. Get and Work With Twitter Data in Python Using Tweepy


Learning Objectives

After completing this tutorial, you will be able to:

  • Connect to the Twitter RESTful API to access Twitter data with Python.
  • Generate custom queries that download tweet data into Python using Tweepy.
  • Access tweet metadata including users in Python using Tweepy.

What You Need

You will need a computer with internet access to complete this lesson.

In this lesson, you will explore analyzing social media data accessed from Twitter using Python. You will use the Twitter RESTful API to access data about both Twitter users and what they are tweeting about.

Getting Started

To get started, you’ll need to do the following things:

  1. Set up a Twitter account if you don’t have one already.
  2. Using your Twitter account, apply for Developer Access and then create an application that generates the API credentials you will use to access Twitter from Python.
  3. Import the tweepy package.

Once you’ve done these things, you are ready to begin querying Twitter’s API to see what you can learn about tweets!

Set up Twitter App

After you have applied for Developer Access, you can create an application in Twitter that you can use to access tweets. Make sure you already have a Twitter account.

To create your application, you can follow a useful tutorial from rtweet, which includes a section on Create an application that is not specific to R:

TUTORIAL: How to setup a Twitter application using your Twitter account

NOTE: you will need to provide a phone number that can receive text messages (e.g. mobile or Google phone number) to Twitter to verify your use of the API.

A heat map of the distribution of tweets across the Denver / Boulder region. Source: socialmatt.com

Access Twitter API in Python

Once you have your Twitter app set up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.

import tweepy as tw
import pandas as pd

To access the Twitter API, you will need four things from your Twitter app page. These keys are located in your Twitter app settings in the Keys and Access Tokens tab.

  • consumer key
  • consumer secret key
  • access token key
  • access token secret key

Do not share these with anyone else because these values are specific to your app.

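First, you will need to define your keys and use them to authenticate:

# Replace each placeholder with the value from your Twitter app settings
consumer_key = 'yourkeyhere'
consumer_secret = 'yourkeyhere'
access_token = 'yourkeyhere'
access_token_secret = 'yourkeyhere'

# Authenticate, then create the API object used in the rest of this lesson
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)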
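Because these keys should not be shared, you may prefer not to type them directly into a script or notebook that other people can see. The following is a minimal sketch of one alternative: reading the four values from environment variables. The variable names and the load_credentials helper are hypothetical and are not part of tweepy or of this lesson.

```python
import os

# Hypothetical variable names; use whatever names you exported in your shell.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a tuple, or raise if any are missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
```

With the variables set, you could unpack the result as consumer_key, consumer_secret, access_token, access_token_secret = load_credentials() and then authenticate exactly as shown above.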

Send a Tweet

You can send tweets using your API access. Note that your tweet needs to be 280 characters or fewer.

# Post a tweet from Python
api.update_status("Look, I'm tweeting from #Python in my #earthanalytics class! @EarthLabCU")
# Your tweet has been posted!
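Since a tweet has a length limit, it can help to check a message before you try to post it. The sketch below uses Python's len() as a rough check. The fits_in_tweet helper is hypothetical, and Twitter's own counting rules treat some content (such as links) differently, so treat the result as an approximation.

```python
TWEET_LIMIT = 280  # the limit stated in this lesson


def fits_in_tweet(text, limit=TWEET_LIMIT):
    """Rough check: True if the text has at most `limit` characters."""
    return len(text) <= limit


message = "Look, I'm tweeting from #Python in my #earthanalytics class! @EarthLabCU"

# Print the character count and whether the message passes the rough check
print(len(message), fits_in_tweet(message))
```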

Search Twitter for Tweets

Now you are ready to search Twitter for recent tweets! Start by finding recent tweets that use the #wildfires hashtag. You will use the .Cursor() method to get an object that yields tweets containing the hashtag #wildfires.

To create this query, you will define the:

  1. search term, in this case #wildfires
  2. start date of your search

Remember that the standard Twitter search API only returns tweets from roughly the past week, so you cannot dig far into the history.

# Define the search term and the date_since date as variables
search_words = "#wildfires"
date_since = "2018-11-16"
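A hard-coded start date goes stale quickly, because the API only reaches back a limited time. As a sketch, you could compute the start date relative to today instead. The date_days_ago helper is hypothetical, and the seven-day window is an illustrative choice rather than a value taken from this lesson.

```python
from datetime import date, timedelta


def date_days_ago(days, today=None):
    """Return the date `days` before `today` as a 'YYYY-MM-DD' string."""
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()


# Seven days is an illustrative choice, not a value from this lesson
date_since = date_days_ago(7)
print(date_since)
```

The result has the same "YYYY-MM-DD" format as the date_since string defined above.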

Below you use .Cursor() to search Twitter for tweets containing the search term #wildfires. You can restrict the number of tweets returned by specifying a number in the .items() method. For example, .items(5) returns five of the most recent tweets.

# Collect tweets
tweets = tw.Cursor(api.search,
              q=search_words,
              lang="en",
              since=date_since).items(5)
tweets
<tweepy.cursor.ItemIterator at 0x7fafc296e400>

.Cursor() returns an object that you can iterate or loop over to access the data collected. Each item in the iterator has various attributes that you can access to get information about each tweet including:

  1. the text of the tweet
  2. who sent the tweet
  3. the date the tweet was sent

and more. The code below loops through the object and prints the text associated with each tweet.

# Collect tweets
tweets = tw.Cursor(api.search,
              q=search_words,
              lang="en",
              since=date_since).items(5)

# Iterate on tweets
for tweet in tweets:
    print(tweet.text)
2/2 provide forest products to local mills, provide jobs to local communities, and improve the ecological health of… https://t.co/XemzXvyPyX
1/2 Obama's Forest Service Chief in 2015 --&gt;"Treating these acres through commercial thinning, hazardous fuels remo… https://t.co/01obvjezQW
RT @EnviroEdgeNews: US-#Volunteers care for abandoned #pets found in #California #wildfires; #Dogs, #cats, [#horses], livestock get care an…
RT @FairWarningNews: The wildfires that ravaged CA have been contained, but the health impacts from the resulting air pollution will be sev…
RT @chiarabtownley: If you know anybody who has been affected by the wildfires, please refer them to @awarenow_io It is one of the companie…
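The loop above prints only the text of each tweet. The sketch below shows how the three attributes listed earlier (text, sender, and date) might be combined into one line per tweet. So that it runs without API credentials, it uses hand-made stand-in objects rather than real search results. The summarize helper is hypothetical, and created_at is assumed to be the attribute that holds the date a tweet was sent.

```python
from datetime import datetime
from types import SimpleNamespace

# Stand-in objects for illustration only; real tweets come from tw.Cursor(...)
fake_tweets = [
    SimpleNamespace(
        text="Example tweet about #wildfires",
        created_at=datetime(2018, 11, 20, 15, 30),
        user=SimpleNamespace(screen_name="example_user"),
    ),
]


def summarize(tweet):
    """Return one line with the date, the sender, and the text of a tweet."""
    return f"{tweet.created_at:%Y-%m-%d} @{tweet.user.screen_name}: {tweet.text}"


for tweet in fake_tweets:
    print(summarize(tweet))
```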

The above approach uses a standard for loop. However, this is an excellent place to use a Python list comprehension. A list comprehension provides an efficient way to collect object elements contained within an iterator as a list.

# Collect tweets
tweets = tw.Cursor(api.search,
                       q=search_words,
                       lang="en",
                       since=date_since).items(5)

# Collect a list of tweets
[tweet.text for tweet in tweets]
['2/2 provide forest products to local mills, provide jobs to local communities, and improve the ecological health of… https://t.co/XemzXvyPyX',
 '1/2 Obama\'s Forest Service Chief in 2015 --&gt;"Treating these acres through commercial thinning, hazardous fuels remo… https://t.co/01obvjezQW',
 'RT @EnviroEdgeNews: US-#Volunteers care for abandoned #pets found in #California #wildfires; #Dogs, #cats, [#horses], livestock get care an…',
 'RT @FairWarningNews: The wildfires that ravaged CA have been contained, but the health impacts from the resulting air pollution will be sev…',
 'RT @chiarabtownley: If you know anybody who has been affected by the wildfires, please refer them to @awarenow_io It is one of the companie…']
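Each example in this lesson rebuilds the cursor before using it. A likely reason is that the object returned by .items() is an iterator, and a Python iterator can generally be consumed only once. The sketch below demonstrates that behavior with a plain Python iterator standing in for the cursor, so it makes no API calls.

```python
# A plain Python iterator standing in for the tweepy cursor (no API call here)
fake_cursor = iter(["tweet one", "tweet two", "tweet three"])

first_pass = [text for text in fake_cursor]
second_pass = [text for text in fake_cursor]  # the iterator is already used up

print(first_pass)   # ['tweet one', 'tweet two', 'tweet three']
print(second_pass)  # []

# Keeping the list from the first pass lets you reuse the results freely
print(len(first_pass), first_pass[0])
```

If you plan to use the same tweets more than once, storing them in a list after the first pass avoids a second request to Twitter.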

To Keep or Remove Retweets

A retweet is when someone shares someone else’s tweet. It is similar to sharing on Facebook. Sometimes you may want to remove retweets because they contain duplicate content that might skew your analysis, for example if you are looking at word frequency. Other times, you may want to keep retweets.

Below you ignore all retweets by adding -filter:retweets to your query. The Twitter API documentation has information on other ways to customize your queries.

new_search = search_words + " -filter:retweets"
new_search
'#wildfires -filter:retweets'
tweets = tw.Cursor(api.search,
                       q=new_search,
                       lang="en",
                       since=date_since).items(5)

[tweet.text for tweet in tweets]
['2/2 provide forest products to local mills, provide jobs to local communities, and improve the ecological health of… https://t.co/XemzXvyPyX',
 '1/2 Obama\'s Forest Service Chief in 2015 --&gt;"Treating these acres through commercial thinning, hazardous fuels remo… https://t.co/01obvjezQW',
 '"Start packing up!" Video shows how gender-reveal stunt sparked wildfire https://t.co/gvfNLI8NbO #Heatwave #Wildfires',
 'The pictures and stories coming out of the California #wildfires are heartbreaking, but there are plenty of good st… https://t.co/s4D7JB3VGu',
 'The wildfires that ravaged CA have been contained, but the health impacts from the resulting air pollution will be… https://t.co/bSdg9uHkqH']
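If you have already collected tweets without the filter, you could also drop retweets afterward in Python. The sketch below relies on the "RT @" prefix visible in the earlier output. This is only a heuristic, the looks_like_retweet helper is hypothetical, and the sample texts are shortened versions of tweets shown in this lesson.

```python
def looks_like_retweet(text):
    """Heuristic: retweets in the output above start with 'RT @'."""
    return text.startswith("RT @")


sample_texts = [
    "RT @FairWarningNews: The wildfires that ravaged CA have been contained",
    "The pictures and stories coming out of the California #wildfires are heartbreaking",
]

# Keep only the texts that do not look like retweets
originals = [text for text in sample_texts if not looks_like_retweet(text)]
print(originals)
```

Filtering in the query, as shown above, has the advantage that retweets do not count against the number of items you request.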

Who is Tweeting About Wildfires?

You can access a wealth of information associated with each tweet. Below is an example of accessing the users who are sending the tweets related to #wildfires and their locations. Note that user locations are manually entered into Twitter by the user. Thus, you will see a lot of variation in the format of this value.

  • tweet.user.screen_name provides the Twitter handle of the user who sent each tweet.
  • tweet.user.location provides the location that the user entered in their profile.

You can explore the other attributes available on each tweet by typing tweet. and pressing the Tab key to list them.

tweets = tw.Cursor(api.search, 
                           q=new_search,
                           lang="en",
                           since=date_since).items(5)


users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
[['TamaraHinton', 'Washington, DC'],
 ['TamaraHinton', 'Washington, DC'],
 ['robinsnewswire', "RT's Are FYI Purposes Only"],
 ['PublicityErika', 'Seattle area'],
 ['FairWarningNews', 'Los Angeles']]
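The same user can appear more than once in this list. As a small sketch of how you might answer the question of who is tweeting most, the code below counts tweets per user with collections.Counter. The list is copied from the output above so that the example runs without the API.

```python
from collections import Counter

# The list printed above, copied here so the example runs without the API
users_locs = [
    ["TamaraHinton", "Washington, DC"],
    ["TamaraHinton", "Washington, DC"],
    ["robinsnewswire", "RT's Are FYI Purposes Only"],
    ["PublicityErika", "Seattle area"],
    ["FairWarningNews", "Los Angeles"],
]

# Count how many tweets each screen name contributed
tweets_per_user = Counter(user for user, _location in users_locs)
print(tweets_per_user.most_common(1))  # [('TamaraHinton', 2)]
```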

Create a Pandas Dataframe From A List of Tweet Data

Once you have a list of items that you wish to work with, you can create a pandas dataframe that contains that data.

tweet_text = pd.DataFrame(data=users_locs,
                          columns=['user', 'location'])
tweet_text
              user                    location
0     TamaraHinton              Washington, DC
1     TamaraHinton              Washington, DC
2   robinsnewswire  RT's Are FYI Purposes Only
3   PublicityErika                Seattle area
4  FairWarningNews                 Los Angeles

Customizing Twitter Queries

As mentioned above, you can customize your Twitter search queries by following the Twitter API documentation.

For instance, if you search for climate+change, Twitter will return tweets that contain both of those words (in a row).

Note that the code below collects up to 1,000 tweets into a list and then uses Python slicing to display the first five.

new_search = "climate+change -filter:retweets"

tweets = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   since='2018-04-23').items(1000)

all_tweets = [tweet.text for tweet in tweets]
all_tweets[:5]
['“But as climate change is happening in real time, the practice of climate science...has never been at greater risk.… https://t.co/hIo1VtDnfW',
 '“Climate Change isn’t real.” https://t.co/hzY4vQ09xM',
 'BBC News - Climate change: CO2 emissions rising for first time in four years https://t.co/cxrzCQmmml',
 "Ooh LOL. Anderson Cooper - giving Trump's ignorance a whipping. https://t.co/wbivZx8YHX",
 '#NCA2018 makes it abundantly clear: the U.S. must do more to limit carbon emissions and combat the effects of clima… https://t.co/Xc81JouHQb']
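If you run several searches, assembling each query string by hand becomes repetitive. The sketch below wraps the two conventions used in this lesson (joining words with + and appending -filter:retweets) in a small helper. The build_query function is hypothetical and is not part of tweepy.

```python
def build_query(words, exclude_retweets=True):
    """Join search words with '+' and optionally append the retweet filter."""
    query = "+".join(words)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


print(build_query(["climate", "change"]))
# climate+change -filter:retweets

print(build_query(["#wildfires"], exclude_retweets=False))
# #wildfires
```

The returned string can be passed as the q argument in the same way as new_search above.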

In the next lesson, you will explore calculating word frequencies associated with tweets using Python.
