After completing this tutorial, you will be able to:
- Connect to the Twitter RESTful API to access Twitter data with Python.
- Generate custom queries that download tweet data into Python.
- Access tweet metadata, including users, in Python.
What You Need
You will need a computer with internet access to complete this lesson.
In this lesson, you will explore analyzing social media data accessed from Twitter using Python. You will use the Twitter RESTful API to access data about both Twitter users and what they are tweeting about.
To get started, you’ll need to do the following things:
- Set up a Twitter account if you don’t have one already.
- Using your Twitter account, apply for Developer Access and then create an application that will generate the API credentials that you will use to access Twitter from Python.
- Import the tweepy package.
Once you’ve done these things, you are ready to begin querying Twitter’s API to see what you can learn about tweets!
Set up Twitter App
After you have applied for Developer Access, you can create an application in Twitter that you can use to access tweets. Make sure you already have a Twitter account.
To create your application, you can follow a useful tutorial from rtweet, which includes a section on creating an application that is not specific to R.
NOTE: you will need to provide a phone number that can receive text messages (e.g. mobile or Google phone number) to Twitter to verify your use of the API.
Access Twitter API in Python
Once you have your Twitter app set up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.
import os
import tweepy as tw
import pandas as pd
To access the Twitter API, you will need 4 things from your Twitter App page. These keys are located in your Twitter app settings in the Keys and Access Tokens tab.
- consumer key
- consumer secret key
- access token key
- access token secret key
Do not share these with anyone else because these values are specific to your app.
First, you will need to define your keys and use them to authenticate:
consumer_key = 'yourkeyhere'
consumer_secret = 'yourkeyhere'
access_token = 'yourkeyhere'
access_token_secret = 'yourkeyhere'

auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
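Because these keys should not be shared, you may prefer not to type them directly into a script or notebook. The sketch below shows one possible approach: reading the four values from environment variables with the os module. The variable names (TWITTER_CONSUMER_KEY and so on) and the load_credentials helper are hypothetical and not part of tweepy; use whatever names you set on your own machine.

```python
import os

# Hypothetical variable names; they are an assumption of this sketch.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "yourkeyhere" for name in CREDENTIAL_VARS}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The values returned by a helper like this could then be passed to tw.OAuthHandler and set_access_token in place of the hard-coded strings.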
Send a Tweet
You can send tweets using your API access. Note that your tweet needs to be 280 characters or fewer.
# Post a tweet from Python
api.update_status("Look, I'm tweeting from #Python in my #earthanalytics class! @EarthLabCU")
# Your tweet has been posted!
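Before posting, it can be useful to check that a status fits within the 280-character limit mentioned above. The helper below is a minimal sketch; fits_in_tweet is a hypothetical name, and Python's len() is only an approximation because Twitter applies its own counting rules (for example, for links).

```python
MAX_TWEET_LENGTH = 280  # limit stated in this lesson


def fits_in_tweet(text, limit=MAX_TWEET_LENGTH):
    """Rough check: True if the text has at most `limit` characters."""
    return len(text) <= limit


status = "Look, I'm tweeting from #Python in my #earthanalytics class! @EarthLabCU"
print(len(status), fits_in_tweet(status))
```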
Search Twitter for Tweets
Now you are ready to search Twitter for recent tweets! Start by finding recent tweets that use the #wildfires hashtag. You will use the .Cursor method to get an object containing tweets that use the hashtag #wildfires.
To create this query, you will define:
- the search term, in this case #wildfires
- the start date of your search
Remember that the standard Twitter search API only returns recent tweets (roughly the past 7 days), so you cannot dig into the history too far.
# Define the search term and the date_since date as variables
search_words = "#wildfires"
date_since = "2018-11-16"
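Since only recent tweets are available, a hard-coded start date goes stale quickly. One option is to compute the date relative to today. The sketch below uses the standard datetime module; days_ago is a hypothetical helper, and the 7-day window is an illustrative choice rather than a value taken from this lesson.

```python
from datetime import date, timedelta


def days_ago(n, today=None):
    """Return the date n days before `today` as a YYYY-MM-DD string."""
    today = today or date.today()
    return (today - timedelta(days=n)).isoformat()


# Start the search window 7 days back from the current date
date_since = days_ago(7)
print(date_since)
```

The optional today argument makes the helper easy to check against a known date: days_ago(7, date(2018, 11, 23)) gives "2018-11-16".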
Below you use .Cursor() to search Twitter for tweets containing the search term #wildfires. You can restrict the number of tweets returned by specifying a number in the .items() method. For example, .items(5) will return 5 of the most recent tweets.
# Collect tweets
# Note: api.search is the Tweepy 3.x name; Tweepy 4.x renamed it to api.search_tweets
tweets = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   since=date_since).items(5)
tweets
<tweepy.cursor.ItemIterator at 0x7fafc296e400>
.Cursor() returns an object that you can iterate or loop over to access the data collected. Each item in the iterator has various attributes that you can access to get information about each tweet including:
- the text of the tweet
- who sent the tweet
- the date the tweet was sent
and more. The code below loops through the object and prints the text associated with each tweet.
# Collect tweets
tweets = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   since=date_since).items(5)

# Iterate and print tweets
for tweet in tweets:
    print(tweet.text)
2/2 provide forest products to local mills, provide jobs to local communities, and improve the ecological health of… https://t.co/XemzXvyPyX
1/2 Obama's Forest Service Chief in 2015 -->"Treating these acres through commercial thinning, hazardous fuels remo… https://t.co/01obvjezQW
RT @EnviroEdgeNews: US-#Volunteers care for abandoned #pets found in #California #wildfires; #Dogs, #cats, [#horses], livestock get care an…
RT @FairWarningNews: The wildfires that ravaged CA have been contained, but the health impacts from the resulting air pollution will be sev…
RT @chiarabtownley: If you know anybody who has been affected by the wildfires, please refer them to @awarenow_io It is one of the companie…
The above approach uses a standard for loop. However, this is an excellent place to use a Python list comprehension. A list comprehension provides an efficient way to collect object elements contained within an iterator as a list.
# Collect tweets
tweets = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   since=date_since).items(5)

# Collect a list of tweets
[tweet.text for tweet in tweets]
['Expert insight on how #wildfires impact our environment: https://t.co/sHg6PcC3R3', 'Lomakatsi crews join the firefight: \n\n#wildfires #smoke #firefighter\n\nhttps://t.co/DcI2uvmKQv', 'RT @rpallanuk: Current @PHE_uk #climate extremes bulletin: #Arctic #wildfires & Greenland melt, #drought in Australia/NSW; #flooding+#droug…', "RT @witzshared: And yet the lies continue. Can't trust a corporation this deaf dumb and blind -- PG&E tells court deferred #Maintenance did…", 'The #wildfires have consumed an area twice the size of Connecticut, and their smoke is reaching Seattle. Russia isn… https://t.co/SgoF6tds1s']
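Notice that each example above re-creates the Cursor before looping. An iterator can generally be consumed only once, so a second pass over the same object yields nothing. The sketch below illustrates this with a stand-in generator rather than a real tweepy object; fake_cursor_items and its sample texts are made up for the demonstration.

```python
from types import SimpleNamespace


def fake_cursor_items(texts):
    """Yield stand-in tweet objects one at a time, like a one-pass iterator."""
    for text in texts:
        yield SimpleNamespace(text=text)


tweets = fake_cursor_items(["first tweet", "second tweet", "third tweet"])

first_pass = [tweet.text for tweet in tweets]
second_pass = [tweet.text for tweet in tweets]  # the iterator is already used up

print(first_pass)   # three texts
print(second_pass)  # empty list
```

Storing the results in a list, as the list comprehension does, lets you reuse the collected data without querying again.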
To Keep or Remove Retweets
A retweet is when someone shares someone else’s tweet. It is similar to sharing in Facebook. Sometimes you may want to remove retweets as they contain duplicate content that might skew your analysis if you are only looking at word frequency. Other times, you may want to keep retweets.
Below you ignore all retweets by adding
-filter:retweets to your query. The Twitter API documentation has information on other ways to customize your queries.
new_search = search_words + " -filter:retweets"
new_search

tweets = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   since=date_since).items(5)

[tweet.text for tweet in tweets]
['@HARRISFAULKNER over 10% of a entire state (#Oregon) has been displaced due to #wildfires which is unprecedented, a… https://t.co/SJPyDw2vGZ', 'I left a small window open last night and the smoke from the outside #wildfires made our smoke alarm go off at 4 am… https://t.co/qj79wtXZ7o', '5 of the 10 biggest #wildfires in California history are burning right now.\n\nFossil fuels brought the… https://t.co/BqRZvnj7Ir', '#Wildfires are part of a vicious cycle: their #emissions fuel global heating, leading to ever-worse fires, which re… https://t.co/OA4UZoFbn8', 'This could be helpful if you need to evacuate!\n#wildfires #OregonIsBurning https://t.co/7FuHOPf4th']
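If you have already collected tweets without the query filter, you could also drop retweets locally. In the earlier output, retweeted text starts with "RT @", so the sketch below uses that prefix as a heuristic. This is an assumption about the text format, not an official retweet flag, and drop_retweets is a hypothetical helper. The sample texts are abridged from the earlier output.

```python
def drop_retweets(texts):
    """Keep only texts that do not start with the conventional 'RT @' prefix."""
    return [text for text in texts if not text.startswith("RT @")]


# Abridged examples based on the earlier search output
sample_texts = [
    "Expert insight on how #wildfires impact our environment",
    "RT @rpallanuk: Current @PHE_uk #climate extremes bulletin",
    "Lomakatsi crews join the firefight",
    "RT @witzshared: And yet the lies continue.",
]

originals = drop_retweets(sample_texts)
print(originals)
```

Filtering in the query itself, as shown above, is usually preferable because the retweets never count against the number of items you request.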
Who is Tweeting About Wildfires?
You can access a wealth of information associated with each tweet. Below is an example of accessing the users who are sending the tweets related to #wildfires and their locations. Note that user locations are manually entered into Twitter by the user. Thus, you will see a lot of variation in the format of this value.
tweet.user.screen_name provides the user's Twitter handle associated with each tweet.
tweet.user.location provides the location that the user entered.
You can experiment with other items available within each tweet by typing
tweet. and using the tab button to see all of the available attributes stored.
tweets = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   since=date_since).items(5)

users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
[['J___D___B', 'United States'], ['KelliAgodon', 'S E A T T L E ☮ 🏳️\u200d🌈'], ['jpmckinnie', 'Los Angeles, CA'], ['jxnova', 'Harlem, USA'], ['momtifa', 'Portland, Oregon, USA']]
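Because locations are free text, the same place can appear in several formats. The sketch below shows one simple way to tally locations after light normalization (collapsing whitespace and lowercasing) using collections.Counter. The normalize_location helper is hypothetical, and the last two sample values are made-up variants added only to show entries being merged.

```python
from collections import Counter


def normalize_location(location):
    """Collapse repeated whitespace and lowercase a free-text location."""
    return " ".join(location.split()).lower()


# The first three values come from the output above; the last two are
# made-up variants that differ only in case and spacing.
locations = [
    "United States",
    "Los Angeles, CA",
    "Portland, Oregon, USA",
    "portland,  oregon, USA ",
    "united states",
]

location_counts = Counter(normalize_location(loc) for loc in locations)
print(location_counts.most_common())
```

This kind of normalization only handles superficial differences; entries such as "Harlem, USA" and "New York" would still be counted separately.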
Pandas Dataframe From A List of Tweet Data
Once you have a list of items that you wish to work with, you can create a pandas dataframe that contains that data.
tweet_text = pd.DataFrame(data=users_locs,
                          columns=['user', "location"])
tweet_text
| |user|location|
|0|J___D___B|United States|
|1|KelliAgodon|S E A T T L E ☮ 🏳️🌈|
|2|jpmckinnie|Los Angeles, CA|
|3|jxnova|Harlem, USA|
|4|momtifa|Portland, Oregon, USA|
Customizing Twitter Queries
As mentioned above, you can customize your Twitter search queries by following the Twitter API documentation.
For instance, if you search for climate+change, Twitter will return tweets that contain both of those words, though not necessarily next to each other. To require the exact phrase, wrap it in double quotes within the query.
Note that the code below creates a list that can be queried using Python indexing to return the first five tweets.
new_search = "climate+change -filter:retweets"

tweets = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   since='2018-04-23').items(1000)

all_tweets = [tweet.text for tweet in tweets]
all_tweets[:5]
['They care so much for these bears, but climate change is altering their relationship with them. It’s getting so dan… https://t.co/D4wLNhhsdt', 'Prediction any celebrity/person in government that preaches about climate change probably is blackmailed… https://t.co/TM64QukGhy', '@RichardBurgon Brain washed and trying to do the same to others. Capitalism is ALL that "Climate Change" is about. https://t.co/GbNE87luVx', "We're in a climate crisis, but Canada's handing out billions to fossil fuel companies. Click to change this:… https://t.co/oQZXUfOWe8", 'Hundreds Of Starved Reindeer Found Dead In Norway, Climate Change Blamed - Forbes #nordic #norway https://t.co/9XLS8yi72l']
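Note that the fourth tweet in the output above contains both "climate" and "change", but not next to each other. If you only want tweets containing the exact phrase, one option is to filter the collected text locally. The sketch below does a case-insensitive substring check; contains_phrase is a hypothetical helper, and the sample texts are excerpts of the five tweets shown above.

```python
def contains_phrase(text, phrase):
    """Case-insensitive check for an exact phrase inside a tweet's text."""
    return phrase.lower() in text.lower()


# Excerpts of the five tweets shown in the output above
sample_tweets = [
    "They care so much for these bears, but climate change is altering",
    "Prediction any celebrity/person in government that preaches about climate change",
    'Capitalism is ALL that "Climate Change" is about.',
    "We're in a climate crisis, but Canada's handing out billions to fossil fuel companies. Click to change this:",
    "Hundreds Of Starved Reindeer Found Dead In Norway, Climate Change Blamed",
]

exact_matches = [t for t in sample_tweets if contains_phrase(t, "climate change")]
print(len(exact_matches), "of", len(sample_tweets), "contain the exact phrase")
```

With a real result list, the same comprehension could be applied to all_tweets.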
In the next lesson, you will explore calculating word frequencies associated with tweets using Python.