CC BY NC SA 4.0 (Unless specified)
Among the native APIs of social media platforms, Twitter's is one of the most generous. By contrast, Facebook allows only 200 queries per user per hour (although this limit can be exploited a bit) and offers no streaming API. Of the Twitter APIs, the streaming API is perhaps the one that provides researchers with the largest amount of the most useful data. We will cover the basics of the API, how to implement a streaming tool, and the pros and cons of the streaming API.
Introduction to Twitter Streaming API
The Twitter Streaming API, as the name suggests, gives you a stream of tweets as they are posted in real time. The official endpoint is named Filter Realtime Tweets, which reflects both its real-time nature and the ability to filter for the data researchers want.
There are two versions of the Streaming API: one in the Standard/Premium v1.1 Twitter API and another in the Standard/Premium/Enterprise v2.0 API. Since the v2.0 API imposes a monthly cap of 500,000 tweets pulled, the v1.1 API is the more suitable choice for researchers who want data on the scale of hundreds of millions of tweets, so that is the one we will discuss in this article.
Alternatives to Twitter API
Although Twitter gives a great deal of leeway to researchers and those who want to do social engineering research, it still places substantial limits on what data we are allowed to collect. For example, historical data older than 7 days cannot be retrieved automatically in bulk unless a large amount of money is paid (premium access can cost 2,000 USD per month for 5,000 queries per month). The legacy data API allows only limited retrieval; some endpoints allow up to 90,000 tweets per 15 minutes, or about 8M tweets per day. That sounds like a lot, but you first need to obtain the IDs of the tweets you want to retrieve. This process is called hydration, and it is usually used when working with public Twitter datasets. To maximize the amount of data you can acquire from Twitter, you would need an unofficial scraper (TWINT, for example), which can be very volatile, requires a fair amount of technical knowledge to pull off, and also needs a proxy pool (because Twitter detects bulk scraping attempts).
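To make the hydration numbers above concrete, here is a minimal sketch of the arithmetic and of splitting a list of tweet IDs into request-sized batches. The helper names (tweets_per_day, batch_ids) and the batch size of 100 IDs per request are assumptions for illustration, not values taken from this article, so check the current API documentation before relying on them.

```python
from typing import Iterator, List


def tweets_per_day(tweets_per_window: int, window_minutes: int = 15) -> int:
    """Upper bound on tweets retrievable per day at a fixed per-window limit."""
    windows_per_day = (24 * 60) // window_minutes
    return tweets_per_window * windows_per_day


def batch_ids(tweet_ids: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of tweet IDs, one slice per lookup request.

    batch_size=100 is an assumed per-request limit, not a documented constant here.
    """
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


if __name__ == "__main__":
    # 90,000 tweets per 15-minute window, as quoted in the text
    print(tweets_per_day(90_000))
    ids = [str(n) for n in range(250)]
    print([len(batch) for batch in batch_ids(ids)])
```

With 96 fifteen-minute windows in a day, 90,000 tweets per window comes to 8,640,000 tweets per day, which is where the "about 8M" figure comes from.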
Using the Streaming API
To begin your journey with the Streaming API, I have prepared a Python example for you to start from (full application here).
First things first: the dependencies. We need a Python wrapper for the Twitter API, namely Tweepy, and a way to store all the data we pull from Twitter. We use sqlite3 here as an example, but if you want to pull down all the data without dropping any fields, I would recommend JSON.
We also need some utilities: datetime to give the output file a timestamp, os to store files in directories, a pandas DataFrame to hold the data temporarily, and a small email function using smtplib to notify you when the streamer stops unexpectedly (because it will be running on its own for a LONG time). Together, we have this:
import pandas as pd
import tweepy
import os
import sys
from datetime import datetime
from urllib3.exceptions import ProtocolError
import smtplib
import ssl
import sqlite3
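For readers who prefer the JSON route mentioned above, the following is a minimal sketch of appending one tweet per line to a JSON Lines file. The helper names append_jsonl and read_jsonl are hypothetical. The sketch assumes each record is a plain dict, which in Tweepy is typically available as status._json.

```python
import json
from typing import Any, Dict, List


def append_jsonl(path: str, record: Dict[str, Any]) -> None:
    """Append one record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as handle:
        # json.dumps escapes newlines inside strings, so each record stays on one line
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read every non-empty line of a JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending line by line means nothing has to be held in memory between writes, at the cost of larger files than the trimmed-down SQLite tables.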
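The next step is to construct a listener using tweepy.StreamListener (this class exists in Tweepy 3.x; Tweepy 4.0 removed it in favor of subclassing tweepy.Stream directly):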
# tweetcount (int) and tweetlist (list) are module-level globals that must be
# initialized before the stream starts, e.g. tweetcount = 0 and tweetlist = [].
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):  # Overrides the original on_status method
        if hasattr(status, 'retweeted_status'):  # You can ignore retweets if you want
            # sys.stdout.write("Retweet, Ignore \n")
            return
        if status.lang != "en":  # You can filter by language here, or do it before calling the streamer
            sys.stdout.write("Not English, Ignore \n")
            return
        global tweetcount
        global tweetlist
        tweetcount += 1
        tweetdict = {
            "Tweet_ID": str(status.id),
            "Content": status.text,
            "Created_At": str(status.created_at),
            "Username": status.user.screen_name,
            "User_ID": status.user.id_str,
            "User_Description": status.user.description,
            "User_Followers_count": str(status.user.followers_count),
            "Entities": str(status.entities),
            "Source": status.source,
            "Source_URL": status.source_url,
            "Geo": str(status.geo),
            "Coordinates": str(status.coordinates),
            # "Sensitive": status.possibly_sensitive,
            "Lang": status.lang
        }  # This is only part of what you could put into the database; refer to the Tweepy API docs for more.
        tweetlist.append(tweetdict)
        sys.stdout.write("Getting tweet No. " + str(tweetcount) + "\n")
        # Data is stored in sqlite3 databases, each file containing 5,000 tweets,
        # with a timestamp attached to the file name.
        if tweetcount == 5000:
            tweetcount = 0
            if not os.path.exists("stream"):
                os.mkdir("stream")
            df_result = pd.DataFrame(tweetlist)
            timestamp = datetime.now()
            timestamp_str = timestamp.strftime("%Y_%m_%d_%H_%M_%S")
            filename = os.path.join("stream", timestamp_str + "_data.db")
            # The file is stored as stream/YYYY_MM_DD_HH_MM_SS_data.db
            conn = sqlite3.connect(filename)
            df_result.to_sql('tweets', conn, if_exists='append', index=False)
            conn.close()  # Release the file handle once the batch is written
            tweetlist = []

    def on_error(self, status_code):
        if status_code == 420:  # If a 420 rate-limit response is seen, the streamer is stopped.
            return False
        print("Continue with status_code:" + str(status_code))
        return True
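Because the listener writes a new database file for every 5,000 tweets, the data ends up spread across many files. As a minimal sketch of how they might be read back together, the hypothetical helper load_tweets below collects the rows of the tweets table from every file matching the naming pattern used above. It relies only on the standard library, and the function name and its default directory argument are assumptions for illustration.

```python
import glob
import os
import sqlite3
from typing import Dict, List


def load_tweets(directory: str = "stream") -> List[Dict[str, object]]:
    """Read the 'tweets' table from every *_data.db file in `directory`."""
    rows: List[Dict[str, object]] = []
    # Timestamped names sort chronologically, so sorted() keeps collection order
    for path in sorted(glob.glob(os.path.join(directory, "*_data.db"))):
        conn = sqlite3.connect(path)
        conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
        try:
            cursor = conn.execute("SELECT * FROM tweets")
            rows.extend(dict(row) for row in cursor)
        finally:
            conn.close()
    return rows
```

The resulting list of dicts can be passed straight to pd.DataFrame if you want to continue the analysis in pandas.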
We can also add an email notification to the whole thing using try/except:
try:
    pass  # Your function here
except Exception as e:  # Send an email if the program exits unexpectedly, so that you can come back to check.
    print("An exception has occurred:")
    print(e)
    smtp_server = "smtp.gmail.com"
    port = 587
    sender_email = ""
    receiver_email = ""
    # The blank line after the Subject header separates the headers from the body.
    message = """\
Subject: Hi There

The program of tweet collecting has stopped.

This message is sent from Python.
"""
    password = ""  # Password of the sender email account.
    context = ssl.create_default_context()
    server = None  # Stays None if the connection itself fails
    try:
        server = smtplib.SMTP(smtp_server, port)
        server.ehlo()
        server.starttls(context=context)  # Upgrade the connection to TLS
        server.ehlo()
        server.login(sender_email, password)
        server.sendmail(sender_email, receiver_email, message)
    except Exception as mail_error:
        print(mail_error)
    finally:
        if server is not None:  # Only quit if the connection was opened
            server.quit()
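Before sending the alert, it may be worth retrying a few times, since dropped connections are common on long-running streams. The following is a minimal sketch of a generic retry helper with exponential backoff. The name run_with_retries, its parameters, and the delay values are hypothetical and not part of the original application.

```python
import time
from typing import Callable, Tuple, Type


def run_with_retries(
    task: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `task`, retrying on the given exceptions with doubling delays.

    Returns the number of attempts used. Re-raises the last exception once
    max_attempts is exhausted, so an outer try/except can send the alert.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            task()
            return attempt
        except retry_on:
            if attempt >= max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

In the streamer, retry_on could be set to (ProtocolError,), the urllib3 exception imported earlier, with the call that starts the stream passed as task. Any exception that still escapes would then reach the except block above and trigger the email.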