Twitter Crawler

Introduction

Our Twitter Crawler is an internal tool designed to accomplish two goals:

  • To help the SGA Communications team assist members of the GT community by responding to Twitter posts where SGA members can help
  • To help SGA leadership passively assess student perception of topics around the GT campus in order to better serve the student body

Project Design

The Twitter Crawler uses the Tweepy library for Python to listen for tweets that match a list of keywords. It then forwards links to matching tweets to a Slack channel whose members are the Vice President of Communications, the Vice President of IT, the Chief of Staff, the President, the EVP, the application owner, two selected UHR representatives, and internal developers.
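The keyword-matching step can be illustrated with a minimal sketch. This is not the crawler's actual code: the function name and the keyword list are hypothetical, and the sketch matches text locally rather than through Tweepy's streaming filters.

```python
import re
from typing import Iterable, Optional


def find_matching_keyword(tweet_text: str, keywords: Iterable[str]) -> Optional[str]:
    """Return the first keyword found in the tweet text, or None if none match."""
    lowered = tweet_text.lower()
    for keyword in keywords:
        # Require non-word characters (or string edges) on both sides so a
        # keyword is not matched inside a longer word.
        pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            return keyword
    return None


# Hypothetical keyword list, for illustration only.
KEYWORDS = ["dining hall", "parking", "SGA"]

print(find_matching_keyword("Parking on campus is impossible today", KEYWORDS))
print(find_matching_keyword("Nothing relevant here", KEYWORDS))
```

Returning the matched keyword rather than a boolean lets the forwarding step post the keyword alongside the link, as described above.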

Data Protection

Only links to tweets and the matching keyword are posted. The designated Slack channel contains only logs of the IDs associated with each posted tweet and its matching keyword. Because no tweet text is stored, the channel retains no copy of a tweet's content if the tweet is later deleted.

Previews of links are blocked for any link under the twitter.com domain. This ensures that Slack cannot cache data from Twitter, meaning that SGA members can never access the content of a Twitter post through the designated Slack channel if the Twitter post is not publicly accessible.
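How previews are blocked is not specified here. One per-message way to get a similar effect, assuming messages are posted through Slack's chat.postMessage method, is to disable unfurling with its unfurl_links and unfurl_media arguments. The helper name, channel name, and URL below are hypothetical.

```python
import json


def build_slack_payload(channel: str, tweet_url: str, keyword: str) -> dict:
    """Build a chat.postMessage payload that carries only the link and keyword."""
    return {
        "channel": channel,
        "text": f"Keyword: {keyword}\n{tweet_url}",
        # Ask Slack not to expand the link into a preview of the tweet.
        "unfurl_links": False,
        "unfurl_media": False,
    }


# Hypothetical channel name and placeholder tweet URL.
payload = build_slack_payload(
    "#twitter-crawler", "https://twitter.com/i/web/status/1234567890", "parking"
)
print(json.dumps(payload, indent=2))
```

A per-message flag only covers messages the crawler itself sends, so a workspace-level rule for the twitter.com domain would still be needed for links pasted by channel members.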

Lastly, because Twitter’s server only requires a valid tweet ID in order to fetch a tweet, the username associated with each tweet sent to our designated Slack channel is also masked.

As a demonstration, notice that the following two links both forward a user to the same page:

The latter of the two links above is the format in which tweets are forwarded to our designated Slack channel.
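A minimal sketch of building a username-free link from a tweet ID follows. The URL format used here (twitter.com/i/web/status/<id>) is an assumption for illustration; the exact format the crawler emits is not specified in this document, and both function names are hypothetical.

```python
import re
from typing import Optional

# Matches either a username-based status URL or the assumed masked form.
_STATUS_PATTERN = re.compile(r"^https://twitter\.com/(?:[^/]+|i/web)/status/(\d+)")


def masked_tweet_url(tweet_id: int) -> str:
    """Build a link that identifies the tweet by ID only (assumed format)."""
    return f"https://twitter.com/i/web/status/{tweet_id}"


def extract_tweet_id(url: str) -> Optional[int]:
    """Pull the numeric tweet ID out of a status URL, or return None."""
    match = _STATUS_PATTERN.match(url)
    return int(match.group(1)) if match else None


# Placeholder username and ID: both URLs carry the same tweet ID.
named = "https://twitter.com/some_user/status/1234567890"
masked = masked_tweet_url(1234567890)
print(extract_tweet_id(named) == extract_tweet_id(masked))
```

Because only the ID is kept, the forwarded link carries no information about who wrote the tweet.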

Open Source Standards

All source code, including the list of keywords used to match tweets, is accessible to all GT users. Data files containing keys, tokens, or private endpoints are excluded. The source code is hosted at: https://github.gatech.edu/gt-sga-it/twitter-crawler.

User Selection

Only Twitter users who follow the SGA account and whose accounts are public are eligible to have their content forwarded to the SGA Slack workspace. An opt-out form can be accessed here. After an opt-out form is submitted, the user's tweets will never be forwarded to an SGA workspace. SGA's IT Board uses Twitter's API to map usernames (@ handles) to profile ID numbers, so that an account remains ineligible even if its username changes later. This process will soon be automated using the Qualtrics API to extract responses and feed them into a list of users to filter out when listening to tweets from SGA followers.
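The eligibility rules above can be sketched as a single check keyed on profile ID rather than username. The Author type, its field names, and the sample IDs are hypothetical; in practice these values would come from Twitter's API and the opt-out form responses.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Author:
    user_id: int        # stable profile ID, unchanged by username edits
    username: str       # @ handle, which the user may change at any time
    is_protected: bool  # True if the account is not public
    follows_sga: bool   # True if the account follows the SGA account


def is_eligible(author: Author, opted_out_ids: Set[int]) -> bool:
    """Apply the three rules: follows SGA, public account, not opted out."""
    return (
        author.follows_sga
        and not author.is_protected
        and author.user_id not in opted_out_ids
    )


# Hypothetical opt-out list, stored as profile IDs.
OPTED_OUT = {42}

before = Author(user_id=42, username="old_handle", is_protected=False, follows_sga=True)
after = Author(user_id=42, username="new_handle", is_protected=False, follows_sga=True)
other = Author(user_id=7, username="buzz", is_protected=False, follows_sga=True)

print(is_eligible(before, OPTED_OUT), is_eligible(after, OPTED_OUT), is_eligible(other, OPTED_OUT))
```

Keying the opt-out set on user_id is what keeps the opt-out in force after a username change, since the username never enters the check.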