Topic Analysis of Twitter Profiles
This blog post is an exercise in using two different APIs for collecting and analysing tweets.
To pull the tweets we will use the tweepy library, which provides a Python wrapper for the API provided by Twitter. To use it you will have to create a free Twitter developer account and register an application to obtain the credentials.
To analyse the content of a tweet and extract media topics from it we will use the TextRazor API, which also has a free plan that includes 500 calls/day.
Step 1: Pulling the tweets
Let's pull the most recent 600 tweets from BBCWorld and from basecamp_ai and compare their topics. From each tweet we will extract the text, the shared url, and then convert the results to a pandas dataframe.
import os

import pandas as pd
import tweepy

tweets = {}
auth = tweepy.OAuthHandler(os.environ['TWITTER_CONSUMER_KEY'],
                           os.environ['TWITTER_CONSUMER_SECRET'])
auth.set_access_token(os.environ['TWITTER_ACCESS_KEY'],
                      os.environ['TWITTER_ACCESS_SECRET'])
api = tweepy.API(auth)

screen_name = 'basecamp_ai'
tweets_per_page = 200
num_pages = 3
for res in (tweepy.Cursor(api.user_timeline,
                          id=screen_name,
                          count=tweets_per_page)
            .pages(num_pages)):
    if len(res) > 0:
        for r in res:
            tweet = {}
            tweet['id'] = r.id
            tweet['published_at'] = r.created_at
            tweet['content'] = r.text
            if len(r.entities['urls']) > 0:
                url = r.entities['urls'][0]['expanded_url']
                # Skip links that just point back to Twitter itself.
                if 'twitter.com' not in url:
                    tweet['shared_url'] = url
                else:
                    tweet['shared_url'] = None
            else:
                tweet['shared_url'] = None
            tweets[tweet['id']] = tweet
tweets_df = pd.DataFrame.from_dict(tweets, orient='index')
Step 2: Extracting the topics
Out of the 600 basecamp_ai tweets, 572 contain at least one shared url. This is great for topic analysis, since the tweet text itself is too short to extract topics reliably. Luckily, you can also extract topics from urls directly with the TextRazor API. Go to www.textrazor.com and create an account to get your API key. For each url the API will usually return a number of media topics with corresponding scores.
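As a quick aside, the count of tweets with a shared url falls straight out of the dataframe built in Step 1. A minimal sketch on toy data (the rows here are made up; only the shared_url column name matches the code above):

```python
import pandas as pd

# Toy stand-in for the tweets_df built in Step 1 (values are illustrative).
tweets_df = pd.DataFrame({
    'id': [1, 2, 3],
    'content': ['a', 'b', 'c'],
    'shared_url': ['https://example.com/x', None, 'https://example.com/y'],
})

# Count tweets that carry at least one shared url.
num_with_url = tweets_df['shared_url'].notna().sum()
print(num_with_url)  # 2 of the 3 toy tweets have a url
```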
import os

import pandas as pd
import textrazor

textrazor.api_key = os.environ['TEXTRAZOR_API_KEY']
client = textrazor.TextRazor(extractors=["topics"])
client.set_classifiers(['textrazor_mediatopics'])

results = []
category_names = {'-1': 'no topics discovered',
                  '-2': 'no shared urls'}
for tweet in basecamp_tweets.itertuples():
    url = tweet.shared_url
    labels = []
    if url:
        response = client.analyze_url(url)
        for c in response.categories():
            labels.append((c.category_id, c.score))
            if c.category_id not in category_names:
                category_names[c.category_id] = c.label
        if len(labels) == 0:
            labels = [('-1', 0)]  # no topics discovered
    else:
        labels = [('-2', 0)]  # no shared urls
    for subject_code, score in labels:
        results.append({'tweet_id': tweet.id,
                        'content': tweet.content,
                        'shared_url': url,
                        'subject_code': subject_code,
                        'score': score,
                        'topic_name': category_names[subject_code]})
basecamp_topics = pd.DataFrame.from_records(results)
The topic structure is tree-like and can be found here. That means that for one url we can get both the topic "science and technology" with a score of 0.3465 and "science and technology>social sciences>geography" with a score of 0.4041.
Before reaching the daily limit I could extract 3964 topics from 500 urls. The full topic labels are in the form "science and technology>social sciences>psychology". From that I will create two additional columns in the dataframe with the 1st and 2nd level topics.
basecamp_topics['1st_level_topic'] = basecamp_topics['topic_name'].apply(lambda x: x.split('>')[0])
basecamp_topics['2nd_level_topic'] = basecamp_topics['topic_name'].apply(lambda x: '>'.join(x.split('>')[0:2]))
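To make the splitting logic concrete, here is what those two lambdas produce for one full label (the label itself is taken from the example above):

```python
# Illustrative check of the level-splitting logic on one full label.
label = 'science and technology>social sciences>psychology'

first = label.split('>')[0]            # everything before the first '>'
second = '>'.join(label.split('>')[0:2])  # first two levels, rejoined

print(first)   # science and technology
print(second)  # science and technology>social sciences
```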
Let's see which topics are most common.
basecamp_topics['topic_name'].value_counts()
No surprises here. As expected from a data science bootcamp, most tweets are about computer sciences, software and mathematics.
economy, business and finance>economic sector>computing and information technology 418
science and technology 380
science and technology>technology and engineering>IT/computer sciences 366
economy, business and finance>economic sector>computing and information technology>software 319
science and technology>technology and engineering 241
science and technology>mathematics 224
In the next step I would like to calculate the average score per 1st level topic for each tweet and then sum those scores for each 1st level topic.
basecamp_topic_profile = (basecamp_topics.groupby(['tweet_id', '1st_level_topic'])
                          .agg({'score': 'mean'})
                          .groupby('1st_level_topic')
                          .agg({'score': 'sum'}))
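To sanity-check the two-stage aggregation, here is the same chain run on a tiny hand-made table (topic names, scores, and ids are illustrative only):

```python
import pandas as pd

# Toy table in the shape of basecamp_topics: one row per (tweet, topic) pair.
topics = pd.DataFrame({
    'tweet_id': [1, 1, 2],
    '1st_level_topic': ['sci', 'sci', 'sci'],
    'score': [0.2, 0.4, 0.5],
})

# Stage 1: average per (tweet, topic); stage 2: sum per topic.
profile = (topics.groupby(['tweet_id', '1st_level_topic'])
           .agg({'score': 'mean'})
           .groupby('1st_level_topic')
           .agg({'score': 'sum'}))

# Tweet 1 averages to 0.3, tweet 2 contributes 0.5 -> 'sci' sums to 0.8.
print(profile.loc['sci', 'score'])
```

The second `groupby('1st_level_topic')` works because that column became an index level after the first aggregation.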
basecamp_ai topic profile
I was a little bit surprised that "economy, business and finance" came before "science and technology". A possible explanation could be that there are a lot of tweets mentioning data science and AI companies and startups. It may also be that this 1st level topic is more easily recognised with high confidence (score) than others.
And indeed, checking the numbers revealed that "science and technology" was detected in 440 tweets and "economy, business and finance" in 437 tweets.
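Since the topic table has one row per (tweet, topic) pair, counting distinct tweet ids per 1st level topic gives those numbers. A small sketch on made-up rows (column names match the dataframe above, the data does not):

```python
import pandas as pd

# Toy rows in the shape of basecamp_topics (values are illustrative).
topics = pd.DataFrame({
    'tweet_id': [1, 1, 2, 3],
    '1st_level_topic': ['science and technology',
                        'economy, business and finance',
                        'science and technology',
                        'science and technology'],
})

# Distinct tweets per 1st level topic.
tweets_per_topic = topics.groupby('1st_level_topic')['tweet_id'].nunique()
print(tweets_per_topic['science and technology'])  # 3
```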
Number of tweets per 1st level topic
The topics of BBCWorld show a more evenly distributed profile, with the focus on politics and economy.
Topic profile of BBCWorld
Step 3: Your Turn
This short exercise shows how easily you can perform a task as complex as topic classification using the right APIs.
Try it on your own Twitter profile and share your results in the comments :-)