The Data Awakens
It has come. The most anticipated movie of 2015 was released and the Force has awakened. We have tracked the activity on Twitter before and after the release date to gain insight into the reactions of people and their feelings about the latest episode of the most famous movie franchise in history.
Data:
We collected the stream of Twitter data containing search terms and hashtags related to Star Wars: The Force Awakens through the Twitter API. The data was collected between the 4th and 29th of December 2015 (with the world premiere being held on Dec 17th). Altogether, more than 10 million tweets were collected, with ~2.5% containing geolocation either in the form of direct coordinates or a human-readable location (e.g. New York). Due to a technical issue we missed a window of around 10 hours on Dec 19th, which was excluded from the analysis.
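To make the geolocation share concrete, here is a minimal sketch of how tweets with a usable location could be picked out. It assumes the classic tweet JSON layout (an optional GeoJSON point under "coordinates", stored as [longitude, latitude], and a free-text "location" on the user profile); the function names and the sample tweets are hypothetical and not taken from our actual pipeline.

```python
from typing import Optional, Tuple, Union

Location = Union[Tuple[float, float], str]


def extract_location(tweet: dict) -> Optional[Location]:
    """Return (lat, lon), a free-text place string, or None.

    Assumption: "coordinates" holds a GeoJSON point ([lon, lat]) and
    "user.location" holds a human-readable place such as "New York".
    """
    point = tweet.get("coordinates")
    if point and point.get("coordinates"):
        lon, lat = point["coordinates"]
        return (lat, lon)
    place = (tweet.get("user") or {}).get("location")
    if place and place.strip():
        return place.strip()
    return None


def geolocated_share(tweets: list) -> float:
    """Fraction of tweets that carry any usable location."""
    if not tweets:
        return 0.0
    located = sum(1 for t in tweets if extract_location(t) is not None)
    return located / len(tweets)


# Made-up sample tweets, for illustration only
sample = [
    {"text": "#TheForceAwakens!",
     "coordinates": {"type": "Point", "coordinates": [-74.0, 40.7]}},
    {"text": "Star Wars tonight", "user": {"location": "New York"}},
    {"text": "no location here", "user": {"location": ""}},
    {"text": "nothing at all"},
]
print(geolocated_share(sample))
```

A human-readable location still has to be geocoded before it can be put on a map, which this sketch leaves out.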
Traffic:
The first thing we looked at was the frequency of Star Wars-related tweets over time. It is clearly visible that most of the tweets came from the US and the UK, which can easily be explained by the popularity of Twitter itself in these countries. The next thing to notice is the day-and-night periodicity, with people tweeting more at night than during the day. The timezone shift is also clearly visible.
More interestingly, we can see the build-up before the release: the number of tweets increases for a few days before the world premiere and skyrockets on the day itself.
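The traffic curve boils down to counting tweets per time bucket. A minimal sketch, assuming each tweet comes with a timezone-aware timestamp; the function name and the sample timestamps are made up for illustration:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def tweets_per_hour(timestamps):
    """Count tweets per UTC hour from an iterable of timezone-aware datetimes."""
    counts = Counter()
    for ts in timestamps:
        # Normalize to UTC first so tweets from different timezones line up,
        # then truncate to the start of the hour.
        hour = ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps, for illustration only
new_york = timezone(timedelta(hours=-5))
stamps = [
    datetime(2015, 12, 17, 20, 5, tzinfo=timezone.utc),
    datetime(2015, 12, 17, 20, 55, tzinfo=timezone.utc),
    datetime(2015, 12, 17, 15, 30, tzinfo=new_york),  # 20:30 UTC
    datetime(2015, 12, 17, 21, 10, tzinfo=timezone.utc),
]
hourly = tweets_per_hour(stamps)
for hour, n in hourly.items():
    print(hour.isoformat(), n)
```

Counting in UTC is what makes the timezone shift show up: the same local evening peak lands on different UTC hours for different countries.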
Sentiment analysis:
It is one thing to see that people tweet, but what do they think? To find out, we used a sentiment analysis model that assigned each tweet a score between -1 and 1 (-1 being a total hater, 1 someone willing to die for the movie). First we plotted the results on a hexbin map, which splits the world into small hexagons and shows the mean sentiment of the tweets within each cell.
There are clearly visible small areas of very strong and positive opinions in Canada and in northern and eastern Europe. Apparently people also did not like Star Wars (on average) very much in South America, Turkey, Japan and Indonesia. Trends are much less observable in the US, the UK and Western Europe, where the high number of tweets brings the distribution of sentiments much closer to the overall distribution (discussed later).
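The per-cell averaging behind a map like this can be sketched in a few lines. Note that this simplified version uses square latitude/longitude cells instead of hexagons to keep the code short; the function name, the (lat, lon, sentiment) tuple layout and the 5-degree cell size are assumptions for illustration, not our original plotting code.

```python
import math
from collections import defaultdict


def mean_sentiment_by_cell(points, cell_deg=5.0):
    """Average sentiment per square grid cell.

    points: iterable of (lat, lon, sentiment) tuples.
    Returns {(row, col): (mean_sentiment, tweet_count)}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for lat, lon, score in points:
        # floor() keeps negative coordinates in the correct cell
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        totals[cell][0] += score
        totals[cell][1] += 1
    return {cell: (total / n, n) for cell, (total, n) in totals.items()}


# Made-up points, for illustration only
points = [
    (40.7, -74.0, 0.5),
    (41.2, -73.5, -0.25),
    (-23.5, -46.6, -0.75),
]
cells = mean_sentiment_by_cell(points)
for cell, (avg, n) in sorted(cells.items()):
    print(cell, avg, n)
```

Returning the tweet count next to the mean is a deliberate choice: a cell with only a handful of tweets can easily show an extreme average, which fits the small, strongly colored areas on the map.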
Let's take a look at how the sentiment evolved alongside the number of tweets over time. While the number of tweets looks roughly like a Gaussian curve (keep in mind the missing tweets for Dec 19th, as mentioned earlier), the average sentiment shows a steady decline as time passes. There is an observable dip on the day of the world premiere, and sentiment remains low throughout.
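The two curves discussed here (tweet volume and average sentiment per day) come from the same grouping step. A minimal sketch, assuming the data is available as (day, sentiment) pairs; the function name and the sample records are hypothetical:

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_summary(records):
    """Group (day, sentiment) pairs by day.

    Returns {day: (tweet_count, mean_sentiment)}, ordered by day.
    """
    by_day = defaultdict(list)
    for day, score in records:
        by_day[day].append(score)
    return {day: (len(scores), mean(scores)) for day, scores in sorted(by_day.items())}


# Made-up records, for illustration only
records = [
    (date(2015, 12, 16), 0.5),
    (date(2015, 12, 16), 0.25),
    (date(2015, 12, 17), -0.5),
    (date(2015, 12, 17), 0.0),
    (date(2015, 12, 17), 0.5),
]
summary = daily_summary(records)
for day, (count, avg) in summary.items():
    print(day.isoformat(), count, avg)
```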
Sentiments by part of the world:
Last but not least, we looked at the distributions of tweet sentiments by part of the world. We made the comparison visually using histograms and statistically using the two-sample Kolmogorov-Smirnov (KS) test. In each case we compared tweets from a specific area to those from the rest of the world.
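For readers who have not met the two-sample KS test before, here is a self-contained sketch of what it computes: the largest gap D between the two empirical cumulative distributions, plus an asymptotic p-value. This is a plain-Python illustration under our own naming (ks_2samp_sketch is hypothetical), not the code used for the results below; in practice a library routine such as scipy.stats.ks_2samp would normally be used instead.

```python
import math
from bisect import bisect_right


def ks_2samp_sketch(a, b):
    """Two-sample KS statistic D and an asymptotic two-sided p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest vertical gap between the two empirical CDFs;
    # both are step functions, so it is enough to check observed values.
    d = max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        for x in set(a) | set(b)
    )
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # Approximation: the p-value is practically 1 here, and the
        # series below converges slowly for tiny lam.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Made-up samples, for illustration only
d_small, p_small = ks_2samp_sketch([1, 2, 3, 4], [3, 4, 5, 6])
d_big, p_big = ks_2samp_sketch([-0.5] * 1000, [0.5] * 1000)
print(d_small, p_small)
print(d_big, p_big)
```

Two things are worth keeping in mind when reading KS results on data of this size. First, with very large samples the p-value can underflow to exactly 0.0, and even small differences between distributions become statistically significant. Second, the test only says that two distributions differ somewhere; it does not say which one is more positive, so the histograms are still needed for the direction.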
We compared four arbitrarily selected areas (South America, USA, Europe, Asia). In the first three, the difference in distributions was confirmed both visually and by the statistical test. South America's sentiment about The Force Awakens is significantly more negative than the rest of the world's (as can also be seen in the hexbin map above). People from the US and Europe liked the new Star Wars significantly more than the rest of the world's population. We obtained a tricky result for Asia, where one bin (sentiment between -0.6 and -0.5) is much more frequent than elsewhere, so the KS test, which reacts to any difference between two distributions, found them significantly different. However, the visualization suggests the opposite relationship, and you are welcome to discuss in the comments why that is the case :).
See all the visualizations below.
South America
KS test p-value: 0.0
USA
KS test p-value: 0.0
Europe
KS test p-value: 0.0
Asia
KS test p-value: 0.0
Star Wars: The Force Awakens seems to have been a great success judging by all the hype around it, and it definitely made another generation addicted to the franchise. So why doesn't Twitter share the optimism?
There can be several reasons:
- There is an inherent sampling bias in working with social network data, as we only have data from people who decide to share. These are usually the ones with stronger opinions.
- We were pulling English tweets (due to the sentiment modeling and the tools available for NLP). While not that obvious, this can also create a sampling bias, especially in non-English-speaking countries.
- Lastly, there was valid criticism of the movie, and maybe people expected more.
If you have any more ideas, please leave a comment.
The Force and data are very similar. They are present everywhere, can do a lot of good and almost magic-like things and definitely also have a dark side. Luckily there are Knights of Knoyd who can help you.
May the Data be with you!