A WORLD THROUGH THE NEWS

APPLIED DATA ANALYSIS @EPFL

by Bryan ABATE, Grégoire CLEMENT and Maxime DELISLE

INTRODUCTION

Brace yourself

The aim of this project is to use the data provided by The GDELT Project to visualize the world through the way the news is reported.

The GDELT Project monitors broadcasts and web news from nearly every country in the world, extracts the metadata such as the location, the actors and the tone used by the source of information, and then stores it.

Using the data provided by GDELT, we try to determine, for each event, the country in which it happened (the “target country”) and the country of the source that wrote the article (the “source country”). Combining this information with the average tone of the articles reporting the events gives us small “windows” onto the way a person, or a group of people, from the “source country” views something happening in the “target country”.

In this project, by putting together those windows, we will show what can be seen through them.

The source code is available here.

DATA PROCESSING

Let’s get dirty

From the raw data of GDELT, the main challenge is finding the “source country” from a website URL.

Our first idea is to look at the TLD (top-level domain) of each URL and match the source with the corresponding country. This is easy to implement. However, most of the event sources remain unmatched because the TLD of their URL does not correspond to a specific country (.com, .org, and so on).
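
As an illustration, the TLD matching can be sketched in a few lines of Python. This is a minimal sketch and not necessarily the code we ran: the function name `country_from_tld` and the small `CCTLD_TO_COUNTRY` table are assumptions made for the example, and a real table would need to cover every country-code TLD.

```python
from urllib.parse import urlparse

# Illustrative subset only (assumption); a full table would list every ccTLD.
CCTLD_TO_COUNTRY = {
    "fr": "FRA",
    "ch": "CHE",
    "mx": "MEX",
    "ru": "RUS",
    "uk": "GBR",
}

def country_from_tld(url):
    """Return the ISO alpha-3 country for the URL's TLD, or None if unknown."""
    host = urlparse(url).hostname
    if not host:
        # Not a parseable web URL (for example a broadcast name).
        return None
    tld = host.rsplit(".", 1)[-1]
    # Generic TLDs such as "com" or "org" are not in the table.
    return CCTLD_TO_COUNTRY.get(tld)

print(country_from_tld("https://news.example.fr/world/story"))  # FRA
print(country_from_tld("https://www.example.com/world/story"))  # None
```

Anything the function cannot attribute comes back as `None`, which makes it easy to count how many sources are still unmatched.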

Since this idea is not sufficient on its own, we look online for an existing mapping from websites to countries. After finding, scraping and trying a few mappings, we decide to use the data provided by abyznewslinks, which is the most up-to-date and complete directory we could find.

Unfortunately, even by combining the TLD matching and the mapping, around 20% of the sources are left unmatched.
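
The combination of the two approaches can be pictured as a simple fallback chain. The sketch below is hypothetical: the entries of `DOMAIN_TO_COUNTRY` stand in for the scraped mapping, the order of the two lookups is an assumption, and `unmatched_share` is just a helper name chosen for the example.

```python
from urllib.parse import urlparse

# Hypothetical entries standing in for the scraped website-to-country mapping.
DOMAIN_TO_COUNTRY = {"example-daily.com": "FRA", "example-times.org": "MEX"}
# Illustrative subset of country-code TLDs.
CCTLD_TO_COUNTRY = {"ch": "CHE", "ru": "RUS"}

def source_country(url):
    """Try the domain mapping first, then fall back to the TLD."""
    host = urlparse(url).hostname
    if not host:
        return None
    domain = host[4:] if host.startswith("www.") else host
    if domain in DOMAIN_TO_COUNTRY:
        return DOMAIN_TO_COUNTRY[domain]
    return CCTLD_TO_COUNTRY.get(domain.rsplit(".", 1)[-1])

def unmatched_share(urls):
    """Fraction of sources for which no country could be found."""
    countries = [source_country(u) for u in urls]
    return countries.count(None) / len(countries)

urls = [
    "https://www.example-daily.com/a",
    "https://news.example.ch/b",
    "https://example-times.org/c",
    "https://www.example-global.net/d",
]
print(unmatched_share(urls))  # 0.25
```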

This plot shows the frequency of each country as a source in our dataset. We can see that a bit more than 20% of the sources have no country matched.

Upon inspection, we find that the unmatched sources are the ones that are not affiliated with a specific country and have a more international scope. We therefore discard them, because we are interested only in news to which we can attribute a country of origin. Note that the unmatched sources also include some broadcasts, which are obviously not websites. As they represent an extremely small portion of the data, we discard them as well.

This plot shows which unmatched websites are the most frequent. Most of those websites are Arabic and represent a group of countries rather than one.

Almost no processing is needed to get the “target country” because it is contained in the raw data. However, the country codes follow the FIPS 10-4 standard, which is uncommon in GeoJSON files, so we convert them to ISO 3166-1 alpha-3, a standard designed specifically for countries.
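
The conversion itself is a plain lookup. The sketch below shows the idea with a handful of codes only; the table name `FIPS_TO_ISO3` and the helper `fips_to_iso3` are assumptions for the example, and a complete table would be needed in practice.

```python
# Partial FIPS 10-4 -> ISO 3166-1 alpha-3 table, for illustration only.
FIPS_TO_ISO3 = {
    "US": "USA",
    "UK": "GBR",
    "FR": "FRA",
    "MX": "MEX",
    "GM": "DEU",  # Germany
    "SZ": "CHE",  # Switzerland
    "RS": "RUS",  # Russia
    "UP": "UKR",  # Ukraine
    "CH": "CHN",  # China (whereas ISO alpha-2 "CH" is Switzerland)
    "JA": "JPN",  # Japan
}

def fips_to_iso3(code):
    """Return the ISO alpha-3 code for a FIPS 10-4 code, or None if unknown."""
    return FIPS_TO_ISO3.get(code.strip().upper())

print(fips_to_iso3("SZ"))  # CHE
print(fips_to_iso3("CH"))  # CHN
```

The "CH" entry shows why the conversion cannot be skipped: the same two letters designate different countries in the two standards.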

EXPLORATION

Indiana Jones would be proud

One of the features present in the data is the Goldstein scale. The Goldstein scale captures the impact that a type of event will have on the stability of the country the event happened in.

From this feature, we group the events by the country they happened in and take the average Goldstein score as the aggregate. According to GDELT, the aggregate gives us the stability of a country over the corresponding time.

However, we could argue that two events of the same type do not necessarily have the same potential impact and that some weighting is needed for our estimate to be accurate (for example, a riot with 10 people receives the same score as a riot with 10,000 people; cf. the GDELT documentation). One way to weight each event is to take into account the number of times it has been mentioned during the first 15 minutes, another feature given by GDELT.

The choropleth map above shows the average weighted Goldstein score for each country per month. A darker country is more unstable than the lighter ones over the selected period.
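
The weighting described above amounts to a weighted mean per country. Here is a minimal sketch, assuming each event is reduced to a `(country, goldstein, mentions)` tuple; the tuple layout and the function name are ours, chosen for the example.

```python
from collections import defaultdict

def weighted_goldstein(events):
    """Mention-weighted average Goldstein score per country.

    events: iterable of (country, goldstein_score, number_of_mentions).
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for country, goldstein, mentions in events:
        weighted_sum[country] += goldstein * mentions
        total_weight[country] += mentions
    # Skip countries whose events carry no weight at all.
    return {
        country: weighted_sum[country] / total_weight[country]
        for country in weighted_sum
        if total_weight[country] > 0
    }

events = [("FRA", -10.0, 1), ("FRA", 2.0, 3)]
print(weighted_goldstein(events))  # {'FRA': -1.0}
```

In this toy example the plain average would be -4.0, whereas the weighted average is -1.0 because the milder event was mentioned three times as often.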

Another feature that carries the same kind of information is the average tone of the sources mentioning the event. Since only around 3% of events have more than one source, we treat those events as negligible and consider this feature to be the tone of an article written by someone living in the source country.

The choropleth map above shows the average tone used to report the events happening in each country per month. A darker country means the tone was more negative than for the lighter ones.

The scale varies, but the proportions between the countries remain the same. For this reason, it is safe to say that this feature could also be used to express the stability of a particular country.

One difference, however, is that the average tone contains information on how the event was reported. This information can be seen as a bias the writer has toward or against the news itself, the place where the event happens, or the type of event.

When analyzing anything, we usually aim to reduce the bias to a minimum, but here we will try to use this bias in our favor to show differences between countries.

ANALYSIS

Let the magic happen

The bias we will focus on is the bias a writer from a source country X will have toward or against events happening in a target country Y.

By filtering the events so that we only keep the events happening in one target country, we can then group by the source countries and keep for each group the average “tone” as the aggregate.
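
The filter-then-group step just described can be sketched as follows. The event layout `(source_country, target_country, month, tone)` and the function name are assumptions for the example. We also group by month here, since that is the granularity used in our plots.

```python
from collections import defaultdict
from statistics import mean

def average_tone_by_source(events, target):
    """Average tone per (source country, month) for one target country.

    events: iterable of (source_country, target_country, month, tone).
    """
    tones = defaultdict(list)
    for source, event_target, month, tone in events:
        if event_target == target:  # keep only events in the target country
            tones[(source, month)].append(tone)
    return {key: mean(values) for key, values in tones.items()}

events = [
    ("FRA", "USA", 1, -2.0),
    ("FRA", "USA", 1, -4.0),
    ("CHE", "USA", 1, -1.0),
    ("FRA", "RUS", 1, 5.0),  # other target, ignored
]
print(average_tone_by_source(events, "USA"))
```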

On this plot, we can see the average tone used to talk about events happening in the USA, per source country and per month. The media mostly use a negative tone regarding events in the US.

This shows us the average tone used by the media of each country on the events happening in the selected country over a chosen period of time. This could be repeated over different intervals to observe an evolution.

However, it can be difficult to interpret the trends directly from this map because most of the changes do not come from an evolution of the bias, but from the change in the proportion of “good” and “bad” events. This can be observed by plotting the average “tone” over time for multiple countries where we can clearly see trends.

This plot shows the evolution over time of the average tone used to talk about events happening in the USA by news from France, Switzerland, and Mexico. We can see trends in the evolution of the tone among the countries.

As we are only interested in the bias, we would like to remove the common trend between the countries. One way to do so would be to remove the average tone of the articles written about the events happening in the target country.

This method removes two things: the information coming from the proportion of “good” and “bad” events, and the average bias that countries have toward or against the selected country. What remains is the bias specific to each source country.

This plot shows the evolution of the bias per source country regarding events in the USA. Here we removed the common trend by subtracting the average tone used by every source country writing about events happening in the USA.
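
Removing the common trend boils down to subtracting a per-month baseline. The sketch below makes one assumption worth flagging: the baseline is the unweighted mean of the per-country averages for each month, which may differ from an average taken over all individual articles. The function name is also ours.

```python
from collections import defaultdict
from statistics import mean

def remove_common_trend(tone):
    """Subtract, for each month, the mean tone over all source countries.

    tone: dict mapping (source_country, month) -> average tone.
    """
    by_month = defaultdict(list)
    for (_, month), value in tone.items():
        by_month[month].append(value)
    baseline = {month: mean(values) for month, values in by_month.items()}
    return {
        (source, month): value - baseline[month]
        for (source, month), value in tone.items()
    }

tone = {
    ("FRA", 1): -2.0, ("CHE", 1): -4.0,
    ("FRA", 2): -1.0, ("CHE", 2): -1.0,
}
print(remove_common_trend(tone))
```

By construction, the resulting biases sum to zero within each month, so only the differences between source countries remain.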

Now, what if we could classify countries depending on their opinion on each other?

First, we find, for each source country, the polynomial (a weighted sum of polynomial basis functions) that approximates the evolution of its average tone toward or against the selected country.

This plot shows what the approximation looks like. We aim to capture the general behavior of the curve without overfitting it.

We then use the weights of this approximation as a set of features for each country. These features let us view the countries as points in a high-dimensional space.
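
To make the idea concrete, here is a minimal sketch of turning one country's curve into a feature vector with NumPy's least-squares polynomial fit. The degree of 3, the rescaling of time to [-1, 1] and the function name `trend_features` are assumptions for the example, not values taken from our experiments.

```python
import numpy as np

def trend_features(months, bias, degree=3):
    """Fit a low-degree polynomial to a bias curve; return its coefficients.

    months must contain at least two distinct values.
    Coefficients are ordered from the highest degree to the constant term.
    """
    x = np.asarray(months, dtype=float)
    # Rescale time to [-1, 1] so the fit stays well conditioned.
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    return np.polyfit(x, np.asarray(bias, dtype=float), degree)

months = list(range(12))
bias = [0.1 * m - 0.5 for m in months]  # toy linear trend
print(trend_features(months, bias))
```

Keeping the degree low is what prevents the fit from chasing month-to-month noise. Each country ends up with `degree + 1` numbers describing the shape of its curve.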

From this space, we can apply clustering algorithms to group countries together. Those groups can be interpreted as source countries that have a similar evolution of tone regarding the selected country.

The choropleth maps show how the clustering algorithm, spectral clustering, divides the countries depending on which target country we focus on for the bias. We can clearly see a separation between the US and Russia. Cuba, most of South America and most of Asia are on Russia’s side, whereas the countries on the US’s side are not constant.
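
The clustering step can be sketched with scikit-learn's `SpectralClustering`. The RBF affinity, the fixed seed and the wrapper `cluster_countries` are assumptions for the example, and the feature values below are invented toy numbers rather than results.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_countries(features, n_clusters=2, seed=0):
    """Group countries by their feature vectors with spectral clustering.

    features: dict mapping country -> sequence of polynomial weights.
    Returns a dict mapping country -> cluster label.
    """
    countries = sorted(features)
    X = np.vstack([features[c] for c in countries])
    model = SpectralClustering(
        n_clusters=n_clusters, affinity="rbf", random_state=seed
    )
    labels = model.fit_predict(X)
    return dict(zip(countries, labels.tolist()))

# Toy feature vectors (made up): two visibly separated groups.
features = {
    "A1": [0.0, 0.1], "A2": [0.1, 0.0], "A3": [0.05, 0.05], "A4": [0.1, 0.1],
    "B1": [2.0, 2.1], "B2": [2.1, 2.0], "B3": [2.05, 2.05], "B4": [2.0, 2.0],
}
print(cluster_countries(features))
```

The label numbers themselves are arbitrary. Only the grouping matters, which is why the maps use one color per cluster.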

Now, what if we pushed this even further?

We grouped countries depending on how they view the events happening in a selected country, but we could do the same thing multiple times with a different selected country and each time keep the weights of the corresponding source country.
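
In practice, this means concatenating, for each source country, the weights obtained for every target country into one longer feature vector. A minimal sketch follows. Keeping only the source countries that have weights for every target is an assumption of the example, as is the function name.

```python
import numpy as np

def stack_features(per_target):
    """Concatenate each source country's weights across target countries.

    per_target: dict target -> dict source_country -> 1-D array of weights.
    Only source countries present for every target are kept.
    """
    targets = sorted(per_target)  # fixed order so vectors are comparable
    common = set.intersection(*(set(per_target[t]) for t in targets))
    return {
        country: np.concatenate([per_target[t][country] for t in targets])
        for country in sorted(common)
    }

per_target = {
    "USA": {"FRA": np.array([1.0, 2.0]), "CHE": np.array([3.0, 4.0])},
    "RUS": {"FRA": np.array([5.0, 6.0])},
}
print(stack_features(per_target))
```

The resulting vectors can be passed to the same clustering step as before; they simply live in a space with more dimensions.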

When applying the same clustering algorithm, this is what we get:

This choropleth shows how the clustering algorithm, spectral clustering, groups the countries when we take into account the bias per country toward the USA, Russia, Ukraine, and France all together. We can see a separation similar to the one on the choropleths above, only much clearer, with the “East” on one side and the “West” on the other.

This plot shows the silhouette score of the clustering above. The silhouette score tells us that two clusters is optimal for our data. As a rule of thumb, a score above 0.5 can be considered significant. It also tells us that computing more than two clusters would not be meaningful.
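
For readers unfamiliar with the silhouette score, the sketch below computes it from its definition using Euclidean distances. It is written from scratch for illustration; a library implementation such as the one in scikit-learn would normally be used, and the toy points are invented.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            mean(dist(point, q) for q in other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well separated, close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # poor grouping, negative
```

Scores close to 1 mean each point sits much closer to its own cluster than to any other, which is what we look for when choosing the number of clusters.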

CONCLUSION

Already?

In conclusion, we showed how we can be defined by the bias we have toward or against other people or things (in our case, countries), and how it is possible to create groups or clusters depending on those biases and how they evolve over time. As further work, we could imagine reusing the same approach to compute anyone’s bias and perhaps extract more precise information from it, such as linguistic or religious information.

AUTHORS

Bryan ABATE

Master in Computer Science

Grégoire CLEMENT

Master in Data Science

Maxime DELISLE

Master in Data Science