A map of World Cup stadia using Wikidata

Wikidata is an amazing project that aims to turn the unstructured text of Wikipedia into a database of facts and figures, so that you can go beyond presenting a page about something and start working with data about it.

I've been wanting to try it out, along with SPARQL, the language used to query it, so I decided to create a map of every stadium that has hosted a game at the FIFA World Cup finals. It's a topical query, as the 2018 World Cup in Russia has just started.

Step 1: Querying the data

I used query.wikidata.org to come up with a query that got me the data I was looking for. I had never used SPARQL before, so it took a bit of tweaking to get the query I needed. I found the interface helpful for finding the right entities, and the included examples helpful for working out how to structure the query.

Here's the query I came up with. I'll go through what each part does below.

wc_sparql = """
SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?FIFA_World_Cup wdt:P3450 wd:Q19317.
  ?FIFA_World_Cup wdt:P276 ?location.
  ?location wdt:P625 ?coord.
  ?location wdt:P17 ?country
}
ORDER BY ?FIFA_World_CupLabel
"""

The first part sets up the fields we want to return: the name of the World Cup, the location ID (a stadium), the name of the stadium, the co-ordinates (latitude and longitude) and the name of the country.

SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE {

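This next part fetches a human-readable label for each of the items, which is more helpful than the bare URI that would otherwise be returned. Any variable whose name ends in "Label" (such as ?locationLabel) is filled in by this service.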

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

Then we start off by binding a variable called "FIFA_World_Cup" to every item whose "sports season of league or competition" property (wdt:P3450) is set to "FIFA World Cup" (wd:Q19317). Each match is one individual tournament.

?FIFA_World_Cup wdt:P3450 wd:Q19317.

Then we look for the locations (wdt:P276) attached to each of these competitions:

?FIFA_World_Cup wdt:P276 ?location.

And for each location we want the co-ordinates (wdt:P625) and country (wdt:P17).

?location wdt:P625 ?coord.
?location wdt:P17 ?country
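
Since the competition is pinned down by a single item ID, the same pattern could in principle be pointed at a different competition. The following is only a sketch: `build_stadium_query` and `QUERY_TEMPLATE` are hypothetical names that are not part of the original notebook, the variable is renamed to `?season`, and it assumes the other competition's seasons are described with the same properties, which is not guaranteed.

```python
import re

# Same shape as the query above, with the competition ID left as a placeholder.
# Braces that belong to SPARQL are doubled so str.format leaves them alone.
QUERY_TEMPLATE = """
SELECT ?seasonLabel ?location ?locationLabel ?coord ?countryLabel WHERE {{
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
  ?season wdt:P3450 wd:{qid}.
  ?season wdt:P276 ?location.
  ?location wdt:P625 ?coord.
  ?location wdt:P17 ?country
}}
ORDER BY ?seasonLabel
"""


def build_stadium_query(qid):
    """Return the stadium query for the competition with the given item ID."""
    # Reject anything that is not a plain item ID such as Q19317.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError("not a Wikidata item ID: {!r}".format(qid))
    return QUERY_TEMPLATE.format(qid=qid)


wc_query = build_stadium_query("Q19317")  # Q19317 is the FIFA World Cup
```

Note that the rest of this post reads the `FIFA_World_CupLabel` field, so the processing code would need the renamed `seasonLabel` field instead.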

I then used a Python library called SPARQLWrapper to send the query to the Wikidata SPARQL endpoint and get JSON data back.

from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(wc_sparql)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
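
If SPARQLWrapper isn't available, the same request can be sketched with the standard library. This is an alternative to the code above, not what the notebook ran. It assumes the endpoint takes the query as a `query` URL parameter and returns JSON when `format=json` is passed; the `User-Agent` value is a placeholder you would replace with something that identifies your own script.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="world-cup-stadia-example/0.1"):
    """Build (but do not send) a GET request for a SPARQL query."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )


def run_query(query):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query(wc_sparql)` should then give the same `results` structure that the rest of this post works with.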

Here's an example of what one of the results looks like - a stadium used in the very first World Cup in Uruguay.

results['results']['bindings'][0]
{'FIFA_World_CupLabel': {'type': 'literal',
  'value': '1930 FIFA World Cup',
  'xml:lang': 'en'},
 'coord': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral',
  'type': 'literal',
  'value': 'Point(-56.152778 -34.894444)'},
 'countryLabel': {'type': 'literal', 'value': 'Uruguay', 'xml:lang': 'en'},
 'location': {'type': 'uri',
  'value': 'http://www.wikidata.org/entity/Q498245'},
 'locationLabel': {'type': 'literal',
  'value': 'Estadio Centenario',
  'xml:lang': 'en'}}
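
Each field in a binding is wrapped in a small dictionary with its type and value. When only the values matter, a binding can be flattened first. This is a small sketch using a hypothetical helper, `flatten_binding`, which the rest of the post does not rely on.

```python
def flatten_binding(binding):
    """Reduce a SPARQL JSON binding to a plain {variable: value} dictionary."""
    return {name: cell["value"] for name, cell in binding.items()}


# The example result shown above, copied as-is.
example = {
    "FIFA_World_CupLabel": {"type": "literal",
                            "value": "1930 FIFA World Cup",
                            "xml:lang": "en"},
    "coord": {"datatype": "http://www.opengis.net/ont/geosparql#wktLiteral",
              "type": "literal",
              "value": "Point(-56.152778 -34.894444)"},
    "countryLabel": {"type": "literal", "value": "Uruguay", "xml:lang": "en"},
    "location": {"type": "uri",
                 "value": "http://www.wikidata.org/entity/Q498245"},
    "locationLabel": {"type": "literal",
                      "value": "Estadio Centenario",
                      "xml:lang": "en"},
}

row = flatten_binding(example)
```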

Step 2: Processing the results

I then want to turn the results into nicely formatted data for plotting on a map. I'm looking for data that contains one record for each stadium, even if it has hosted games at more than one World Cup (e.g. stadiums in Mexico, which hosted in both 1970 and 1986).

The co-ordinates for each location come in WKT format, so I use a library called Shapely to extract the latitude and longitude.

# for converting coordinates
import shapely.wkt
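
For simple points like these, the WKT string can also be picked apart by hand, which makes the order of the two numbers explicit: longitude first, then latitude. This is a sketch of an alternative to Shapely that only handles two-dimensional points; `parse_wkt_point` is a hypothetical helper, not something the notebook uses.

```python
import re

# Matches strings such as "Point(-56.152778 -34.894444)", ignoring case.
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*$", re.IGNORECASE
)


def parse_wkt_point(wkt):
    """Return (longitude, latitude) from a two-dimensional WKT point."""
    match = POINT_RE.match(wkt)
    if match is None:
        raise ValueError("not a WKT point: {!r}".format(wkt))
    return float(match.group(1)), float(match.group(2))


lng, lat = parse_wkt_point("Point(-56.152778 -34.894444)")
```

Shapely remains the better choice as soon as anything other than a plain point turns up.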

Then I go through each of the results and add it to a Python dictionary keyed by stadium. If the stadium is already in the dictionary I just append the extra World Cup year to its record, rather than adding a new record.

stadia = {}
for result in results["results"]["bindings"]:
    stadium_id = result["location"]["value"]
    worldcup = result["FIFA_World_CupLabel"]["value"].replace(" FIFA World Cup","")
    if stadium_id in stadia:
        stadia[stadium_id]["worldcups"].append(worldcup)
    else:
        stadia[stadium_id] = {
            "lat_lng": shapely.wkt.loads(result["coord"]["value"]).coords[0],
            "worldcups": [worldcup],
            "stadium": result["locationLabel"]["value"],
            "country": result["countryLabel"]["value"],
        }
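
To see the de-duplication on its own, here is the same idea on a few made-up rows, written with `dict.setdefault` instead of an if/else. The stadium IDs and names are placeholders rather than real Wikidata entities, and `group_by_stadium` is a hypothetical helper.

```python
def group_by_stadium(rows):
    """Collapse (stadium_id, stadium, worldcup) rows into one record per stadium."""
    grouped = {}
    for stadium_id, stadium, worldcup in rows:
        # setdefault creates the record the first time a stadium is seen
        # and returns the existing record on every later visit.
        record = grouped.setdefault(
            stadium_id, {"stadium": stadium, "worldcups": []}
        )
        record["worldcups"].append(worldcup)
    return grouped


# Made-up rows: stadium A appears twice, once per tournament.
rows = [
    ("example:stadium-a", "Stadium A", "1970"),
    ("example:stadium-b", "Stadium B", "1970"),
    ("example:stadium-a", "Stadium A", "1986"),
]
grouped = group_by_stadium(rows)
```

Because the query results are ordered by World Cup label, the years in each record come out in chronological order.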

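Here's what an entry in the processed data looks like. I've used the Wikidata URI as an identifier for each stadium. Note that, despite its name, "lat_lng" holds the pair in WKT order, so longitude comes first and latitude second.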

stadia['http://www.wikidata.org/entity/Q498245']
{'country': 'Uruguay',
 'lat_lng': (-56.152778, -34.894444),
 'stadium': 'Estadio Centenario',
 'worldcups': ['1930']}
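
The entry above stores the pair as (longitude, latitude), while Leaflet-based maps expect latitude first. A small conversion function keeps that swap in one place. This is a sketch with a hypothetical helper name, `leaflet_location`; the range check is an extra safeguard I've assumed is useful, not something the notebook does.

```python
def leaflet_location(lng_lat):
    """Convert a (longitude, latitude) pair to a [latitude, longitude] list."""
    lng, lat = lng_lat
    # A latitude outside -90..90 usually means the pair was already swapped.
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        raise ValueError("co-ordinates out of range: {!r}".format(lng_lat))
    return [lat, lng]


# The Estadio Centenario entry from above.
centenario = {"lat_lng": (-56.152778, -34.894444)}
location = leaflet_location(centenario["lat_lng"])
```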

Step 3: Mapping the results

I really like folium for easily producing Leaflet-based maps in Python. I'm also going to use the MarkerCluster plugin to cluster the markers, with one cluster group per country, to make it easier to view all the stadia on one map.

import folium
from folium.plugins import MarkerCluster
import html

First I initialise the map and zoom out so you can see the whole world.

m = folium.Map(
    location=[20,0],
    zoom_start=2,
    tiles='Stamen Toner',
    attr='''<a id="home-link" target="_top" href="../">Map tiles</a> by 
    <a target="_top" href="http://stamen.com">Stamen Design</a>, 
    under <a target="_top" href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a>. 
    Data by <a target="_top" href="http://openstreetmap.org">OpenStreetMap</a>, 
    under <a target="_top" href="http://creativecommons.org/licenses/by-sa/3.0">CC BY SA</a>.
    | Locations powered by <a href="https://query.wikidata.org/">Wikidata</a>.'''
)

Then we go through the stadia and add each one to a cluster based on its country. I've also added a little popup which tells you the stadium's name and which World Cups it hosted games at. I also set a football icon for the pins.

clusters = {}
for stadium_id in stadia:

    s = stadia[stadium_id]

    if s["country"] not in clusters:
        clusters[s["country"]] = MarkerCluster().add_to(m)

    folium.Marker(
        [s["lat_lng"][1], s["lat_lng"][0]], 
        popup='{}, {} - <i>{}</i>'.format(
            html.escape(s["stadium"]), 
            html.escape(s["country"]),
            html.escape(", ".join(s["worldcups"]))
        ),
        icon=folium.Icon(icon='soccer-ball-o', prefix='fa')
    ).add_to(clusters[s["country"]])
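
The labels come from a database that anyone can edit, so they are escaped before being placed in the popup's HTML. Pulling the formatting into a function makes that easy to check in isolation. This is a sketch; `popup_html` is a hypothetical helper that mirrors the format string used above.

```python
import html


def popup_html(stadium, country, worldcups):
    """Build the popup text, escaping every value that came from Wikidata."""
    return "{}, {} - <i>{}</i>".format(
        html.escape(stadium),
        html.escape(country),
        html.escape(", ".join(worldcups)),
    )


popup = popup_html("Estadio Centenario", "Uruguay", ["1930"])

# A made-up label containing markup, to show what escaping does to it.
risky = popup_html("<b>Test</b> & Co", "Nowhere", ["1930"])
```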

Finally we show the resulting map, which can be zoomed and panned to look at particular countries.

m

As an extra I wanted to convert the data into GeoJSON format so it's easy to use elsewhere.

import geojson  # needed for geojson.dump below
from geojson import Feature, Point, FeatureCollection

# lat_lng is stored as (longitude, latitude), which is the order GeoJSON expects
wc_geojson = FeatureCollection(
    [Feature(geometry=Point(stadia[s]["lat_lng"]), 
             properties=stadia[s]) for s in stadia]
)
with open('world_cup_stadia.geojson', 'w') as a:
    geojson.dump(wc_geojson, a, indent=4)
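
The GeoJSON structure is simple enough to build with plain dictionaries, which shows what ends up in the file. This is a sketch of an alternative to the geojson library, using hypothetical helpers `to_feature` and `to_feature_collection`. Unlike the code above, it leaves the co-ordinates out of the properties and stores the Wikidata URI as the feature's id.

```python
import json


def to_feature(stadium_id, record):
    """Turn one stadium record into a GeoJSON Feature dictionary."""
    lng, lat = record["lat_lng"]  # stored as (longitude, latitude)
    properties = {k: v for k, v in record.items() if k != "lat_lng"}
    return {
        "type": "Feature",
        "id": stadium_id,
        # GeoJSON positions are [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lng, lat]},
        "properties": properties,
    }


def to_feature_collection(stadia):
    """Wrap every stadium record in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(sid, rec) for sid, rec in stadia.items()],
    }


# The Estadio Centenario entry shown earlier.
sample = {
    "http://www.wikidata.org/entity/Q498245": {
        "country": "Uruguay",
        "lat_lng": (-56.152778, -34.894444),
        "stadium": "Estadio Centenario",
        "worldcups": ["1930"],
    }
}
collection = to_feature_collection(sample)
geojson_text = json.dumps(collection, indent=4)
```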

Step 4: Taking it further

This was just a quick exercise to try and get data out of Wikidata and then use it. There are a few things that could be done to take it further:

  • add filters to the map to filter by country, World Cup, etc.
  • see if Wikidata has data on the matches that took place at each location and the teams that have played there, allowing you to filter by team or stage of the competition.
  • visualise the data by adding in details like the maximum attendance
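
As a starting point for the first idea, the processed data can be filtered before any markers are created. This only sketches the data side, not an interactive control on the map. `filter_stadia` is a hypothetical helper, and the second sample record is a made-up placeholder.

```python
def filter_stadia(stadia, country=None, worldcup=None):
    """Return only the stadia matching the given country and/or World Cup year."""
    return {
        stadium_id: record
        for stadium_id, record in stadia.items()
        if (country is None or record["country"] == country)
        and (worldcup is None or worldcup in record["worldcups"])
    }


sample = {
    # Real entry from the processed data shown earlier.
    "http://www.wikidata.org/entity/Q498245": {
        "country": "Uruguay",
        "stadium": "Estadio Centenario",
        "worldcups": ["1930"],
    },
    # Made-up placeholder record, not a real Wikidata entity.
    "example:stadium-a": {
        "country": "Mexico",
        "stadium": "Stadium A",
        "worldcups": ["1970", "1986"],
    },
}

in_uruguay = filter_stadia(sample, country="Uruguay")
used_in_1986 = filter_stadia(sample, worldcup="1986")
```

The filtered dictionary has the same shape as `stadia`, so it could be passed through the marker loop from Step 3 unchanged.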

Acknowledgements

Wikidata stamp