If something doesn't work, you can report a problem.
This notebook presents a step-by-step guide to retrieving the names of places mentioned in the Impresso corpus for a given search query and plotting them on a map.
To access the data, you will need an Impresso account. If you do not have one yet, you can register on the Impresso Datalab Website.
In this notebook, you will learn how to:
These are some useful resources to complement this notebook's content:
Install dependencies:
Run the following cell to install the required packages:
impresso-py - See documentation
ipyleaflet - See documentation
You may need to restart the kernel to use updated packages. To do so, on Google Colab, go to Runtime and select Restart session.
%pip install -q ipyleaflet
from packaging.version import Version

MIN_VERSION = "0.9.15"
try:
    from impresso import version
    assert Version(version) >= Version(MIN_VERSION)
    print(f"✔ impresso {version} is installed and up to date.")
except (ImportError, AssertionError):
    %pip install --upgrade --force-reinstall impresso
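The cell above compares parsed versions rather than raw strings, because string comparison is lexicographic and would rank "0.9.2" above "0.9.15". The sketch below illustrates the difference. The parse_simple_version helper is hypothetical, a simplified stand-in for packaging.version.Version that handles only dotted integers (no pre-release or other suffixes).

```python
def parse_simple_version(text: str) -> tuple[int, ...]:
    """Hypothetical helper: turn '0.9.15' into (0, 9, 15)."""
    return tuple(int(part) for part in text.split("."))


def is_up_to_date(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_simple_version(installed) >= parse_simple_version(minimum)


# Plain string comparison goes character by character, and "2" sorts after "1",
# so it wrongly treats 0.9.2 as newer than 0.9.15.
print("0.9.2" >= "0.9.15")                # True (string comparison)
print(is_up_to_date("0.9.2", "0.9.15"))   # False (numeric comparison)
```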
By running the following cell, you create an instance of the Impresso client and authenticate it with the Impresso API.
The client variable stores an instance of ImpressoClient, which establishes a connection to the API using your authentication token. With this object, you can interact with the API to perform operations such as searching for content items, retrieving entities, and fetching facets.
The following command will prompt you to enter your Impresso token if it has not been authenticated recently (it expires after 8 hours). Paste your API token in the form and press Enter.
To get access to an Impresso API token, go to Impresso Datalab and select Get API Token on the menu.
from impresso import connect, OR, DateRange
client = connect()
Let's begin by searching for the top 100 place entities (or locations) mentioned in content items that talk about nuclear power plants in the first three decades following the Second World War.
In Impresso, a Content Item is the smallest unit of editorial content within a newspaper or radio collection. This can be an article (for newspapers) or a radio show or episode (for radio programs). Content items can also vary by type, including articles, advertisements, tables, images, and more. Please note that when a newspaper has no segmentation (OLR, Optical Layout Recognition), the content items for that title correspond to pages.
The search uses French, English and German keywords. Results are sorted by the number of location mentions in descending order (order_by="-count").
locations = client.search.facet(
    "location",
    term=OR("centrale nucléaire", "nuclear power plant", "Kernkraftwerk"),
    date_range=DateRange("1945-01-01", "1975-01-01"),
    # Increasing the limit above 100 might break the code.
    limit=100,
    order_by="-count"
)
locations
Now, you can get entities' metadata, including Wikidata details.
Impresso links place entities to Wikidata. After the models recognise a place entity, it is linked to a unique identifier (its Wikidata ID). For Switzerland, for example, the identifier is 'Q39'. This helps, for instance, to disambiguate entities. If an article mentions Washington, Wikidata has different IDs for the former US president George Washington, the US capital city Washington, D.C., and the US state of Washington.
# Store place entities in a list
entities_ids = locations.df.index.tolist()
# Retrieve metadata on entities
entities = client.entities.find(entity_id=OR(*entities_ids), resolve=True, limit=len(entities_ids))
entities
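To make the Washington example concrete, the sketch below lists three Wikidata items that share that surface form. The lookup table and the describe helper are illustrative only and are not part of the Impresso API; the descriptions are paraphrased.

```python
# Illustrative lookup table: three entities that share the surface form "Washington".
# The keys are the public Wikidata identifiers (QIDs) of these items.
WASHINGTON_CANDIDATES = {
    "Q23": "George Washington, first president of the United States",
    "Q61": "Washington, D.C., capital city of the United States",
    "Q1223": "Washington, state of the United States",
}


def describe(wikidata_id: str) -> str:
    """Hypothetical helper: return the description for a Wikidata ID."""
    return WASHINGTON_CANDIDATES.get(wikidata_id, "unknown entity")


for qid in WASHINGTON_CANDIDATES:
    print(qid, "->", describe(qid))
```

Because each mention is stored with its identifier rather than its text, counts for the city and the state are never merged.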
In order to plot a map, you need the geographic coordinates of each location. However, not all locations have this information.
The next cell therefore filters out entities that have no coordinates.
import pandas as pd
# enable copy-on-write so that adding columns to a filtered dataframe does not raise SettingWithCopyWarning
pd.options.mode.copy_on_write = True
df = entities.df
entities_with_coordinates = df[df['wikidataDetails.coordinates.latitude'].notna() & df['wikidataDetails.coordinates.longitude'].notna()]
# as dataframe has too many columns, we are selecting only the ones we want to see in the output for now
selected_columns = entities_with_coordinates[['label', 'wikidataId', 'totalMentions', 'wikidataDetails.coordinates.latitude', 'wikidataDetails.coordinates.longitude', 'wikidataDetails.descriptions.fr']]
selected_columns
# You can check which entities were removed by running this code (if the result is empty, all entities have coordinates).
# An entity is removed when either its latitude or its longitude is missing, hence the "|" (or).
entities_without_coordinates = df[df['wikidataDetails.coordinates.latitude'].isna() | df['wikidataDetails.coordinates.longitude'].isna()]
entities_without_coordinates[['label', 'wikidataId', 'wikidataDetails.coordinates.latitude', 'wikidataDetails.coordinates.longitude', 'wikidataDetails.descriptions.fr']]
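A sketch of the same filter on toy data: negating the "both coordinates present" mask guarantees that every row lands in exactly one of the two groups, including rows where only one coordinate is missing. The short column names (lat, lon) and the rows are placeholders, not Impresso's actual columns or data.

```python
import pandas as pd

# Toy data; the column names and rows are placeholders.
toy = pd.DataFrame(
    {
        "label": ["Bern", "Atlantis", "Half-known place"],
        "lat": [46.948, None, 12.0],
        "lon": [7.447, None, None],
    }
)

# True only when both coordinates are present.
has_both = toy["lat"].notna() & toy["lon"].notna()

with_coordinates = toy[has_both]
# The complement of "both present" is "at least one missing".
without_coordinates = toy[~has_both]

print(with_coordinates["label"].tolist())
print(without_coordinates["label"].tolist())
```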
In the last column of the output for 'selected_columns', you see a description (in French) of every Wikidata ID. You can find the ones that are countries by checking whether the word 'pays' appears in column 'wikidataDetails.descriptions.fr'. If true, then we add a country tag to the new column 'is_country'.
# Check for the presence of the word 'pays' and assign True or False to the new column 'is_country'
# (na=False marks entities without a French description as False instead of NaN)
entities_with_coordinates['is_country'] = entities_with_coordinates['wikidataDetails.descriptions.fr'].str.contains('pays', na=False)
# Again, we are selecting only the columns we want to see in the output for now
selected_columns_countries = entities_with_coordinates[['label', 'wikidataId', 'totalMentions', 'wikidataDetails.descriptions.fr', 'is_country']]
selected_columns_countries
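Substring matching is a heuristic: 'pays' also occurs inside longer words such as 'paysage'. The sketch below is a slightly stricter variant that matches 'pays' only as a whole word and treats missing descriptions as non-countries. The looks_like_country helper is hypothetical and the sample descriptions are made up.

```python
import re

# Match "pays" only as a whole word, ignoring case.
COUNTRY_PATTERN = re.compile(r"\bpays\b", flags=re.IGNORECASE)


def looks_like_country(description) -> bool:
    """Hypothetical helper: True if a French description contains the word 'pays'."""
    if not isinstance(description, str):
        # Missing descriptions (None or NaN) are treated as "not a country".
        return False
    return COUNTRY_PATTERN.search(description) is not None


print(looks_like_country("pays d'Europe centrale"))  # True
print(looks_like_country("peintre de paysages"))     # False
print(looks_like_country(None))                      # False
```

Even with word boundaries this remains a heuristic, since a description can mention the word 'pays' without the entity itself being a country.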
Now, look up how many times each place entity is mentioned in the search results (the 'count' column of locations) and add this value to a new column named 'mentions_count'.
entities_with_coordinates['mentions_count'] = entities_with_coordinates.index.map(locations.df['count'])
# output shows the top 5 most mentioned entities in the documents you 'collected'
entities_with_coordinates[['label', 'wikidataId', 'wikidataDetails.descriptions.fr', 'mentions_count']].sort_values(by='mentions_count', ascending=False).head()
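The lookup relies on both tables sharing the same identifiers in their index. A minimal sketch on toy data shows what Index.map does with a Series, including the case of an ID that has no count. The IDs, labels and counts here are made up.

```python
import pandas as pd

# Toy facet result: mention counts indexed by entity ID (IDs are made up).
counts = pd.Series({"id-bern": 12, "id-paris": 30}, name="count")

# Toy entity table indexed by the same kind of ID; "id-rome" has no count.
toy_entities = pd.DataFrame(
    {"label": ["Bern", "Paris", "Rome"]},
    index=["id-bern", "id-paris", "id-rome"],
)

# Index.map looks each index value up in the Series; unmatched IDs become NaN.
toy_entities["mentions_count"] = toy_entities.index.map(counts)

print(toy_entities.sort_values(by="mentions_count", ascending=False))
```

If the two indexes did not use the same identifiers, every value in the new column would be NaN, which is a quick way to spot a mismatch.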
Below are some functions used to calculate extra details needed to plot data on a map.
The first one finds the geographic bounds of a group of items, that is, the smallest rectangle that contains all their coordinates. The map uses these bounds to choose its initial view.
def find_bounds(coordinates):
    """
    Finds the top/left, bottom/right bounds of an area that fits all coordinates.

    Args:
        coordinates: A list of coordinate tuples (latitude, longitude).

    Returns:
        A tuple containing the top/left and bottom/right bounds:
        ((top_lat, left_lon), (bottom_lat, right_lon))
    """
    if not coordinates:
        return None
    min_lat = coordinates[0][0]
    max_lat = coordinates[0][0]
    min_lon = coordinates[0][1]
    max_lon = coordinates[0][1]
    for lat, lon in coordinates:
        min_lat = min(min_lat, lat)
        max_lat = max(max_lat, lat)
        min_lon = min(min_lon, lon)
        max_lon = max(max_lon, lon)
    return ((max_lat, min_lon), (min_lat, max_lon))
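For reference, the same bounds can be computed more compactly with zip, min and max. This is an equivalent sketch with a worked example, not a replacement for the function above. The bounding_box name is hypothetical and the sample coordinates are approximate.

```python
def bounding_box(coordinates):
    """Return ((max_lat, min_lon), (min_lat, max_lon)) for (lat, lon) pairs."""
    if not coordinates:
        return None
    # zip(*...) regroups the pairs into one tuple of latitudes and one of longitudes.
    lats, lons = zip(*coordinates)
    return ((max(lats), min(lons)), (min(lats), max(lons)))


# Approximate coordinates for Bern, Paris and Berlin.
sample = [(46.95, 7.45), (48.86, 2.35), (52.52, 13.40)]
print(bounding_box(sample))  # ((52.52, 2.35), (46.95, 13.4))
```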
The second function creates the HTML code used to render the hover pop-up. This lets you build an interactive map with the ipyleaflet package.
from ipywidgets import HTML
from ipyleaflet import Popup

def build_hover_popup(title: str, subtitle: str, mentions: int) -> Popup:
    message = HTML()
    message.value = f"""
    <div style="display: flex; flex-direction: column; color: black; line-height: normal; max-width: 200px;">
        <b>{title}</b>
        <p>{subtitle}</p>
        <b>Mentions: {mentions}</b>
    </div>
    """
    # Popup with a given location on the map:
    popup = Popup(
        # location=center,
        child=message,
        close_button=False,
        auto_close=True,
        close_on_escape_key=False
    )
    return popup
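Labels and descriptions are inserted into the HTML as-is, so characters such as < or & in a Wikidata description could distort the pop-up. A minimal sketch of an escaped variant follows. The build_popup_html helper is hypothetical; it returns only the HTML string and assumes that title and subtitle are already strings.

```python
from html import escape


def build_popup_html(title: str, subtitle: str, mentions: int) -> str:
    """Hypothetical variant that returns only the HTML string, with escaped text."""
    return (
        '<div style="display: flex; flex-direction: column; color: black; '
        'line-height: normal; max-width: 200px;">'
        f"<b>{escape(title)}</b>"
        f"<p>{escape(subtitle)}</p>"
        f"<b>Mentions: {int(mentions)}</b>"
        "</div>"
    )


# Made-up values containing characters that have a special meaning in HTML.
print(build_popup_html("Saint-Louis <Missouri>", "city & port", 42))
```

The returned string could then be assigned to the value of an ipywidgets HTML widget, as in the function above.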
Display entities on a map with pin size based on the number of mentions (more mentions = bigger pin). The pins are coloured by entity type: country (red) or location (black). A location can be a city, state, region, commune, etc.
Be mindful of people with colour blindness when choosing the colours of your visuals. Here, we opted for red and black due to the high contrast. The internet is full of tools to generate accessible palettes, which you can consult before deciding on the best colours.
from ipyleaflet import Map, Marker, AwesomeIcon, CircleMarker

map = Map(zoom=0)
country_icon = AwesomeIcon(
    name='fa-globe',
    marker_color='red',
    spin=False,
)
place_icon = AwesomeIcon(
    name='fa-building-o',
    marker_color='green',
    spin=False,
)
max_mentions_count = entities_with_coordinates['mentions_count'].max()
coordinates = []
markers = []
# Build markers
for index, row in entities_with_coordinates.iterrows():
    lat = row['wikidataDetails.coordinates.latitude']
    lon = row['wikidataDetails.coordinates.longitude']
    label = row['wikidataDetails.labels.en']
    description = row['wikidataDetails.descriptions.en']
    is_country = row['is_country']
    radius = (row['mentions_count'] / max_mentions_count) * 20
    marker = CircleMarker(
        location=(lat, lon),
        draggable=False,
        title=label,
        color="red" if is_country else "black",
        fill_color="red" if is_country else "black",
        radius=int(radius)
    )
    marker.popup = build_hover_popup(label, description, row['mentions_count'])
    coordinates.append((lat, lon))
    markers.append(marker)
# Fit the map to the bounds
map.fit_bounds(find_bounds(coordinates))
# add markers
for m in markers:
    map += m
display(map)
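In the loop above, int(radius) truncates, so an entity whose count is below one twentieth of the maximum gets a radius of 0. One possible adjustment, sketched here under assumed values, is to clamp the radius to a minimum. The scale_radius helper and its min_radius default of 3 are hypothetical choices, not part of the original notebook.

```python
def scale_radius(count, max_count, max_radius=20, min_radius=3):
    """Hypothetical helper: map a mention count to a marker radius in pixels."""
    if max_count <= 0:
        return min_radius
    scaled = round(count / max_count * max_radius)
    # Clamp so that rarely mentioned places are still visible on the map.
    return max(min_radius, min(max_radius, scaled))


for count in (1, 50, 400, 1000):
    print(count, scale_radius(count, max_count=1000))
```

The trade-off is that all rarely mentioned places share the same minimum size, so small differences between them are no longer visible.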
This is an alternative visualisation. It displays entities on a map with a heatmap overlay. In the code below every entity contributes equally, so the colour intensity reflects how densely entities cluster rather than how often each one is mentioned. This map style is, however, considerably less accessible. It's worth considering which map best meets your needs and the needs of your audiences.
from ipyleaflet import Map, Heatmap

map = Map(zoom=0)
# Use a new name so that the 'locations' search result from earlier is not overwritten
heatmap_points = []
for index, row in entities_with_coordinates.iterrows():
    lat = row['wikidataDetails.coordinates.latitude']
    lon = row['wikidataDetails.coordinates.longitude']
    # add every coordinate 30 times to make the heatmap more visible
    heatmap_points.extend([(lat, lon) for i in range(30)])
heatmap = Heatmap(
    locations=heatmap_points,
    radius=20,
    blur=10,
)
map.add(heatmap)
map
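If the heatmap should also reflect how often each place is mentioned, one option is to attach a weight to every point. The sketch below only builds the list of points. It assumes, as in the example in ipyleaflet's Heatmap documentation, that a third value per point is read as that point's intensity. The build_weighted_points helper and the 0 to 1 scale are hypothetical and should be checked against the heatmap layer's settings.

```python
def build_weighted_points(rows, max_intensity=1.0):
    """Hypothetical helper: turn (lat, lon, mentions) rows into [lat, lon, intensity]."""
    max_mentions = max((mentions for _, _, mentions in rows), default=0)
    if max_mentions <= 0:
        return []
    # Scale each count relative to the most mentioned place.
    return [
        [lat, lon, max_intensity * mentions / max_mentions]
        for lat, lon, mentions in rows
    ]


# Toy rows: (latitude, longitude, mentions_count); the counts are made up.
rows = [(46.95, 7.45, 10), (48.86, 2.35, 40)]
points = build_weighted_points(rows)
print(points)
# Assumption: this list could then be passed as Heatmap(locations=points, radius=20, blur=10).
```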
This notebook provided you with a step-by-step guide to retrieve place entities from the Impresso corpus and visualise them on a map.
It is important to keep in mind that only places that have been tagged as 'location' entities in the Impresso corpus are included in this visualisation. Because of the way Named Entity Recognition (NER) works, some places mentioned in the texts may not be recognised as 'location' by the algorithms. Those places will not be shown on the map. For more information on NER, check our FAQ.
That's it for now! Next, you can explore the Exploring Entity Co-occurrence Networks notebook, which demonstrates how to create network graphs to represent persons that are mentioned in the Impresso corpus.
Writing - Original draft: Roman Kalyakin. Conceptualization: Marten Düring. Software: Roman Kalyakin. Writing - Review & Editing: Marten Düring, Caio Mello. Validation: Martin Grandjean, Kirill Veprikov, Cao Vy. Datalab editorial board: Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout, Cao Vy. Data curation & Formal analysis: Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. Methodology: Roman Kalyakin. Supervision: Marten Düring. Funding acquisition: Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.
This notebook is published under CC BY 4.0 License
For feedback on this notebook, please send an email to info@impresso-project.ch
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
All Impresso code is published open source under the GNU Affero General Public License v3 or later.