Exploring Entity Co-occurrence Networks

If something doesn't work, you can report a problem.

What is this notebook about?

This notebook guides you step-by-step through how to create network graphs that represent people's connections in historical newspapers. We define people's connections based on whether their names are mentioned in the same content item.

In Impresso, a Content Item is the smallest unit of editorial content within a newspaper or radio collection. This can be an article (for newspapers) or a radio show or episode (for radio programs). Content items can also vary by type, including articles, advertisements, tables, images, and more. Please note that when a newspaper title has no segmentation (OLR - Optical Layout Recognition), the content items for this title correspond to pages.

With this notebook, you can produce a representation of media narratives by looking at how the press has associated people with one another. Interpreting this association, however, can be tricky. It does not necessarily mean that those people had any kind of relationship. It only means that their names were mentioned in, for example, the same news article. Understanding the reasons behind a co-occurrence typically requires further contextual or qualitative analysis.
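
As a minimal sketch of this definition, the snippet below counts co-occurrences in three made-up content items. The item contents and the person names are hypothetical and only illustrate the counting; the notebook itself obtains these counts from the Impresso API.

```python
from collections import Counter
from itertools import combinations

# Hypothetical content items, each reduced to the set of persons it mentions
content_items = [
    {"Person A", "Person B"},
    {"Person A", "Person B", "Person C"},
    {"Person C"},
]

# Count every unordered pair of persons mentioned in the same item
pair_counts = Counter()
for persons_in_item in content_items:
    for pair in combinations(sorted(persons_in_item), 2):
        pair_counts[pair] += 1

print(pair_counts)
```

Here Person A and Person B co-occur twice, which says nothing about why they were mentioned together.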

What will you learn?

By completing this notebook, you will learn how to:

Retrieve a list of persons (named entities) mentioned in content items for a given query;
Transform this list of entities into a dataframe suitable for generating co-occurrence network graphs;
Create and display an interactive network graph to visualise connections between persons mentioned together in Impresso content;
Export the resulting dataframes as CSV files to support reproducibility;
Save the network graph in different formats (png, svg, gexf, and json) for further analysis.

Useful resources

If you’d like to go deeper into network analysis or its use in historical research, the following resources are recommended:

From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources: A conceptual and practical guide to extracting structured data from historical sources and creating meaningful network visualizations.
Exploring and Analyzing Network Data with Python: An introduction to working with the NetworkX package and drawing conclusions from network metrics when working with humanities data.

Prerequisites

Install dependencies:

You may need to restart the kernel to use updated packages. To do so, on Google Colab, go to Runtime and select Restart session.

%pip install -q ipysigma networkx tqdm

from packaging.version import Version

MIN_VERSION = "0.9.15"
try:
    from impresso import version
    assert Version(version) >= Version(MIN_VERSION)
    print(f"✔ impresso {version} is installed and up to date.")
except (ImportError, AssertionError):
    %pip install --upgrade --force-reinstall impresso

Connect to Impresso:

The following command will prompt you to enter your Impresso token if it has not been authenticated recently (it expires after 8 hours).

To get access to an Impresso API token, go to Impresso Datalab and select Get API Token on the menu.

from impresso import connect, OR, AND

client = connect()

Part 1. Prepare your data

Get entities and their co-occurrences

First, we retrieve the 100 most frequently mentioned person entities in all articles that talk about the Prague Spring, using the search facets method from the Impresso Python library.

Facets are properties of the data. In Impresso, some facets you can use to filter articles are 'language', 'newspaper', 'date of publication', etc. With the search facets method, you can also summarise the results by, for example, 'person', as shown in the example below.
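
To make the idea of a facet concrete, here is a small sketch on made-up data. The items and identifiers below are hypothetical, and the counting is done locally with `collections.Counter`; it is not how the Impresso API computes facets, only an illustration of what a facet summarises.

```python
from collections import Counter

# Hypothetical search results: each matched item carries a few properties
matched_items = [
    {"language": "fr", "persons": ["person-1", "person-2"]},
    {"language": "de", "persons": ["person-1"]},
    {"language": "fr", "persons": ["person-1", "person-3"]},
]

# A facet summarises one property across all matched items
language_facet = Counter(item["language"] for item in matched_items)
person_facet = Counter(p for item in matched_items for p in item["persons"])

# Most frequent values first, similar in spirit to ordering by descending count
print(language_facet.most_common())
print(person_facet.most_common(1))
```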

query = OR("Prague Spring", "Prager Frühling", "Printemps de Prague")

persons = client.search.facet(
  facet="person",
  term=query,
  order_by="-count",
  limit=100
)
persons

The result is a list of the 100 most frequently mentioned person entities, where each entry includes:

a unique identifier (value),
the number of times the person is mentioned (count),
and the display name (label).

These 100 entries are the most frequent out of a total of 2,355 persons mentioned in all matched content items.

Next, we generate all unique pairs among the entities whose mention count is higher than n. This filters out entities that are mentioned only a few times.

For now, we are just combining all the entities in pairs. The documents in which they occur will be found later.

First, entities that meet the mention threshold are selected, and then all possible pairs are generated using the itertools.combinations function.

The n value can be adjusted so that we don't get too many entity combinations. Aim for just under 500 combinations: staying below that number is typically recommended to avoid API throttling.
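
The number of pairs grows quadratically with the number of selected persons, which is why a small change in n can matter a lot. The sketch below uses only the standard library to show how many persons fit within the limit of 500; the values of k are arbitrary examples.

```python
from math import comb

MAX_COMBINATIONS = 500  # limit used in this notebook

# The number of unordered pairs for k selected persons is k * (k - 1) / 2
for k in (10, 20, 32, 33, 50):
    print(f"{k} persons -> {comb(k, 2)} pairs")

# Largest number of persons that stays within the limit
max_persons = max(k for k in range(2, 101) if comb(k, 2) <= MAX_COMBINATIONS)
print(f"At most {max_persons} persons keep the pair count at or below {MAX_COMBINATIONS}.")
```

If the cell below reports more selected persons than this, increasing n is the simplest way to get back under the limit.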

import itertools

n = 30

df = persons.df
df = df[df["count"] > n]
persons_ids = df.index.tolist()
print(f"Total persons selected: {len(persons_ids)}")

person_ids_combinations = list(itertools.combinations(persons_ids, 2))
print(f"Total combinations: {len(person_ids_combinations)}")

# The code below raises an Exception if the number of combinations exceeds 500.
# If this happens to you, try increasing the value of 'n'.

if len(person_ids_combinations) > 500:
  msg = (
      f"The number of combinations is quite high ({len(person_ids_combinations)}). " +
      "This may put a lot of load on Impresso and your requests may be throttled. " +
      "Try to increase the threshold number of mentions in the cell above which will reduce the number of selected persons. " +
      "You can also disable this error by commenting out this cell, if this number of combinations is expected."
  )
  raise Exception(msg)

Find articles where the entity pairs occur

We also retrieve the dates and the number of articles where person entity pairs occur.

This piece of code gets a facet for every combination of named entities. It is a single call per combination so it may take a while for a large number of combinations.

from impresso.util.error import ImpressoError
from time import sleep
from tqdm import tqdm

connections = []

# Iterate over entity combinations and build a query from each pair, faceting on `daterange`.
# The `query` variable holds the same value as above, i.e. the keyword search for articles.
for idx, combo in tqdm(enumerate(person_ids_combinations), total=len(person_ids_combinations)):
  result = None  # reset so that a failed request never reuses the previous pair's result
  try:
    result = client.search.facet(
      facet="daterange",
      term=query,
      entity_id=AND(*combo),
      limit=1000
    )
  except ImpressoError as e:
    # a 429 status code means that the request has been throttled
    # we sleep for 2 seconds and try again
    if e.error.status == 429:
      print(f"Request throttled for {combo}. Retrying in 2s...")
      sleep(2)
      try:
        result = client.search.facet(
          facet="daterange",
          term=query,
          entity_id=AND(*combo),
          limit=1000
        )
      except ImpressoError as e2:
        print(f"Retry failed for {combo}: {e2}")
    else:
      # any other error is reported and the pair is skipped
      print(f"Error with {combo}: {e}")

  # keep the pair only if the request succeeded and returned at least one date
  if result is not None and result.size > 0:
    df = result.df

    items = list(zip(df.index.tolist(), df['count'].tolist(), [result.url] * len(df)))
    connections.append((combo, items))
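
The loop above retries a throttled request once after a fixed pause. A possible variant, sketched below, retries several times and doubles the pause each time (exponential backoff). This is not part of the Impresso library: `ThrottledError`, `call_with_retry` and `flaky_request` are hypothetical names, and the fake request stands in for a real API call.

```python
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 'too many requests' error."""

def call_with_retry(fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on ThrottledError wait and retry, doubling the delay each time."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == retries:
                raise  # out of retries: let the caller decide what to do
            sleep(delay)
            delay *= 2

# Usage with a fake request that is throttled twice before succeeding
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottledError()
    return "ok"

waits = []
# Record the waits in a list instead of actually sleeping
outcome = call_with_retry(flaky_request, sleep=waits.append)
print(outcome, waits)
```

In the notebook, the callable passed to such a helper would wrap the `client.search.facet` call, and the `except` clause would check for an `ImpressoError` with status 429 instead of the placeholder exception.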

We put everything in a dataframe. Each row represents a co-occurrence event between two named persons in the Impresso dataset, for a specific date.

The dataframe includes:

node_a, node_b: the unique identifiers of the co-mentioned persons;
timestamp: the date of publication of the articles where they co-occurred;
count: the number of articles on that date mentioning both entities;
url: a direct link to the matching articles in the Impresso web app.

import pandas as pd

connections_denormalised = []
for (node_a, node_b), edges in connections:
    for ts, count, url in edges:
        connections_denormalised.append([node_a, node_b, ts, count, url])

connections_df = pd.DataFrame(connections_denormalised, columns=('node_a', 'node_b', 'timestamp', 'count', 'url'))
connections_df

We then save the connections into a CSV file that can be visualised independently in Part 2. Here you will be prompted to provide a name for the file.

from tempfile import gettempdir

temp_dir = gettempdir()

connections_csv_filename = input("Enter the filename: ").replace(" ", "_")
connections_csv_filepath = f"{temp_dir}/{connections_csv_filename}.csv"
connections_df.to_csv(connections_csv_filepath, index=False)  # skip the row index so it is not read back as an extra column
print(f"File saved in {connections_csv_filepath}")

# download your csv file (if using Google Colab)
try:
    from google.colab import files
    files.download(connections_csv_filepath)
except ImportError:
    print("Google Colab not detected. Please download the file manually from the path above.")

Part 2: Visualise your data

Import the CSV file you created in Part 1:

import pandas as pd

connections_df = pd.read_csv(connections_csv_filepath)
connections_df

Now, we group results by frequency of pairs to create connections (edges), and count the number of connections. We also preserve the URL.

The URL does not contain date range information; it refers only to the search terms and the pair of persons occurring in the documents. It is therefore identical for all rows of a pair, which is why the rows can be grouped here.

grouped_connections_df = connections_df.groupby(['node_a', 'node_b']) \
    .agg({'timestamp': lambda x: ', '.join(list(x)), 'count': 'sum', 'url': lambda x: list(set(x))[0]}) \
    .reset_index()
grouped_connections_df
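
To see what this grouping does, here is the same kind of aggregation applied to a tiny, made-up table. The identifiers, dates and URLs are hypothetical; only the column names match the dataframe above.

```python
import pandas as pd

# Hypothetical rows in the same shape as connections_df
toy_df = pd.DataFrame(
    [
        ("person-1", "person-2", "1968-08-21", 3, "https://example.org/search-1-2"),
        ("person-1", "person-2", "1968-08-22", 2, "https://example.org/search-1-2"),
        ("person-1", "person-3", "1968-08-21", 1, "https://example.org/search-1-3"),
    ],
    columns=["node_a", "node_b", "timestamp", "count", "url"],
)

# One row per pair: dates joined into one string, counts summed, one URL kept
toy_grouped = (
    toy_df.groupby(["node_a", "node_b"])
    .agg({"timestamp": lambda x: ", ".join(x), "count": "sum", "url": "first"})
    .reset_index()
)
print(toy_grouped)
```

The two rows for person-1 and person-2 collapse into a single edge with a total count of 5, which later becomes the edge thickness in the graph.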

In the cell below, we use NetworkX, a Python library designed for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

Here, we start creating our network by defining the 'source' and 'target', as well as the edge attributes.

import networkx as nx

# Create a MultiGraph from the edge list with count and url as edge attributes
G = nx.from_pandas_edgelist(
    grouped_connections_df,
    source='node_a',
    target='node_b',
    edge_attr=['count', 'url'],
    create_using=nx.MultiGraph()
)

# Add a URL attribute to each node linking to its Impresso entity page
for i in sorted(G.nodes()):
    G.nodes[i]['url'] = f"https://impresso-project.ch/app/entities/{i}"
G.nodes

To ensure reproducibility, save the file so that it can be downloaded and used elsewhere.

The GEXF file format is compatible with other network analysis tools like Gephi.

from tempfile import gettempdir

temp_dir = gettempdir()

gefx_filename = input("Enter the gexf filename: ").replace(" ", "_")
gefx_filepath = f"{temp_dir}/{gefx_filename}.gexf"  # use the standard .gexf extension

nx.write_gexf(G, gefx_filepath)

print(f"File saved in {gefx_filepath}")

# Download file (if using Google Colab) 
try:
    from google.colab import files
    files.download(gefx_filepath)
except ImportError:
    print("Google Colab not detected. Please download the file manually from the path above.")
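
As a quick check that a GEXF file keeps the information needed later, the sketch below writes a tiny graph to a temporary file and reads it back. The graph, its identifiers and its URLs are hypothetical, and a plain `nx.Graph` is used for brevity.

```python
import os
import tempfile
import networkx as nx

# Hypothetical two-edge graph with the same edge attributes as above
toy_graph = nx.Graph()
toy_graph.add_edge("person-1", "person-2", count=5, url="https://example.org/search-1-2")
toy_graph.add_edge("person-1", "person-3", count=1, url="https://example.org/search-1-3")

# Write to a temporary .gexf file and read it back
toy_path = os.path.join(tempfile.gettempdir(), "toy_graph.gexf")
nx.write_gexf(toy_graph, toy_path)
reloaded = nx.read_gexf(toy_path)

print(sorted(reloaded.nodes()))
print(reloaded.edges["person-1", "person-2"]["count"])
```

The nodes and the `count` and `url` edge attributes survive the round trip, so the file can be opened in another tool or reloaded in a later session.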

If running in Colab, activate custom widgets to allow ipysigma to render the graph by running the cell below.

Ipysigma allows you to produce an interactive graph, as well as manipulate the graph's settings.

try:
    from google.colab import output
    output.enable_custom_widget_manager()
except ImportError:  # not running in Google Colab
    pass

Run the cell below to render the graph.

The output will prompt you to choose from a dropdown list 'what should represent the size of the nodes', i.e. which centrality measure should determine the size of the nodes in your graph. Select it before you continue. These measures help reveal the structural importance of each node within the network.

import ipywidgets

node_size_widget = ipywidgets.Dropdown(
    options=['Degree', 'Betweenness', 'Eigenvector', 'Closeness'],
    value='Degree',
    disabled=False,
    layout={'width': 'max-content'}
)
ipywidgets.Box(
    [
        ipywidgets.Label(value='What should represent the size of the nodes:'), 
        node_size_widget
    ]
)

The next cell reads the node size method chosen above and plots the visualisation.

If you want to change the centrality measure above, re-run the next cell to update the visualisation.

import networkx as nx
from ipysigma import Sigma

# Importing a gexf graph
g = nx.read_gexf(gefx_filepath)

node_size = None
# Read node size method
match node_size_widget.value:
    case 'Degree':
        node_size = g.degree
    case 'Betweenness':
        node_size = nx.betweenness_centrality(g)
    case 'Eigenvector':
        node_size = nx.eigenvector_centrality(g)
    case 'Closeness':
        node_size = nx.closeness_centrality(g)
    case _:
        node_size = g.degree

print(f"Node size method: {node_size_widget.value}.")
print("See the following link for more information about centrality measures: https://networkx.org/documentation/stable/reference/algorithms/centrality.html")

# node size based on the selected centrality measure
# edge thickness based on co-occurrence count
Sigma(g, node_size=node_size, edge_size='count', clickable_edges=True)
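
To get a feel for how the four measures differ, the sketch below computes them on a tiny, made-up graph. The node names and edges are hypothetical; the NetworkX functions are the same ones used in the cell above.

```python
import networkx as nx

# Hypothetical co-occurrence graph: A is mentioned with everyone, C and D also co-occur
toy = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")])

measures = {
    "Degree": dict(toy.degree),
    "Betweenness": nx.betweenness_centrality(toy),
    "Eigenvector": nx.eigenvector_centrality(toy),
    "Closeness": nx.closeness_centrality(toy),
}

# Print each node's score, rounded for readability
for name, scores in measures.items():
    rounded = {node: round(score, 3) for node, score in scores.items()}
    print(f"{name}: {rounded}")
```

In this toy graph A ranks highest on every measure, but the measures separate the other nodes differently: betweenness gives B, C and D the same score of zero, while closeness ranks C and D above B.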

The graph display allows you to download the visualisation in png, svg, gexf, and json formats.

Limitations

Producing network graphs can be a good strategy to explore mentions of people in your collections. However, it's important to be mindful of some limitations before interpreting your results.

In this notebook, we showed you how to produce network graphs based on entities retrieved from the Impresso API. These are entities linked to Wikidata, the same ones you see in the Impresso Web App. Linking entities to Wikidata helps to determine, for example, whether a mention of 'apple' refers to the company or the fruit, by linking the entity to a unique ID. However, if we look only at linked entities, we might be ignoring other important entities that either could not be linked to Wikidata due to technical limitations of the models, or that do not exist in Wikidata.

It is also possible that some entities that do exist in the content items do not show up in your network due to:

Grammatical errors or OCR mistakes (such as Fidel Castros, instead of Fidel Castro);
Difficulty in recognising persons when only surnames are mentioned (such as distinguishing between John F. Kennedy and Robert Kennedy);
Names with highly variable spellings across languages (such as Khrushchev, Khrouchtchev, or Chruschtschow).

Conclusion

This notebook provided you with a comprehensive pipeline to create network graphs using the Impresso corpus.

It is important to keep in mind that only persons who have been tagged as 'person' entities in the Impresso corpus will be added to this graph. Because of the way Named Entity Recognition (NER) works, it is possible that some people who are mentioned in the texts are not recognised as 'person' by the algorithms. In this case, those people will not be shown in the graph. For more information on NER, check our FAQ.

Next Steps

That's it for now! Next, you can explore the Visualising Place Entities on Maps notebook, which demonstrates how to visualise mentions of places in the Impresso corpus on a map.

Here are some other suggested resources:

Introduction to Social Network Analysis: YouTube tutorials by Martin Grandjean reviewing the main concepts of social network analysis and highlighting the challenges that arise when analyzing relational historical objects.

Demystifying Networks, Parts I & II by Scott B. Weingart: An older but still interesting resource with a simple introduction to networks, including concept definitions and key vocabulary.

The Six Degrees of Francis Bacon Project: A DH project that reconstructs the social network of early modern intellectual life in Britain and includes publications and methodology.

Historical Network Research Community: A hub for scholars working at the intersection of history and network analysis. Offers conference proceedings, reading lists, and tutorials.

Project and License info

Notebook credits

Writing - Original draft: Roman Kalyakin. Conceptualization: Roman Kalyakin, Marten Düring. Software: Roman Kalyakin. Writing - Review & Editing: Caio Mello. Validation: Marten Düring, Ferdaous Affan, Kirill Veprikov, Cao Vy. Datalab editorial board: Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout, Cao Vy. Data curation & Formal analysis: Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. Methodology: Roman Kalyakin. Supervision: Marten Düring. Funding acquisition: Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

This notebook is published under CC BY 4.0 License

For feedback on this notebook, please send an email to info@impresso-project.ch

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

License

All Impresso code is published open source under the GNU Affero General Public License v3 or later.
