The Impresso Datalab: Programmatic access to our data and models
One of our goals for Impresso2 is to make it easier to conduct data-driven research on historical media collections. Today we are excited to announce the release of the Impresso Datalab, a significant milestone in the Impresso2 project.
The Impresso Datalab complements the exploratory capacities of the Impresso Web App by allowing programmatic interactions with our data. The Datalab offers access to bibliographic metadata, semantic enrichments and full text via the Impresso API and a dedicated Python library.
Impresso Datalab
Key features
With this release, we offer:
Programmatic access to our data
Initialising an Impresso Client
The Impresso REST API and the Impresso Python library provide access to full text, bibliographic metadata, and semantic enrichments, in compliance with the legal frameworks and institutional constraints of our partners.
Notebooks for data exploration
Notebook on Visualising Place Entities on Maps
Notebook on Exploring Entity Co-occurrence Networks
Notebook templates are designed to complement the exploratory capacities of the Web App. With this first release we offer geospatial mapping of the location entities contained in a query or collection, as well as relational perspectives on entity co-occurrences by means of network visualisations.
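To make the idea of an entity co-occurrence network more concrete, here is a minimal sketch of how such a network can be derived from per-article entity lists. It is not taken from the Impresso notebooks: the sample records, the `articles` variable and the `cooccurrence_edges` helper are assumptions made up for illustration, and the code only uses the Python standard library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each inner list holds the entities mentioned in one article.
# These records are invented for illustration; they are not Impresso API output.
articles = [
    ["Geneva", "League of Nations", "Paris"],
    ["Paris", "Geneva"],
    ["Paris", "League of Nations"],
    ["Lausanne"],
]


def cooccurrence_edges(entity_lists):
    """Count how often each unordered pair of entities appears in the same article."""
    edges = Counter()
    for entities in entity_lists:
        # Deduplicate and sort so that (A, B) and (B, A) count as the same edge.
        unique = sorted(set(entities))
        edges.update(combinations(unique, 2))
    return edges


edges = cooccurrence_edges(articles)
for (source, target), weight in edges.most_common():
    print(f"{source} -- {target}: {weight}")
```

Each counted pair can be read as a weighted edge between two entity nodes. An edge list of this kind is the usual input for a graph library or visualisation tool; an article with a single entity, such as the last one here, contributes no edges.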
Models and Annotation services to enrich your own data
Notebooks for enriching your own data
Example from notebook on Language Identification with impresso-pipelines Package
Researchers can semantically enrich their own data using Impresso’s specialized models (also available on HuggingFace) and ready-to-use pipelines specifically optimized for historical newspaper text analysis. At this stage, we offer a BERT model for the recognition of European press agencies and pipelines for language identification, topic modelling, named entity recognition and OCR quality assessment.
Close Integration Web App & Datalab
Try in Datalab feature in Impresso Web App
Example of results linking back to Impresso Web App
We strive for seamless, question-driven workflows between both interfaces for scalable reading and versatile exploration. For instance, you can easily export your Impresso Web App query to a Datalab notebook for in-depth analysis, then return to the Web App for detailed examination of specific texts. For convenience, all notebooks can be run on Google Colab or locally, whichever you prefer.
Getting started
- Create a free Impresso account (if you do not have one already) and subscribe to one of the Impresso plans
- Get an API key and familiarise yourself with the Impresso Python library, which lets you search and download data through our API
- Experiment with our first notebooks and generate network and spatial views on Impresso data
- Enrich your own data using our pipelines and models for named entity recognition, press agency detection, language identification and OCR quality assessment
Note that this is only the beginning - the Datalab will remain in constant development throughout the Impresso project. More notebooks to support teaching, critical data exploration and data annotation will follow!
To enter the Datalab, log in with your Impresso account or register for one, then accept our revised Terms of Use and request your API key.
We appreciate any feedback on its usage and welcome proposals for additional notebooks via info (at) impresso-project.ch.
Impresso Corpus Expansion
We are pleased to announce a first step towards Impresso’s goal of creating a corpus of Western European newspapers and radio sources: a first batch of newspapers from the National Library of France (BnF) has arrived (see below for the first titles we include). In addition, we have added long-awaited titles to our Swiss newspaper collection. These include “Schweizer Arbeitgeber” and “Schweizerische Handels-Zeitung” from the Swiss Economic Archives, a total of 43 titles from the regional collections of the Bibliothèque Cantonale Universitaire de Lausanne (BCUL), and the German and French editions of the Swiss Federal Gazette (also known as Bundesblatt or Feuille fédérale), provided by the Swiss Federal Archives (SFA). The Gazette is a rich source on Swiss political and legislative decision-making.
In total, this release adds 53 new newspaper titles, more than 180,000 issues and almost 11 million new content items, such as articles or adverts.
Explore new additions to our corpus from the following partners:
Impresso partners
From France, this first batch includes the following titles:
Front page from Le Petit Parisien
New User Access Management System
Alongside the Datalab, we are introducing a new content access management system. Its new user plans reflect the legal frameworks within which our data-providing partners operate, as described in our Terms of Use. Behind the scenes, the system allows us to grant fine-grained access at the level of individual content items such as newspaper articles or radio broadcasts.
Impresso User plans
We distinguish between:
- Guest users
- Basic users
- Student users
- Academic users
- Academic+ users (forthcoming)
To qualify for an Academic user account, please provide a link to your academic profile and allow 2 working days for verification. With the forthcoming Academic+ user plan, we present an innovation for accessing protected data: users who wish to access data in this category will soon be able to send a request to the corresponding data provider and gain access once the provider has approved it.
Please refer to this overview of the currently available data and the mapping of permitted actions according to user plans.
How to request a change of plan
Note: Existing Impresso users are mapped to the Basic user account by default. If applicable, please upgrade to the Academic plan using the method described above.
Connect with us
The Impresso project has chosen to retire from X, for obvious reasons. We are now active on Bluesky instead. Follow us to stay updated on the latest developments, events, and insights from the Impresso project. We also have a Discord server where you can report issues you encounter with the Web App or Datalab, or discuss other Impresso-related topics.
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
A Creative Commons Attribution-NoDerivatives 4.0 (CC BY-ND 4.0) license applies
to all contents published in impresso. While articles published on impresso can
be copied by anyone for noncommercial purposes if proper credit is given,
all materials are published under an open-access license with authors retaining
full and permanent ownership of their work.