The following objectives served to guide the project between 2017 and 2020.
Historical newspapers are mirrors of past societies. Published over centuries on a regular basis, they record wars and minor events, report on international, national and local matters, and document day-to-day life. In a nutshell, they keep track of history at every level, and the wealth of information they offer as well as their inherent contextualisation makes them invaluable primary sources for historians.
After long remaining on library and archive shelves, newspapers are now undergoing mass digitisation, and millions of facsimiles, along with their machine-readable content acquired via Optical Character Recognition, are becoming accessible via a variety of online portals. While this represents a major step forward in terms of preservation of and access to documents, conducting research using these sources raises a number of problems, including a lack of text searchability as a result of poor text recognition and missing metadata, the relative isolation of digitised newspapers within their respective archives, search functions that are difficult to use, and poorly designed user interfaces.
However, recent progress in text analysis has also opened up new possibilities for conducting research on historical text collections. Opportunities include enhanced analysis capacities, with the possibility of automatically exploring the content of newspapers with an unprecedented combination of speed, depth and volume; a wider scope, with the ability to conduct comprehensive studies by comparing and contrasting viewpoints; and greater continuity, with the option of considering the entire lifespan of newspapers or collections of newspapers in a single study. This project explored the possibilities raised by these new techniques.
impresso opened up new ways of engaging with digital historical newspaper content and paved the way for new approaches to address historical questions.
To this end, the project addressed the following main objectives:
Our first objective was to develop multilingual and time-specific text mining techniques to transform noisy and unstructured textual content into semantically indexed, structured and linked data. More specifically, we built a series of NLP components for OCR post-correction, n-gram indexing, distributional semantic indexing, named entity processing, and text categorisation and clustering.
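As an illustration of what one of these components does, n-gram indexing can be sketched in a few lines of Python. This is a minimal sketch and not the project's implementation: the function names, the whitespace tokenisation and the toy articles are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(documents, n=2):
    """Map each n-gram to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: naive lowercasing and whitespace tokenisation.
        # Real OCR output would need far more careful normalisation.
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            index[gram].add(doc_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical toy articles, invented for this example.
articles = {
    "a1": "the federal council met in Bern",
    "a2": "the federal council rejected the proposal",
    "a3": "a meeting in Bern was postponed",
}

index = build_ngram_index(articles, n=2)
print(index[("federal", "council")])
print(index[("in", "bern")])
```

Looking up a bigram returns the identifiers of the articles in which it occurs, which is the basic operation behind phrase search and frequency-over-time views.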
The performance of these tools as applied to historical texts was systematically assessed through traditional summative evaluation using ground-truth references, and through formative evaluation with the collaboration of historians. This evaluation initiative was also extended to other systems via the organisation of a shared task (CLEF HIPE 2020).
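Summative evaluation against ground-truth references typically comes down to comparing system output with a manually annotated gold standard. The sketch below computes precision, recall and F1 for named entity annotations under exact matching of span and type. It is an assumption-laden illustration, not the scorer used in the shared task: the function name, the tuple format `(start, end, type)` and the sample annotations are all hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted annotations against gold ones using exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start token, end token, entity type).
gold = {(0, 2, "PERS"), (5, 6, "LOC"), (9, 11, "ORG")}
pred = {(0, 2, "PERS"), (5, 6, "ORG"), (9, 11, "ORG"), (13, 14, "LOC")}

precision, recall, f1 = precision_recall_f1(pred, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four predictions match the gold standard exactly, so precision is 0.50, recall is 0.67 and F1 is 0.57. A prediction with the right span but the wrong type counts as an error under this strict setting; more lenient schemes that credit overlapping spans are also common.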
Finally, structured data was stored in a fully traceable and interoperable historical semantic knowledge base. Interoperability and transparency support historians’ investigations and allow our partner libraries to reintegrate our annotations into their production systems, thereby ensuring the sustainability of our approach.
More information on developments in computational linguistics.
Our second objective was the development of a visualisation interface to enable the seamless exploration of vast amounts of complex historical data. This interface includes common search functionalities such as keyword and faceted search but also, more importantly, novel functionalities to accommodate text analysis research tools and empower users to approach the system in a reflective way.
The interface gives access to different entry points into document collections, represents biases introduced by tools and gaps in document collections, offers contrasted views of aggregated material, provides provenance and reliability information for annotations, and enables user-driven data curation.
Interface designers and developers designed the system in close cooperation with historians and computational linguists according to the principles of co-design. This close cooperation ensured that quality standards in each discipline were met, that the desired goals were achieved, that new opportunities were identified and that problems were spotted and solved early in the development process. This approach also ensured that the application has an adequate learning curve for end users.
More information on design and visualisation interface developments.
As a necessary complement to the first two objectives dedicated to digital implementation aspects, our third objective related to the usage and impact of the developed tools.
First, developing computational systems to support historical research raised numerous methodological and epistemological questions. In this regard, historians began by considering the question of source criticism, examining the biases of “digital blind spots” introduced by digitisation, exploring the problem of considering digital and non-digital material together in a combined approach, and investigating the difficulty of source contextualisation in a digital landscape where spatial and institutional boundaries are less prominent. We also considered the question of digital scholarship, or how (digital) history needs to devise new ways of using what computer science can offer, as well as the pressing need to train future digital historians based on the direct usage of impresso tools in the classroom.
Second, the strengths and weaknesses of new tools and methods are best evaluated when they are put to productive use. Our dedicated historical use case was Resistance against Europe. This research benefited from the text analysis and visualisation components of the project and provided ongoing feedback on their viability for historical research.
More information on digital history research.