Historical newspapers are mirrors of past societies. Published over centuries on a regular basis, they record wars and minor events, report on international, national and local matters, and document day-to-day life. In a nutshell, they keep track of history at every level, and the wealth of information they offer as well as their inherent contextualisation makes them invaluable primary sources for historians.
After long remaining on library and archive shelves, newspapers are now undergoing mass digitisation, and millions of facsimiles, along with their machine-readable content acquired via Optical Character Recognition, are becoming accessible via a variety of online portals. While this represents a major step forward in terms of preservation of and access to documents, conducting research using these sources raises a number of problems, including a lack of text searchability as a result of poor text recognition and missing metadata, the relative isolation of digitised newspapers within their respective archives, search functions that are difficult to use, and poorly designed user interfaces.
However, recent progress in text analysis has also opened up new possibilities for conducting research on historical text collections. Opportunities include enhanced analysis capacities, with the possibility of automatically exploring the content of newspapers with an unprecedented combination of speed, depth and volume; a wider scope, with the ability to conduct comprehensive studies by comparing and contrasting viewpoints; and greater continuity, with the option of considering the entire lifespan of newspapers or collections of newspapers in a single study. This project aims to explore the possibilities raised by these new techniques.
impresso aims to open up new ways of engaging with digital historical newspaper content and pave the way for new approaches to historical questions.
To this end, the project will pursue the following main objectives:
Our first objective is to develop multilingual and time-specific text mining techniques to transform noisy and unstructured textual content into semantically indexed, structured and linked data. More specifically, we aim to build a series of NLP components for OCR post-correction, n-gram indexing, distributional semantic indexing, named entity processing, and text categorisation and clustering.
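To make one of these components concrete, here is a minimal sketch of time-specific n-gram indexing. It is an illustration only, not the project's implementation: the function names (`tokenize`, `ngrams`, `build_index`), the regex tokeniser and the sample articles are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, illustrative tokeniser)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(articles, n=2):
    """Count n-grams per publication year.

    `articles` is an iterable of (year, text) pairs; this input shape is
    hypothetical and chosen only to keep the example small.
    """
    index = defaultdict(Counter)
    for year, text in articles:
        index[year].update(ngrams(tokenize(text), n))
    return index


# Invented sample data, one French and one German snippet.
articles = [
    (1914, "La guerre est déclarée. La guerre commence."),
    (1918, "Der Krieg ist vorbei."),
]
index = build_index(articles, n=2)
print(index[1914].most_common(1))
```

Keeping one counter per year is what makes the index time-specific: the frequency of an expression can then be compared across periods. Real OCR output would need noise handling that this sketch leaves out.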
The performance of these tools as applied to historical texts will be systematically assessed through traditional summative evaluation using ground-truth references, and through formative evaluation with the collaboration of historians. This evaluation initiative will also be extended to other systems via the creation of a shared task.
Finally, structured data will be stored in a fully traceable and interoperable historical semantic knowledge base. Interoperability and transparency will support historians’ investigations and allow our partner libraries to reintegrate our annotations into their production systems, thereby ensuring the sustainability of our approach.
More information on computational linguistics developments.
Our second objective is the development of a visualisation interface to enable the seamless exploration of vast amounts of complex historical data. This interface will include common search functionalities such as keyword and faceted search but also, more importantly, novel functionalities to accommodate text analysis research tools and empower users to approach the system in a reflective way.
The interface will give access to different entry points into document collections, represent biases introduced by tools and gaps in document collections, offer contrasted views of aggregated material, provide provenance and reliability information for annotations, and enable user-driven data curation.
Interface designers and developers will design and test the system in close cooperation with historians and computational linguists according to the principles of co-design. This close cooperation will ensure that quality standards in each discipline are met, that the desired goals are achieved, that new opportunities are identified and that problems are spotted and solved early in the development process. This approach also ensures that the system will have an adequate learning curve for end users.
More information on design and visualisation interface developments.
As a necessary complement to the first two objectives dedicated to digital implementation aspects, our third objective relates to the usage and impact of the developed tools.
First, developing computational systems to support historical research raises numerous methodological and epistemological questions. In this regard, historians will consider the question of source criticism, examining the biases of “digital blind spots” introduced by digitisation, exploring the problem of considering digital and non-digital material together in a combined approach, and investigating the difficulty of source contextualisation in a digital landscape where spatial and institutional boundaries are less prominent. We will also consider the question of digital scholarship, or how (digital) history needs to devise new ways of using what computer science can offer, as well as the pressing need to train future digital historians based on the direct usage of impresso tools in the classroom.
Second, the strengths and weaknesses of new tools and methods are best evaluated when they are put to productive use. Our dedicated historical use case will be Resistance against Europe. This research will benefit from the text analysis and visualisation components of the project and provide ongoing feedback on their viability for historical research.
More information on digital history research.
Our efforts will primarily focus on the digital archives of French and German newspapers in the 19th and (early) 20th centuries provided by our associated partners.
The core set of sources is currently under development; more information on this will be posted in early 2018. In addition to our Swiss-Luxembourgish core set, we intend to incorporate other sources (in German, French, English and possibly Italian) from other institutions.
WP1: Project Management
WP2: System Design and Data Management
WP3: Natural Language Processing and Text Mining
WP4: Annotation and Benchmarking
WP6: Digital History Methodology and Investigations
WP7: Dissemination and Exploitation