The project produced a historical media monitoring tool suite that bridges the semantic gap between huge volumes of scanned text and the humanities scholars who seek to understand and interpret their content.
The main objective of the tool suite is to overcome the current limitations of state-of-the-art keyword-based methods and to enable new search and discovery capabilities. The suite is composed of a series of natural language processing (NLP) components that process historical print media texts and store the extracted information in a knowledge base (KB), along with the original texts and facsimiles. As well as being open source, the suite is modular, in order to meet the needs of different user groups, and interoperable, so that it can be integrated into third-party tools or frameworks. More specifically, the text mining tools allow users to search and filter in the lexical and semantic spaces of single words and multi-word expressions, in the referential space of entities, and in the conceptual space of topics/categories of pages and documents, while taking into account the temporal dimension that cuts across all these levels and the different languages present in our collection.
At the lexical-semantic level, the tool suite moves beyond traditional n-gram counts, enabling users to examine how a word's meaning has changed through time and to investigate how a concept is expressed in different languages, while also suggesting relevant synonyms and variants. At the entity level, the tool suite can be used to look for a specific person, place or institution together with its contexts and associated information, as well as to explore aggregated views of such entities, for example how often they occur in texts at a specific time and with which other entities they are regularly mentioned. At the conceptual level, the objective is to enable the exploration of article topics, namely their definition and distribution over different sources and through time. All three levels are jointly leveraged through a faceted search function. Finally, we also provide recommendations, i.e. the ability to look for semantically related items (words, entities or articles).
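The recommendation idea above boils down to nearest-neighbour search over vector representations. A minimal sketch, using hypothetical toy vectors in place of the real embeddings of words, entities or articles:

```python
from math import sqrt

# Hypothetical toy vectors standing in for learned embeddings
# of words, entities or articles (names are illustrative only).
VECTORS = {
    "parliament": [0.9, 0.1, 0.2],
    "government": [0.8, 0.2, 0.3],
    "railway":    [0.1, 0.9, 0.4],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recommend(query, k=2):
    """Return the k items most similar to `query`, excluding itself."""
    q = VECTORS[query]
    ranked = sorted(
        ((cosine(q, v), name) for name, v in VECTORS.items() if name != query),
        reverse=True,
    )
    return [name for _, name in ranked[:k]]
```

In practice the same routine works regardless of whether the vectors represent words, entities or whole articles, which is what makes a single "related items" function possible across all three levels.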
Work on natural language processing for historical newspapers included the following:
automatic OCR and OLR improvement and text corpus creation – Beyond establishing a structured representation of the corpus material (OCR content, article segmentation), we experimented with OCR post-correction techniques based on character-based machine translation approaches. Given the particular problems of legacy OCR output for texts typeset in Gothic fonts, and the recent progress of neural OCR methods, we focused on building our own Gothic recognition model for newspapers and on improving OCR quality by re-OCRing the facsimiles.
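As a much-simplified stand-in for the character-based post-correction idea, the sketch below uses a hand-written table of character confusions plus a lexicon check; the confusion pairs and lexicon entries are invented for illustration, and a real character-based MT system would learn such mappings from aligned noisy/clean text:

```python
# Hypothetical confusion pairs typical of Gothic-font OCR (e.g. long s "ſ"
# misread, "rn" fused into "m"); a trained model would induce these.
CONFUSIONS = {"ſ": "s", "rn": "m", "vv": "w"}
LEXICON = {"storm", "design", "modern"}  # toy target vocabulary

def correct(token):
    """Generate candidates by applying confusion substitutions,
    then keep only candidates attested in the lexicon."""
    candidates = {token}
    for wrong, right in CONFUSIONS.items():
        candidates |= {c.replace(wrong, right) for c in list(candidates)}
    hits = sorted(c for c in candidates if c in LEXICON)
    return hits[0] if hits else token
```

The lexicon check is what keeps aggressive substitutions (such as "rn" → "m") from damaging tokens that were already correct.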
lexical processing – We applied part-of-speech tagging and lemmatisation to the data, using supervised learning of lemmatisation to transfer inflection patterns from contemporary to older vocabulary.
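The transfer of inflection patterns can be illustrated with a minimal suffix-rule learner: from (form, lemma) training pairs it derives suffix rewrite rules and applies them to unseen forms. This is a toy sketch with invented training pairs, not the project's actual lemmatiser:

```python
from collections import Counter

# Toy (inflected form, lemma) pairs; real training data would come
# from a contemporary annotated corpus.
TRAIN = [("walked", "walk"), ("talked", "talk"), ("houses", "house"), ("horses", "horse")]

def suffix_rule(form, lemma):
    """Derive a (form-suffix, lemma-suffix) rewrite from one pair."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (form[i:], lemma[i:])

def learn_rules(pairs):
    """Keep, for each form-suffix, its most frequent lemma-suffix rewrite."""
    counts = Counter(suffix_rule(f, l) for f, l in pairs)
    rules = {}
    for (fs, ls), _ in sorted(counts.items(), key=lambda kv: kv[1]):
        rules[fs] = ls  # later (more frequent) entries overwrite earlier ones
    return rules

def lemmatise(form, rules):
    """Apply the longest matching suffix rule, or return the form unchanged."""
    for k in sorted(rules, key=len, reverse=True):
        if k and form.endswith(k):
            return form[: len(form) - len(k)] + rules[k]
    return form
```

Because the rules operate on suffixes only, they generalise from contemporary training pairs to older spellings that share the same inflection endings.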
word alignment for domain-specific cross-language semantic similarity – In this regard, we applied statistical word alignment techniques and made use of bilingual word embeddings to detect correspondences between words across languages.
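One standard way to obtain bilingual embeddings is to map two monolingual embedding spaces into one another with an orthogonal Procrustes rotation learned from a small seed dictionary. A minimal sketch with invented two-dimensional toy vectors (real embeddings are high-dimensional and trained on the corpus):

```python
import numpy as np

# Hypothetical toy monolingual embeddings for German and French.
DE = {"haus": [1.0, 0.0], "hund": [0.0, 1.0], "katze": [0.1, 0.9]}
FR = {"maison": [0.0, 1.0], "chien": [1.0, 0.0], "chat": [0.9, 0.1]}
SEED = [("haus", "maison"), ("hund", "chien")]  # seed translation dictionary

X = np.array([DE[s] for s, _ in SEED])
Y = np.array([FR[t] for _, t in SEED])
# Orthogonal Procrustes: W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def translate(word):
    """Map a German vector into French space; return the nearest French word."""
    v = np.array(DE[word]) @ W
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(FR, key=lambda f: cos(v, np.array(FR[f])))
```

The orthogonality constraint preserves distances within each language, so monolingual neighbourhoods survive the mapping.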
semantically deepened n-grams – We support our users by suggesting semantically similar expressions that match their initial search terms. To this end, we apply methods based on word embeddings and develop diachronic word embeddings in order to track semantic shifts in words. The n-gram viewer allows the user to look at the occurrences of a lexical item over time and export this data in a simple JSON format. Any of the available search facets (newspapers, topics, named entities, etc.) can be applied to restrict the text material used for the statistics.
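The counting-and-export part of the n-gram viewer can be sketched as follows; the article triples and facet name are invented for illustration, and the real viewer reads from the knowledge base:

```python
import json
from collections import Counter

# Toy (year, newspaper, text) triples standing in for indexed articles.
ARTICLES = [
    (1900, "GDL", "the railway opened"),
    (1900, "JDG", "railway railway news"),
    (1910, "GDL", "the railway closed"),
]

def ngram_counts(term, newspaper=None):
    """Yearly occurrence counts of `term`, optionally restricted to a facet."""
    counts = Counter()
    for year, paper, text in ARTICLES:
        if newspaper and paper != newspaper:
            continue
        counts[year] += text.split().count(term)
    return dict(counts)

def export_json(term, **facets):
    """Serialise the counts in a simple JSON format, as the viewer does."""
    return json.dumps({"term": term, "counts": ngram_counts(term, **facets)},
                      sort_keys=True)
```

Restricting by a facet before counting, rather than filtering afterwards, is what lets the same statistics machinery serve any combination of facets.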
key phrases – We experiment with automatic keyphrase extraction in order to offer the smallest meaningful terms that describe a content item. Our keyphrase extraction impresso LAB illustrates the outcome of the techniques we applied.
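A common baseline for keyphrase extraction is to rank a document's terms by TF-IDF against the rest of the collection; the sketch below uses that baseline with a toy corpus, whereas the pipeline behind the impresso LAB is more elaborate:

```python
import math
from collections import Counter

# Toy document collection; real input would be OCRed articles.
DOCS = [
    "the federal council met in bern",
    "the railway line to bern opened",
    "the council discussed the railway",
]

def keyphrases(doc_index, k=2):
    """Rank the single-word terms of one document by TF-IDF."""
    tokenised = [d.split() for d in DOCS]
    n = len(DOCS)
    # document frequency: in how many documents each word occurs
    df = Counter(w for doc in tokenised for w in set(doc))
    tf = Counter(tokenised[doc_index])
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

The IDF factor is what pushes ubiquitous function words like "the" to the bottom of the ranking without any stopword list.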
text categorisation and topic modelling – An interesting research question we want to address is how text categorisation models trained on contemporary gold standard material can be transferred and adapted, e.g. by going back in time through the longitudinal text material of our collection.
named entity processing – We recognised, classified and linked entities of type Person, Place and Organisation (mainly of an administrative nature).
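To make the categorisation setting concrete, here is a minimal multinomial Naive Bayes classifier trained on invented "contemporary" examples and then applied to new text; it is a generic baseline, not the model the project used:

```python
import math
from collections import Counter, defaultdict

# Toy labelled examples standing in for the contemporary gold standard.
TRAIN = [
    ("sports", "the match ended in a draw"),
    ("sports", "the team won the match"),
    ("politics", "the council passed a new law"),
    ("politics", "elections for the council"),
]

def train(examples):
    """Collect per-label word counts and label frequencies."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Multinomial Naive Bayes with add-one smoothing."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for label, lc in label_counts.items():
        total = sum(word_counts[label].values())
        lp = math.log(lc / sum(label_counts.values()))
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The transfer question then becomes how well such a model, trained on contemporary vocabulary, degrades as the input moves back in time, and how to adapt it when it does.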
Additionally, the performance of OCR correction, NE processing and text categorisation was systematically evaluated. We produced gold standards for each of these tasks, which were also used in HIPE, a shared task on NE processing.
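NE processing is typically scored with strict entity-level precision, recall and F1 against such a gold standard, as in HIPE. A minimal sketch over (span, type) tuples, with invented example annotations:

```python
def prf(gold, pred):
    """Strict entity-level precision, recall and F1 over (span, type) tuples:
    an entity counts as correct only if both span and type match exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy annotations: token spans with entity types.
gold = [((0, 2), "PER"), ((5, 6), "LOC")]
pred = [((0, 2), "PER"), ((5, 6), "ORG")]  # second entity mistyped
```

Under strict matching the mistyped second entity counts as both a false positive and a false negative, which is why type confusions are penalised twice as hard as pure misses.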
Finally, in terms of data management and knowledge representation, our requirements were: a) a high level of interoperability of data and tools with regard to formats and models, b) modularity, c) the possibility to trace provenance, and d) the possibility to contextualise information in relation to its original sources.