Glossary

« Oh, that's what you mean! » During our discussions, we realised that we often use terms that others do not know or understand differently. We therefore decided to gradually collect, across domains, terms related to the production, processing and exploitation of digital historical media sources. We hope that the resulting glossary will facilitate mutual understanding in interdisciplinary settings around historical print media, and beyond. If you wish to suggest a correction or an addition, do not hesitate to email us.

ALTO

The Analyzed Layout and Text Object (ALTO) is an open XML standard for representing the layout and text content of OCR-recognised documents.
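To make this more concrete, here is a minimal sketch of reading words and their page coordinates from an ALTO file with Python's standard library. The sample XML is invented and heavily simplified (it has not been validated against the schema), the function name `extract_words` is our own, and the namespace shown is the one for ALTO version 3; other versions use a different namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 3 (assumption: your files may use another version).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Invented, simplified sample for illustration only.
SAMPLE_ALTO = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Gazette" HPOS="100" VPOS="200" WIDTH="320" HEIGHT="60" WC="0.97"/>
            <SP WIDTH="20" HPOS="420" VPOS="200"/>
            <String CONTENT="de" HPOS="440" VPOS="200" WIDTH="80" HEIGHT="60" WC="0.99"/>
            <SP WIDTH="20" HPOS="520" VPOS="200"/>
            <String CONTENT="Lausanne" HPOS="540" VPOS="200" WIDTH="400" HEIGHT="60" WC="0.88"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_words(alto_xml):
    """Return one dict per <String> element: its text, bounding box and OCR confidence."""
    root = ET.fromstring(alto_xml)
    words = []
    for string in root.iterfind(".//alto:String", ALTO_NS):
        words.append({
            "text": string.get("CONTENT"),
            "x": float(string.get("HPOS")),       # horizontal position on the page
            "y": float(string.get("VPOS")),       # vertical position on the page
            "width": float(string.get("WIDTH")),
            "height": float(string.get("HEIGHT")),
            "confidence": float(string.get("WC")),  # word confidence, between 0 and 1
        })
    return words


words = extract_words(SAMPLE_ALTO)
line_text = " ".join(word["text"] for word in words)
```

Each word keeps its position on the page, which is what allows interfaces to highlight search hits directly on the scanned image.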

IIIF

The International Image Interoperability Framework (IIIF) defines several application programming interfaces that provide a standardised method of describing and delivering images over the web, as well as “presentation based metadata” (that is, structural metadata) about structured sequences of images.
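As an illustration of the image-delivery part, the IIIF Image API addresses an image (or a part of it) through a URL of the form `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds such URLs; the server address `https://example.org/iiif` and the identifier `page/0001` are placeholders, and the helper names are our own.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier must be URL-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def region_from_box(x, y, width, height):
    """Express a pixel bounding box as an IIIF region segment 'x,y,w,h'."""
    return f"{x},{y},{width},{height}"


# Placeholder server and identifier, for illustration only.
BASE = "https://example.org/iiif"

# The whole page at the largest size the server offers.
full_page = iiif_image_url(BASE, "page/0001")

# A crop of one word, scaled to 300 pixels wide (height follows the aspect ratio).
word_crop = iiif_image_url(BASE, "page/0001",
                           region=region_from_box(100, 200, 320, 60),
                           size="300,")
```

Because the region is part of the URL, a bounding box taken from an OCR output file can be turned into an image crop without downloading the full page.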

OCR

OCR (optical character recognition) is the recognition of printed or handwritten text characters by a computer. This involves scanning the text character by character, analysing the scanned image, and translating each character image into a character code, such as ASCII, commonly used in data processing.
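The last two steps (analysing a character image and translating it into a character code) can be illustrated with a deliberately naive toy: each scanned glyph is compared with a few stored templates, and the closest match is emitted as a character code. This is only a sketch for intuition; the 3×5 bitmaps are invented and real OCR engines use far more sophisticated methods.

```python
# Invented 3x5 bitmap templates ('#' = ink, '.' = paper) for three letters.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(glyph, template):
    """Count the pixels in which a scanned glyph differs from a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda char: pixel_distance(glyph, TEMPLATES[char]))


# A "scanned" word of three glyphs; the first and last contain one noisy pixel.
scanned = [
    ["#..", "#..", "#.#", "#..", "###"],
    ["###", ".#.", ".#.", ".#.", "###"],
    ["###", ".#.", ".#.", "##.", ".#."],
]

text = "".join(recognise(glyph) for glyph in scanned)
codes = [ord(char) for char in text]  # character codes (identical to ASCII here)
print(text, codes)  # prints: LIT [76, 73, 84]
```

The noisy pixels show why OCR output on historical material often contains errors: a few more damaged pixels and the closest template would be the wrong letter.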

OLR

OLR (optical layout recognition), known in computer vision as document layout analysis, is the process of identifying and categorising the regions of interest in the scanned image of a text document.

Topic model

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.
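For readers who want to see what such a model does, here is a minimal sketch of one common topic model, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling in plain Python. LDA is our choice for illustration; the definition above does not name a specific model. The four tiny "documents", the function names and the parameter values (`alpha`, `beta`, number of iterations) are all assumptions, and in practice one would use an established library and a much larger corpus.

```python
import random


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Each topic is a probability distribution over the vocabulary.
    topics = [
        {word: (topic_word[t][word_id[word]] + beta) / (topic_total[t] + vocab_size * beta)
         for word in vocab}
        for t in range(num_topics)
    ]
    return topics, doc_topic


def top_words(topic, n=3):
    """Return the n most probable words of a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


# Invented toy corpus, already tokenised.
docs = [
    ["train", "railway", "station", "train"],
    ["railway", "strike", "train"],
    ["wheat", "harvest", "price"],
    ["harvest", "wheat", "market", "price"],
]
topics, doc_topic = fit_lda(docs)
for t, topic in enumerate(topics):
    print(t, top_words(topic))
```

A topic here is simply a list of words with probabilities; interpreting it as "railways" or "agriculture" remains the reader's task, and results vary with the random seed and the chosen number of topics.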