Glossary

OCR

OCR (optical character recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.

In OCR processing, the scanned-in image or bitmap is analyzed for light and dark areas in order to identify each alphabetic letter or numeric digit. When a character is recognized, it is converted into an ASCII code. Special circuit boards and computer chips designed expressly for OCR are used to speed up the recognition process.
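The light-and-dark analysis described above can be illustrated with a toy sketch. It assumes a tiny 3x5 bitmap per character and a hand-made template table; `TEMPLATES`, `to_bits`, and `recognize` are hypothetical names chosen for this illustration, and production OCR engines use far more robust techniques than this nearest-template comparison.

```python
# Toy illustration only: compare a scanned bitmap against known templates
# and report the closest match as a character code.
# '#' marks a dark pixel, '.' marks a light pixel.

TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "7": ["###",
          "..#",
          ".#.",
          ".#.",
          ".#."],
}


def to_bits(rows):
    """Flatten rows of '#' and '.' into a list of 1 (dark) / 0 (light)."""
    return [1 if ch == "#" else 0 for row in rows for ch in row]


def recognize(bitmap):
    """Return the template character whose pixels differ least from bitmap."""
    pixels = to_bits(bitmap)

    def distance(char):
        # Count the pixel positions where scan and template disagree
        return sum(a != b for a, b in zip(pixels, to_bits(TEMPLATES[char])))

    return min(TEMPLATES, key=distance)


# A scanned "L" with one noisy dark pixel in the middle row
scanned = ["#..",
           "#..",
           "#.#",
           "#..",
           "###"]

char = recognize(scanned)
print(char, ord(char))  # prints: L 76
```

The noisy scan still maps to "L" because only one pixel disagrees with that template, and `ord` then yields the ASCII code mentioned in the text.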

Libraries use OCR to digitize and preserve their holdings. OCR is also used to process checks and credit card slips and to sort the mail. Billions of magazines and letters are sorted every day by OCR machines, considerably speeding up mail delivery.



You may also be interested in the following terms:

ALTO

The Analyzed Layout and Text Object (ALTO) is an open XML standard for representing the layout and text content of OCR-recognized documents.
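As a rough illustration of what an ALTO file holds, the sketch below reads recognized words back out of a small hand-written fragment. The fragment is a simplified assumption (it is not validated against the schema and omits most attributes), and `alto_lines` and `local_name` are hypothetical helper names. It relies on the ALTO convention that each recognized word is a `String` element with a `CONTENT` attribute, nested inside a `TextLine`.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written ALTO-style fragment (assumption: not schema-valid,
# coordinates are made up for illustration).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Optical" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="40"/>
            <SP/>
            <String CONTENT="character" HPOS="270" VPOS="200" WIDTH="200" HEIGHT="40"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="recognition" HPOS="100" VPOS="250" WIDTH="240" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine as a space-joined string of words."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognized word is a String child with a CONTENT attribute
        words = [child.get("CONTENT", "")
                 for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE))  # prints: ['Optical character', 'recognition']
```

Matching on the local tag name rather than a fixed namespace is a design choice made here so that the same sketch works for files declaring different ALTO schema versions.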

IIIF

The International Image Interoperability Framework (IIIF) defines several application programming interfaces that provide a standardised method of describing and delivering images over the web, as well as “presentation based metadata” (that is, structural metadata) about structured sequences of images.

OLR

OLR (optical layout recognition) is also known as document layout analysis. In computer vision, it is the process of identifying and categorizing the regions of interest in the scanned image of a text document.

Topic model

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.