How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

Abstract

In the last decade, progress in OCR has triggered a massive trend towards the digitisation of legacy documents, with several Digital Humanities projects exploring means of structuring retro-digitised dictionaries. However, there is a lack of awareness of how OCR quality affects the information extraction process. In this work, we shed light on the relationship between these two steps through experiments carried out with a TEI-based system for the automatic parsing of dictionaries.

Publication
19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond
Pedro Javier Ortiz Suárez
PhD student

I am a PhD student in computer science at Sorbonne Université and in the ALMAnaCH research team at Inria.