Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches

Jean-Baptiste Camps; Chahan Vidal-Gorène; Marguerite Vernet

doi:10.1007/978-3-030-86159-9_21

Conference paper. Year: 2021

Jean-Baptiste Camps

Role: Author
PersonId: 93
IdHAL: jbcamps
ORCID: 0000-0003-0385-7037
IdRef: 142859419

Centre Jean Mabillon

Chahan Vidal-Gorène

Role: Author
PersonId: 743619
IdHAL: chahan-vidal-gorene
ORCID: 0000-0003-1567-6508

Centre Jean Mabillon

Marguerite Vernet

Role: Author

École nationale des chartes

Abstract

Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.

Keywords

Abbreviations; Handwritten Text Recognition; Medieval Western Manuscripts; Paleography; Computational methods

Domains

History; Linguistics

Main file

IWCP2021_Handling_Abreviations_ArXiv.pdf (3.99 MB)

Origin: Files produced by the author(s)

Contributor: Jean-Baptiste Camps

https://enc.hal.science/hal-03279602

Submitted on: Tuesday, 6 July 2021, 15:49:06

Last modified on: Wednesday, 17 November 2021, 12:33:09

Long-term archiving on: Thursday, 7 October 2021, 18:55:19

Dates and versions

hal-03279602, version 1 (06-07-2021)

Licence

Attribution - ShareAlike

Identifiers

HAL Id: hal-03279602, version 1
DOI: 10.1007/978-3-030-86159-9_21

Cite

Jean-Baptiste Camps, Chahan Vidal-Gorène, Marguerite Vernet. Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches. International Conference on Document Analysis and Recognition 2021, 2021, Lausanne, Switzerland. pp.306-316, ⟨10.1007/978-3-030-86159-9_21⟩. ⟨hal-03279602⟩

Collections

PSL ENC CAMPUS-CONDORCET CJM
