Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts

Thibault Clérice

Conference paper, Year: 2022

Thibault Clérice (1, 2, 3, 4)

Role: Author
PersonId: 15153
IdHAL: thibault-clerice
ORCID: 0000-0003-1852-9204
IdRef: 221533877

1 École nationale des chartes
2 Centre Jean Mabillon
3 Histoire et Sources des Mondes antiques
4 Université Paris Sciences et Lettres

Abstract

As more and more projects openly release ground truth for handwritten text recognition (HTR), we expect the quality of automatic transcription to improve on unseen data. Making models robust to scribal and material changes is a necessary step for specific data mining tasks. However, evaluating HTR results requires ground truth against which predictions can be compared statistically. For modern languages, quality has been successfully estimated using lexical features or n-grams. This, however, proves difficult given the spelling variation found in both Old French and Latin, and even more so with manuscripts that are sometimes heavily abbreviated. We propose a new method based on deep learning in which we attempt to classify the error rate of each line into one of four error rate ranges (0 < 10% < 25% < 50% < 100%) using three different encoders (GRU with Attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach using early-stopped models, as an alternative to rule-based fake predictions. Our model largely outperforms the n-gram approach. We also provide an example application to qualitatively analyse our classifier, applying it to new predictions on a sample of 1,800 manuscripts ranging from the 9th century to the 15th.
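To make the four error rate ranges concrete, the following is a minimal sketch of how a line-level character error rate could be mapped to a class label when ground truth is available, for example when building training data. It is not the author's implementation. The function names (`levenshtein`, `character_error_rate`, `error_rate_bin`), the use of character-level edit distance, and the half-open bin boundaries (a value of exactly 10% falls into the second range) are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalised by reference length (can exceed 1.0)."""
    if not reference:
        # Assumption: an empty reference yields 0.0 only for an empty prediction.
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


# Assumed upper bounds of the first three ranges; the fourth range is everything above.
BIN_UPPER_BOUNDS = (0.10, 0.25, 0.50)


def error_rate_bin(cer: float) -> int:
    """Return a class index from 0 (lowest error range) to 3 (highest)."""
    for index, upper in enumerate(BIN_UPPER_BOUNDS):
        if cer < upper:
            return index
    return len(BIN_UPPER_BOUNDS)
```

Under these assumptions, a predicted line paired with its reference would be labelled with `error_rate_bin(character_error_rate(prediction, reference))`, and a classifier would then be trained to predict that label from the predicted line alone.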

Keywords

HTR, OCR Quality Evaluation, Historical languages, Spelling Variation

Domains

Computer Science [cs], Literature

Main file

CHR2022___State_of_HTR.pdf (5.68 MB)

Origin: Files produced by the author(s)

https://enc.hal.science/hal-03828529

Submitted on: Tuesday, October 25, 2022, 12:52:47

Last modified on: Friday, April 19, 2024, 16:18:58

Long-term archiving on: Friday, January 27, 2023, 07:37:43

Dates and versions

hal-03828529, version 1 (25-10-2022)

Identifiers

HAL Id: hal-03828529, version 1

Cite

Thibault Clérice. Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts. Computational Humanities Research Conference (CHR) 2022, Dec 2022, Antwerp, Belgium. ⟨hal-03828529⟩

Collections

UNIV-ST-ETIENNE ENS-LYON UNIV-LYON3 CNRS UNIV-LYON2 MOM PSL HISOMA DIM-MAP ENC CAMPUS-CONDORCET CJM UDL
