From Manuscript to Tagged Corpora

Bastien Kindt; Chahan Vidal-Gorène

doi:10.30687/arm/2974-6051/2022/01/005

Journal article. Armeniaca - International Journal of Armenian Studies. Year: 2022

Bastien Kindt

Role: Author
PersonId: 980342

Université Catholique de Louvain = Catholic University of Louvain

Chahan Vidal-Gorène

Role: Author
PersonId: 743619
IdHAL: chahan-vidal-gorene
ORCID: 0000-0003-1567-6508

Centre Jean Mabillon

Calfa

Abstract

Building a digital corpus enriched with full linguistic annotation traditionally involves several manual steps: data acquisition, processing, and display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI that together handle the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation using digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).

Keywords

HTR; Armenian; Historical documents; Lemmatization; POS-tagging; Digital Humanities

Domains

Methods and statistics; Humanities and Social Sciences

https://enc.hal.science/hal-03971613

Submitted on: Friday, 3 February 2023, 11:52:53

Last modified on: Friday, 19 April 2024, 16:18:58

Dates and versions

hal-03971613, version 1 (03-02-2023)

Identifiers

HAL Id: hal-03971613, version 1
DOI: 10.30687/arm/2974-6051/2022/01/005

Cite

Bastien Kindt, Chahan Vidal-Gorène. From Manuscript to Tagged Corpora: An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East. Armeniaca - International Journal of Armenian Studies, 2022, 1, pp.73-96. ⟨10.30687/arm/2974-6051/2022/01/005⟩. ⟨hal-03971613⟩

Collections

PSL ENC CAMPUS-CONDORCET CJM
