From Manuscript to Tagged Corpora - École nationale des chartes
Journal article. Armeniaca - International Journal of Armenian Studies. Year: 2022

From Manuscript to Tagged Corpora

Abstract

Creating a digital corpus enriched with full linguistic annotation traditionally involves several manual steps of acquisition, processing, and data display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI that together handle the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation on digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).
File not deposited

Dates and versions

hal-03971613, version 1 (03-02-2023)

Cite

Bastien Kindt, Chahan Vidal-Gorène. From Manuscript to Tagged Corpora: An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East. Armeniaca - International Journal of Armenian Studies, 2022, 1, pp. 73-96. ⟨10.30687/arm/2974-6051/2022/01/005⟩. ⟨hal-03971613⟩