Book chapter, Year: 2020

Lemmatization and POS-tagging process by using joint learning approach

Abstract

Classical Armenian, Old Georgian and Syriac are digitally under-resourced languages. Although many printed critical editions and dictionaries are available, there is currently a lack of fully tagged corpora that could be reused for automatic text analysis. In this paper, we introduce an ongoing project of lemmatization and POS-tagging for these languages, relying on a recurrent neural network (RNN), specific morphological tags and dedicated datasets. For this paper, we have combined different corpora previously processed by automatic out-of-context lemmatization and POS-tagging, and manual proofreading by the collaborators of the GREgORI Project (UCLouvain, Louvain-la-Neuve, Belgium). We intend to compare a rule-based approach and an RNN approach by using PIE specialized by Calfa (Paris, France). We present first results here: a mean accuracy of 91.63% in lemmatization and 92.56% in POS-tagging. The datasets constituted and used for this project are not yet representative of the variation of these languages across the centuries, but they are homogeneous and yield tangible results, paving the way for further analysis of wider corpora.
File not deposited

Dates and versions

hal-03971746, version 1 (03-02-2023)

Identifiers

  • HAL Id: hal-03971746, version 1

Cite

Chahan Vidal-Gorène, Bastien Kindt. Lemmatization and POS-tagging process by using joint learning approach: Experimental results on Classical Armenian, Old Georgian, and Syriac. Rachele Sprugnoli; Marco Passarotti. Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, European Language Resources Association (ELRA), pp.22-27, 2020. ⟨hal-03971746⟩