From Manuscript to Tagged Corpora


DC Field	Value
dc.identifier	ECF_article_7618
dc.title	From Manuscript to Tagged Corpora. An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East
dc.contributor.author	Kindt Bastien
dc.contributor.author	Vidal-Gorène Chahan
dc.publisher	Edizioni Ca’ Foscari - Venice University Press, Fondazione Università Ca’ Foscari
dc.type	Research Article
dc.language.iso	en
dc.identifier.uri	http://edizionicafoscari.it/en/edizioni4/riviste/armeniaca/2022/1/from-manuscript-to-tagged-corpora/
dc.relation.ispartof	Armeniaca
dc.relation.ispartof	Vol. 1 – October 2022
dc.issued	2022-10-28
dc.dateAccepted	2022-03-23
dc.dateSubmitted	2021-12-20
dc.identifier.eissn	2974-6051
dc.rights	Creative Commons Attribution 4.0 International Public License
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.identifier.doi	10.30687/arm/9372-8175/2022/01/005
dc.peer-review	yes
dc.subject	Armenian
dc.subject	Computational philology
dc.subject	Handwritten text recognition
dc.subject	Lemmatisation
dc.subject	Morphosyntactic analysis
dc.subject	Tagged corpora

Research Article
Open access, peer reviewed

An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East

Bastien Kindt, Université Catholique de Louvain, Belgium

Chahan Vidal-Gorène, École Nationale des Chartes, Paris


abstract

Creating a digital corpus enriched with full linguistic annotation traditionally involves several manual steps of acquisition, processing, and data display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI, which together enable the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology, and the benefits of model specialisation, on digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).

Published

Oct. 28, 2022

Accepted

March 23, 2022

Submitted

Dec. 20, 2021

Language

English

Keywords: Lemmatisation • Computational philology • Handwritten text recognition • Tagged corpora • Morphosyntactic analysis • Armenian

permalink http://doi.org/10.30687/arm/9372-8175/2022/01/005

Copyright: © 2022 Bastien Kindt, Chahan Vidal-Gorène. This is an open-access work distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction is permitted, provided that the original author(s) and the copyright owner(s) are credited and that the original publication is cited, in accordance with accepted academic practice. The license allows for commercial use. No use, distribution or reproduction is permitted which does not comply with these terms.
