Journal | Armeniaca
Journal issue | 1 | 2022
Research Article | From Manuscript to Tagged Corpora
Abstract
Creating a digital corpus enriched with full linguistic annotation traditionally involves several manual steps: data acquisition, processing, and display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language attested in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI that together handle the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation, using digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).
Submitted: Dec. 20, 2021 | Accepted: March 23, 2022 | Published: Oct. 28, 2022 | Language: en
Keywords Morphosyntactic analysis • Computational philology • Tagged corpora • Lemmatisation • Armenian • Handwritten text recognition
Copyright © 2022 Bastien Kindt, Chahan Vidal-Gorène. This is an open-access work distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction is permitted, provided that the original author(s) and the copyright owner(s) are credited and that the original publication is cited, in accordance with accepted academic practice. The license allows for commercial use. No use, distribution or reproduction is permitted which does not comply with these terms.
Permalink http://doi.org/10.30687/arm/9372-8175/2022/01/005