Journal | Armeniaca
Journal issue | 1 | 2022
Research Article | From Manuscript to Tagged Corpora

From Manuscript to Tagged Corpora

An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East

Bastien Kindt - Université Catholique de Louvain, Belgique
Chahan Vidal-Gorène - École Nationale des Chartes, Paris

Abstract

Creating a digital corpus enriched with full linguistic annotation is a task that traditionally involves several manual steps of data acquisition, processing, and display. Processing presupposes dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI, which enables the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation using digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).

Open access | Peer reviewed

Submitted: Dec. 20, 2021 | Accepted: March 23, 2022 | Published: Oct. 28, 2022 | Language: en

Keywords: Handwritten text recognition • Computational philology • Lemmatisation • Armenian • Tagged corpora • Morphosyntactic analysis

Copyright © 2022 Bastien Kindt, Chahan Vidal-Gorène. This is an open-access work distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction is permitted, provided that the original author(s) and the copyright owner(s) are credited and that the original publication is cited, in accordance with accepted academic practice. The license allows for commercial use. No use, distribution or reproduction is permitted which does not comply with these terms.

Permalink http://doi.org/10.30687/arm/9372-8175/2022/01/005


Note from the Editors-in-Chief
Aldo Ferrari, Alessandro Orengo, Zara Pogossian, Anna Sirinian
Oct. 28, 2022

Introduction
Armenia(n) Through the Ages
Robin Meyer, Irene Tinti
Oct. 28, 2022

The Anonymous Saint in the Armenian Tradition
Alexi(an)os the Voluntary Pauper or the Anonymous ‘Man of God’?
Anna Rogozhina
Oct. 28, 2022

The Poetic Middle Armenian of Kafas in the Alexander Romance
Alex MacFarlane
Oct. 28, 2022

A Brief Introduction to Harsnerēn
Carla Kekejian
Oct. 28, 2022

From Manuscript to Tagged Corpora
An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East
Bastien Kindt, Chahan Vidal-Gorène
Oct. 28, 2022

A New Look at Old Armenisms in Kartvelian
Rasmus Thorsø
Oct. 28, 2022

Classical Armenian Deixis
Issues of Translation
Hana Aghababian
Oct. 28, 2022

Grammaticalization of the Definite Article in Armenian
Katherine Hodgson
Oct. 28, 2022

The Forms of the Indefinite Article in Eastern Armenian
Pre-Modern, Early and Colloquial Eastern Armenian Sources
Hasmik Sargsyan
Oct. 28, 2022

Constructions clivées en arménien moderne
Victoria Khurshudyan, Anaïd Donabedian
Oct. 28, 2022

The Armenian-Italian Joint Expedition at Dvin
Report of 2021 Activities
Hamlet Petrosyan, Michele Nucciotti, Elisa Pruno, Leonardo Squilloni, Lyuba Kirakosyan, Tatyana Vardanesova
Oct. 28, 2022

DC Field	Value
dc.identifier	ECF_article_7618
dc.title	From Manuscript to Tagged Corpora. An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East
dc.contributor.author	Kindt Bastien
dc.contributor.author	Vidal-Gorène Chahan
dc.publisher	Edizioni Ca’ Foscari - Venice University Press, Fondazione Università Ca’ Foscari
dc.type	Research Article
dc.language.iso	en
dc.identifier.uri	http://edizionicafoscari.it/en/edizioni4/riviste/armeniaca/2022/1/from-manuscript-to-tagged-corpora/
dc.relation.ispartof	Armeniaca
dc.relation.ispartof	Vol. 1 – October 2022
dc.issued	2022-10-28
dc.dateAccepted	2022-03-23
dc.dateSubmitted	2021-12-20
dc.identifier.issn
dc.identifier.eissn	2974-6051
dc.rights	Creative Commons Attribution 4.0 International Public License
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.identifier.doi	10.30687/arm/9372-8175/2022/01/005
dc.peer-review	yes
dc.subject	Armenian
dc.subject	Computational philology
dc.subject	Handwritten text recognition
dc.subject	Lemmatisation
dc.subject	Morphosyntactic analysis
dc.subject	Tagged corpora
