Journal | Armeniaca
Issue | 1 | 2022
Article | From Manuscript to Tagged Corpora

From Manuscript to Tagged Corpora

An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East

Bastien Kindt - Université Catholique de Louvain, Belgium
Chahan Vidal-Gorène - École Nationale des Chartes, Paris

Abstract

Creating a digital corpus enriched with full linguistic annotation traditionally involves several manual steps of data acquisition, processing, and display. Processing presupposes the existence of dedicated, specialised analysis tools adapted to the state of the language used in the corpus. This paper describes a semi-supervised process for building Armenian corpora from scanned documents. The method relies on a chain of applications pre-trained by Calfa and GREgORI, which enables the complete processing of texts, from automated input to linguistic analysis and data display. We assess this methodology and the benefits of model specialisation using digitised copies of a 17th-century manuscript of the Four Gospels (Walters MS W541 = BAL W541, Amida Gospels, ff. 113v-117r: Lk 1:1‑78).

Open access | Peer reviewed

Submitted: 20 December 2021 | Accepted: 23 March 2022 | Published: 28 October 2022 | Language: en

Keywords: Computational philology • Handwritten text recognition • Lemmatisation • Morphosyntactic analysis • Armenian • Tagged corpora

Copyright © 2022 Bastien Kindt, Chahan Vidal-Gorène. This is an open-access work distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction is permitted, provided that the original author(s) and the copyright owner(s) are credited and that the original publication is cited, in accordance with accepted academic practice. The license allows for commercial use. No use, distribution or reproduction is permitted which does not comply with these terms.

Permalink http://doi.org/10.30687/arm/9372-8175/2022/01/005


Note from the Editors-in-Chief
Aldo Ferrari, Alessandro Orengo, Zara Pogossian, Anna Sirinian
28 October 2022

Introduction
Armenia(n) Through the Ages
Robin Meyer, Irene Tinti
28 October 2022

Շքակոխեմ զմեր զփրկութիւնն, hapax nella traduzione armena dell’Epideixis di Sant’Ireneo di Lione: ‘gettare sopra come ombra la nostra salvezza’
Clara Sanvito
28 October 2022

The Anonymous Saint in the Armenian Tradition
Alexi(an)os the Voluntary Pauper or the Anonymous ‘Man of God’?
Anna Rogozhina
28 October 2022

The Poetic Middle Armenian of Kafas in the Alexander Romance
Alex MacFarlane
28 October 2022

A Brief Introduction to Harsnerēn
Carla Kekejian
28 October 2022

From Manuscript to Tagged Corpora
An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East
Bastien Kindt, Chahan Vidal-Gorène
28 October 2022

A New Look at Old Armenisms in Kartvelian
Rasmus Thorsø
28 October 2022

Classical Armenian Deixis
Issues of Translation
Hana Aghababian
28 October 2022

Grammaticalization of the Definite Article in Armenian
Katherine Hodgson
28 October 2022

The Forms of the Indefinite Article in Eastern Armenian
Pre-Modern, Early and Colloquial Eastern Armenian Sources
Hasmik Sargsyan
28 October 2022

Constructions clivées en arménien moderne
Victoria Khurshudyan, Anaïd Donabedian
28 October 2022

The Armenian-Italian Joint Expedition at Dvin
Report of 2021 Activities
Hamlet Petrosyan, Michele Nucciotti, Elisa Pruno, Leonardo Squilloni, Lyuba Kirakosyan, Tatyana Vardanesova
28 October 2022

DC Field	Value
dc.identifier	ECF_article_7618
dc.title	From Manuscript to Tagged Corpora. An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East
dc.contributor.author	Kindt Bastien
dc.contributor.author	Vidal-Gorène Chahan
dc.publisher	Edizioni Ca’ Foscari - Venice University Press, Fondazione Università Ca’ Foscari
dc.type	Article
dc.language.iso	en
dc.identifier.uri	http://edizionicafoscari.it/it/edizioni4/riviste/armeniaca/2022/1/from-manuscript-to-tagged-corpora/
dc.relation.ispartof	Armeniaca
dc.relation.ispartof	Vol. 1 – October 2022
dc.issued	2022-10-28
dc.dateAccepted	2022-03-23
dc.dateSubmitted	2021-12-20
dc.identifier.issn
dc.identifier.eissn	2974-6051
dc.rights	Creative Commons Attribution 4.0 International Public License
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.identifier.doi	10.30687/arm/9372-8175/2022/01/005
dc.peer-review	yes
dc.subject	Armenian
dc.subject	Computational philology
dc.subject	Handwritten text recognition
dc.subject	Lemmatisation
dc.subject	Morphosyntactic analysis
dc.subject	Tagged corpora

Armeniaca International Journal of Armenian Studies
