TAJA Corpus: Linguistically Tagged Written Algerian Judeo-Arabic Corpus

Ofra Tirosh-Becker, Oren M. Becker

Research output: Contribution to journal › Article › peer-review

Abstract

The Tagged Algerian Judeo-Arabic (TAJA) corpus is the first linguistically annotated corpus of any Judeo-Arabic dialect regardless of geography and period. The corpus is a genre-diverse collection of written Modern Algerian Judeo-Arabic texts, encompassing translations of the Bible and of liturgical texts, commentaries and original Judeo-Arabic books and journals. The TAJA corpus was manually annotated with part-of-speech (POS) tags and detailed morphology tags. The goal of the new corpus is twofold. First, it preserves this endangered Judeo-Arabic language, expanding on previous fieldwork and going beyond the study of individual written texts. The corpus has already enabled us to make strides towards a grammar of written Algerian Judeo-Arabic. Second, this tagged corpus serves as a foundation for the development of Judeo-Arabic-specific Natural Language Processing (NLP) tools, which allow automatic POS tagging and morphological annotation of large collections of as-yet-untapped texts in Algerian Judeo-Arabic and other Judeo-Arabic varieties.

Original language: English
Pages (from-to): 24-53
Number of pages: 30
Journal: Journal of Jewish Languages
Volume: 10
Issue number: 1
State: Published - 2022

Bibliographical note

Publisher Copyright:
© 2022 by Koninklijke Brill NV, Leiden, The Netherlands.

Keywords

  • Algeria
  • Judeo-Arabic
  • corpus linguistics
  • digital humanities
  • linguistic tagging
  • natural language processing (NLP)
