NLP in the DH pipeline: Transfer-learning to a Chronolect

Aynat Rubinstein, Avi Shmidman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


A major unknown in Digital Humanities (DH) projects that seek to analyze previously untouched corpora is how to adapt existing Natural Language Processing (NLP) resources to the specific nature of the target corpus. In this paper, we study the case of Emergent Modern Hebrew (EMH), an under-resourced chronolect of the Hebrew language. The resource we seek to adapt, a diacritizer, exists for both earlier and later chronolects of the language. Given a small annotated corpus of our target chronolect, we demonstrate that applying transfer-learning from either of the chronolects is preferable to training a new model from scratch. Furthermore, we consider just how much annotated data is necessary. For our task, we find that even a minimal corpus of 50K tokens provides a noticeable gain in accuracy. At the same time, we also evaluate accuracy at three additional increments, in order to quantify the gains that can be expected by investing in a larger annotated corpus.
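The core idea of the paper, fine-tuning a model pretrained on a resource-rich chronolect rather than training from scratch on a tiny target corpus, can be sketched with a deliberately simplified toy model. This is not the authors' actual diacritizer; the count-based predictor and all data below are invented for illustration only.

```python
# Toy illustration (NOT the paper's model): transfer-learning a
# character-level diacritic predictor from a source chronolect to a
# target chronolect with very little annotated target data.
from collections import Counter, defaultdict

def train(pairs, model=None):
    """Count-based predictor mapping a consonant to its observed marks.
    If `model` is given, its counts are fine-tuned (transfer learning)
    instead of starting from an empty model."""
    model = model if model is not None else defaultdict(Counter)
    for consonant, mark in pairs:
        model[consonant][mark] += 1
    return model

def predict(model, consonant):
    """Return the most frequent mark for a consonant, or None if unseen."""
    marks = model.get(consonant)
    return marks.most_common(1)[0][0] if marks else None

# Large "source chronolect" corpus (invented data).
source = [("b", "a")] * 90 + [("b", "e")] * 10 + [("g", "i")] * 50

# Tiny "target chronolect" corpus: too small to cover consonant 'g'.
target = [("b", "e")] * 3

scratch = train(target)                      # trained from scratch
transferred = train(target, train(source))   # source model + target data

print(predict(scratch, "g"))      # None: 'g' unseen in the small corpus
print(predict(transferred, "g"))  # 'i': inherited from the source model
```

The transferred model covers items the small target corpus never attests, which mirrors (in caricature) why the paper finds transfer from a related chronolect preferable to training from scratch on 50K tokens.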
Original language: American English
Title of host publication: Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Publisher: NLP Association of India (NLPAI)
Number of pages: 5
State: Published - 2021


