We present models that complete missing text in transliterations of ancient Mesopotamian documents originally written on cuneiform clay tablets (2500 BCE - 100 CE). Because the tablets have deteriorated, scholars often rely on contextual cues to fill in missing parts of the text manually, a subjective and time-consuming process. We observe that this challenge can be formulated as a masked language modelling task, which is used mostly as a pretraining objective for contextualized language models. We then develop several architectures focusing on Akkadian, the lingua franca of the time. We find that, despite data scarcity (1M tokens), we can achieve state-of-the-art performance on missing-token prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
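The hit@5 figure reported in the abstract can be read as the share of masked positions whose correct token appears among a model's five highest-ranked candidates. The sketch below is only an illustration of that reading, not the authors' evaluation code: the function name `hit_at_k`, its input layout, and the toy tokens are all assumptions made for this example.

```python
from typing import Sequence


def hit_at_k(ranked_candidates: Sequence[Sequence[str]],
             gold_tokens: Sequence[str],
             k: int = 5) -> float:
    """Fraction of masked positions whose gold token is among the top-k candidates.

    ranked_candidates[i] is a hypothetical model's candidate list for masked
    position i, ordered from most to least likely (an assumed input format).
    """
    if len(ranked_candidates) != len(gold_tokens):
        raise ValueError("need exactly one candidate list per gold token")
    if not gold_tokens:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_tokens)
        if gold in candidates[:k]
    )
    return hits / len(gold_tokens)


# Toy data invented for illustration; these are placeholder strings, not real Akkadian.
predictions = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = ["b", "q", "o"]

score_at_2 = hit_at_k(predictions, gold, k=2)  # only the first position is a hit
score_at_3 = hit_at_k(predictions, gold, k=3)  # first and third positions are hits
```

With a larger `k` the score can only stay the same or increase, which is why hit@5 is a more lenient measure than top-1 accuracy.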
|Original language||American English|
|Title of host publication||EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings|
|Publisher||Association for Computational Linguistics (ACL)|
|Number of pages||10|
|State||Published - 2021|
|Event||2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021 - Virtual, Punta Cana, Dominican Republic|
Duration: 7 Nov 2021 → 11 Nov 2021
Bibliographical note
Funding Information:
We thank Ethan Fetaya and Shai Gordin for insightful discussions and suggestions and the anonymous reviewers for their helpful comments and feedback. This work was supported in part by a research gift from the Allen Institute for AI.
© 2021 Association for Computational Linguistics