The paper describes the creation of the first open-access, multi-genre historical corpus of Emergent Modern Hebrew, made possible by the implementation of digital humanities methods in corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open-access online repository of Hebrew literature, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including metadata markup that enables time-sensitive research on the corpus, linguistic and otherwise. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities.
Bibliographical note
Funding Information:
I wish to thank the three anonymous reviewers of this manuscript for their helpful comments. For invaluable discussion and feedback during all stages of the project, I am grateful to Sinai Rusinek. Thanks also to Meni Adler, Maayan Almagor, Yael Netzer, Avigail Tsirkin-Sadan, and Amir Zeldes. This research was supported by the Mandel Scholion Interdisciplinary Research Center in the Humanities and Jewish Studies at the Hebrew University of Jerusalem. I thank researchers at the Center for their support, especially Yael Reshef for enabling me to train research assistants of the “Emergence of Modern Hebrew” research group in the TEI format. Programming support by Itay Zandbank of The Research Software Company (https://www.chelem.co.il) is also gratefully acknowledged.
© 2019, Springer Nature B.V.
- Citizen science
- Digital humanities
- Historical corpora
- Language change