You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models

Tomasz Limisiewicz, Dan Malkin, Gabriel Stanovsky

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we use monolingual teacher models, each optimized for its own language. We use those teachers, along with balanced (sub-sampled) data, to distill the teachers' knowledge into a single multilingual student. Our method outperforms standard training methods on low-resource languages and retains performance on high-resource languages.
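
The abstract does not state the exact training objective, so the following is only a minimal sketch of the general idea, assuming a common distillation choice: the student is trained to minimize the KL divergence between each monolingual teacher's output distribution and its own, over data sub-sampled to the same size per language. All names here (`subsample_balanced`, `kl_distillation_loss`, `batch_loss`, the toy teachers and student) are hypothetical illustrations and are not taken from the paper.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one prediction (assumed objective)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def subsample_balanced(corpora, per_language, seed=0):
    """Draw at most `per_language` examples from every language's corpus."""
    rng = random.Random(seed)
    return {
        lang: rng.sample(examples, min(per_language, len(examples)))
        for lang, examples in corpora.items()
    }


def batch_loss(batch, teachers, student, temperature=1.0):
    """Average distillation loss over (language, example) pairs.

    Each example is scored by the teacher for its own language and by the
    single shared multilingual student.
    """
    losses = [
        kl_distillation_loss(teachers[lang](x), student(x), temperature)
        for lang, x in batch
    ]
    return sum(losses) / len(losses)


# Toy stand-ins for real models: each maps an example to 3 "vocabulary" logits.
teachers = {
    "en": lambda x: [2.0, 0.5, 0.1],
    "sw": lambda x: [0.1, 0.5, 2.0],
}
student = lambda x: [1.0, 1.0, 1.0]

corpora = {"en": list(range(100)), "sw": list(range(10))}
balanced = subsample_balanced(corpora, per_language=5)
batch = [(lang, x) for lang, xs in balanced.items() for x in xs]
loss = batch_loss(batch, teachers, student)
```

In a real setup the callables would be neural language models and the loss would be backpropagated through the student only; the sketch shows just the data balancing and the per-language teacher lookup.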

Original language: English
Title of host publication: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop
Editors: Lisa Beinborn, Koustava Goswami, Saliha Muradoglu, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Edoardo M. Ponti, Ryan Cotterell, Ekaterina Vylomova
Publisher: Association for Computational Linguistics
Pages: 1-11
Number of pages: 11
ISBN (Electronic): 9781959429562
State: Published - 2023
Event: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Hybrid, Dubrovnik, Croatia
Duration: 6 May 2023 → …

Publication series

Name: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop

Conference

Conference: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory: Croatia
City: Hybrid, Dubrovnik
Period: 6/05/23 → …

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.
