Abstract
We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous work. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer’s vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish “doctora” for “female doctor”) tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model’s training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing translation quality.
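As a rough illustration of what "analyzing subword splits" could look like, the sketch below counts how many pieces each gender form of a profession name is split into. It is not the authors' method: the greedy longest-match splitter, the function names, and the toy vocabulary are all assumptions made for this example, and a real study would query the actual tokenizer of the translation model.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


def split_counts(pairs, vocab):
    """Map each (male, female) pair to the number of pieces per form."""
    return {
        (male, female): (
            len(greedy_subword_split(male, vocab)),
            len(greedy_subword_split(female, vocab)),
        )
        for male, female in pairs
    }


# Hypothetical toy vocabulary (an assumption, not taken from any real model):
# the stereotypical form of each profession is stored as a single token.
TOY_VOCAB = {"doctor", "a", "enfermer", "enfermera", "o"}

counts = split_counts(
    [("doctor", "doctora"), ("enfermero", "enfermera")],
    TOY_VOCAB,
)
for (male, female), (n_male, n_female) in counts.items():
    print(f"{male}: {n_male} piece(s), {female}: {n_female} piece(s)")
```

In this toy setting, a form that needs more pieces than its counterpart would be read as a hint that it was rarer in the data used to build the vocabulary, which is the kind of estimate the abstract describes.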
| Original language | English |
|---|---|
| Title of host publication | Long Papers |
| Editors | Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, Adila Alfa Krisnadhi |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 885-896 |
| Number of pages | 12 |
| ISBN (Electronic) | 9798891760134 |
| State | Published - 2023 |
| Event | 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL 2023 - Bali, Indonesia; Duration: 1 Nov 2023 → 4 Nov 2023 |
Publication series
| Name | Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Long Papers, IJCNLP-AACL 2023 |
|---|---|
| Volume | 1 |
Conference
| Conference | 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL 2023 |
|---|---|
| Country/Territory | Indonesia |
| City | Bali |
| Period | 1/11/23 → 4/11/23 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.