Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation

  • Bar Eluz*
  • Tomasz Limisiewicz
  • Gabriel Stanovsky
  • David Mareček

  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer’s vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish “doctora” for “female doctor”) tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model’s training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality.
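To make the notion of subword splitting concrete, here is a minimal sketch, not the authors' code or tokenizer. It assumes a toy greedy longest-match splitter and a hypothetical vocabulary (`TOY_VOCAB`) in which the stereotypical forms "doctor" and "enfermera" are stored whole while "doctora" and "enfermero" are not. The function names and vocabulary are illustrative assumptions; real subword tokenizers are learned from corpus statistics and may split differently.

```python
def greedy_subword_split(word, vocab):
    """Split `word` by greedy longest-prefix match against `vocab`.

    Any character not covered by a vocabulary entry becomes its own token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical vocabulary (an assumption for illustration only):
# stereotypical forms are whole entries, the other forms are not.
TOY_VOCAB = {"doctor", "enfermera", "enfermer", "a", "o"}


def split_counts(pairs, vocab):
    """Return {form: number of subword tokens} for every form in `pairs`."""
    return {
        form: len(greedy_subword_split(form, vocab))
        for pair in pairs
        for form in pair
    }


counts = split_counts(
    [("doctor", "doctora"), ("enfermero", "enfermera")], TOY_VOCAB
)
for form, n_tokens in counts.items():
    print(f"{form}: {n_tokens} token(s)")
```

In this toy setting the forms absent from the vocabulary come out as two tokens each, which mirrors the kind of asymmetry the abstract describes; the counts say nothing about any real model's vocabulary.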

Original language: English
Title of host publication: Long Papers
Editors: Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, Adila Alfa Krisnadhi
Publisher: Association for Computational Linguistics (ACL)
Pages: 885-896
Number of pages: 12
ISBN (Electronic): 9798891760134
State: Published - 2023
Event: 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL 2023 - Bali, Indonesia
Duration: 1 Nov 2023 to 4 Nov 2023

Publication series

Name: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Long Papers, IJCNLP-AACL 2023
Volume: 1

Conference

Conference: 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL 2023
Country/Territory: Indonesia
City: Bali
Period: 1/11/23 to 4/11/23

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.
