ALIGN WITH PURPOSE: OPTIMIZE DESIRED PROPERTIES IN CTC MODELS WITH A GENERAL PLUG-AND-PLAY FRAMEWORK

Eliya Segev, Maya Alroy, Ronen Katsir, Noam Wies, Ayana Shenhav, Yael Ben-Oren, David Zar, Oren Tadmor, Jacob Bitterman, Amnon Shashua, Tal Rosenwein

Research output: Contribution to conference › Paper › peer-review

Abstract

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It learns the alignments between the input and output sequences by marginalizing over the perfect alignments (that yield the ground truth), at the expense of the imperfect ones. This dichotomy, and in particular the equal treatment of all perfect alignments, results in a lack of controllability over the predicted alignments. This controllability is essential for capturing properties that hold significance in real-world applications. Here we propose Align With Purpose (AWP), a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC loss with an additional loss term that prioritizes alignments according to a desired property. AWP does not require any intervention in the CTC loss function, and allows differentiating between both perfect and imperfect alignments for a variety of properties. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of the training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: token emission time for latency optimization and word error rate (WER). For the former, we report an improvement of up to 590ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% in WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on this scale of data. Notably, our method can be easily implemented using only a few lines of code and can be extended to other alignment-free loss functions and to domains other than ASR.
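
As an illustration of what "complementing the CTC loss with an additional loss term" might look like, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the form of the extra term, so a hinge (margin) loss between an alignment and its property-improved version is assumed. The names `shift_earlier`, `awp_hinge_loss`, `total_loss`, `margin`, and `alpha` are hypothetical, and the toy property function (emit tokens one frame earlier when the first frame is blank) only stands in for a latency-oriented property.

```python
def collapse(alignment, blank=0):
    """CTC collapse: merge repeated tokens, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


def shift_earlier(alignment, blank=0):
    """Hypothetical property function (an assumption, not from the source):
    if the first frame is blank, drop it and pad a blank at the end, so every
    token is emitted one frame earlier. The collapsed labels are unchanged."""
    if alignment and alignment[0] == blank:
        return list(alignment[1:]) + [blank]
    return list(alignment)


def alignment_log_prob(log_probs, alignment):
    """Sum of per-frame log-probabilities along one frame-level alignment."""
    return sum(frame[tok] for frame, tok in zip(log_probs, alignment))


def awp_hinge_loss(log_probs, alignments, prop_fn=shift_earlier, margin=0.1):
    """Assumed extra term: push the improved alignment to be at least
    `margin` more likely (in log space) than the original one."""
    losses = []
    for a in alignments:
        better = prop_fn(a)
        gap = alignment_log_prob(log_probs, a) - alignment_log_prob(log_probs, better)
        losses.append(max(0.0, margin + gap))
    return sum(losses) / len(losses)


def total_loss(ctc_loss, awp_loss, alpha=0.5):
    """The CTC loss is left untouched; the extra term is simply added."""
    return ctc_loss + alpha * awp_loss
```

In this sketch the CTC loss enters only as a precomputed number, which mirrors the claim that no intervention in the CTC loss function is needed. Swapping `prop_fn` for a different function is how a different property would be targeted under these assumptions.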

Original language: English
State: Published - 2024
Event: 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria
Duration: 7 May 2024 → 11 May 2024

Conference

Conference: 12th International Conference on Learning Representations, ICLR 2024
Country/Territory: Austria
City: Hybrid, Vienna
Period: 7/05/24 → 11/05/24

Bibliographical note

Publisher Copyright:
© 2024 12th International Conference on Learning Representations, ICLR 2024. All rights reserved.
