TY - JOUR
T1 - Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data
AU - Markus, Dror K.
AU - Mor-Lan, Guy
AU - Sheafer, Tamir
AU - Shenhav, Shaul R.
N1 - Publisher Copyright: © 2023 The Author(s). Published with license by Taylor & Francis Group, LLC.
PY - 2023
Y1 - 2023
AB - As more computational communication researchers turn to supervised machine learning methods for text classification, we note the challenge in implementing such techniques within an imbalanced dataset. Such issues are critical in our domain, where, in many cases, researchers attempt to identify and study theoretically interesting categories that can be rare in a target corpus. Specifically, imbalanced distributions, with a skewed distribution of texts among the categories, can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue, and describe existing strategies for mitigating such challenges. Noting the pitfalls of previous solutions, we then provide a semi-supervised method–Expert Initiated Latent Space Sampling–that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome such limitations. Utilizing simulations to systematically evaluate our method and compare it to existing approaches, we show that our procedure offers significant advantages in terms of efficiency and accuracy in many classification tasks.
UR - http://www.scopus.com/inward/record.url?scp=85148623136&partnerID=8YFLogxK
U2 - 10.1080/19312458.2023.2182278
DO - 10.1080/19312458.2023.2182278
M3 - Article
AN - SCOPUS:85148623136
SN - 1931-2458
VL - 17
SP - 250
EP - 271
JO - Communication Methods and Measures
JF - Communication Methods and Measures
IS - 3
ER -