TY - JOUR
T1 - Multi-Dimensional Validation of the Integration of Syntactic and Semantic Distance Measures for Clustering Fibromyalgia Patients in the Rheumatic Monitor Big Data Study
AU - Goldstein, Ayelet
AU - Shahar, Yuval
AU - Weisman Raymond, Michal
AU - Peleg, Hagit
AU - Ben-Chetrit, Eldad
AU - Ben-Yehuda, Arie
AU - Shalom, Erez
AU - Goldstein, Chen
AU - Shiloh, Shmuel Shay
AU - Almoznino, Galit
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/1
Y1 - 2024/1
N2 - This study primarily aimed at developing a novel multi-dimensional methodology to discover and validate the optimal number of clusters. The secondary objective was to deploy it for the task of clustering fibromyalgia patients. We present a comprehensive methodology that includes the use of several different clustering algorithms, quality assessment using several syntactic distance measures (the Silhouette Index (SI), Calinski–Harabasz index (CHI), and Davies–Bouldin index (DBI)), stability assessment using the adjusted Rand index (ARI), and the validation of the internal semantic consistency of each clustering option via the performance of multiple clustering iterations after the repeated bagging of the data to select multiple partial data sets. Then, we perform a statistical analysis of the (clinical) semantics of the most stable clustering options using the full data set. Finally, the results are validated through a supervised machine learning (ML) model that classifies the patients back into the discovered clusters and is interpreted by calculating the Shapley additive explanations (SHAP) values of the model. Thus, we refer to our methodology as the clustering, distance measures and iterative statistical and semantic validation (CDI-SSV) methodology. We applied our method to the analysis of a comprehensive data set acquired from 1370 fibromyalgia patients. The results demonstrate that the K-means was highly robust in the syntactic and the internal consistent semantics analysis phases and was therefore followed by a semantic assessment to determine the optimal number of clusters (k), which suggested k = 3 as a more clinically meaningful solution, representing three distinct severity levels. the random forest model validated the results by classification into the discovered clusters with high accuracy (AUC: 0.994; accuracy: 0.946). SHAP analysis emphasized the clinical relevance of "functional problems" in distinguishing the most severe condition. In conclusion, the CDI-SSV methodology offers significant potential for improving the classification of complex patients. Our findings suggest a classification system for different profiles of fibromyalgia patients, which has the potential to improve clinical care, by providing clinical markers for the evidence-based personalized diagnosis, management, and prognosis of fibromyalgia patients.
AB - This study primarily aimed at developing a novel multi-dimensional methodology to discover and validate the optimal number of clusters. The secondary objective was to deploy it for the task of clustering fibromyalgia patients. We present a comprehensive methodology that includes the use of several different clustering algorithms, quality assessment using several syntactic distance measures (the Silhouette Index (SI), Calinski–Harabasz index (CHI), and Davies–Bouldin index (DBI)), stability assessment using the adjusted Rand index (ARI), and the validation of the internal semantic consistency of each clustering option via the performance of multiple clustering iterations after the repeated bagging of the data to select multiple partial data sets. Then, we perform a statistical analysis of the (clinical) semantics of the most stable clustering options using the full data set. Finally, the results are validated through a supervised machine learning (ML) model that classifies the patients back into the discovered clusters and is interpreted by calculating the Shapley additive explanations (SHAP) values of the model. Thus, we refer to our methodology as the clustering, distance measures and iterative statistical and semantic validation (CDI-SSV) methodology. We applied our method to the analysis of a comprehensive data set acquired from 1370 fibromyalgia patients. The results demonstrate that the K-means was highly robust in the syntactic and the internal consistent semantics analysis phases and was therefore followed by a semantic assessment to determine the optimal number of clusters (k), which suggested k = 3 as a more clinically meaningful solution, representing three distinct severity levels. the random forest model validated the results by classification into the discovered clusters with high accuracy (AUC: 0.994; accuracy: 0.946). SHAP analysis emphasized the clinical relevance of "functional problems" in distinguishing the most severe condition. In conclusion, the CDI-SSV methodology offers significant potential for improving the classification of complex patients. Our findings suggest a classification system for different profiles of fibromyalgia patients, which has the potential to improve clinical care, by providing clinical markers for the evidence-based personalized diagnosis, management, and prognosis of fibromyalgia patients.
KW - Big Data
KW - K-means
KW - cluster analysis
KW - fibromyalgia
KW - machine learning algorithm
KW - rheumatic diseases
UR - https://www.scopus.com/pages/publications/85183140421
U2 - 10.3390/bioengineering11010097
DO - 10.3390/bioengineering11010097
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 38275577
AN - SCOPUS:85183140421
SN - 2306-5354
VL - 11
JO - Bioengineering
JF - Bioengineering
IS - 1
M1 - 97
ER -