TY - JOUR
T1 - Short-Term Prediction Model for Breast Cancer Risk Based on One Million Medical Records
AU - Feinstein, Ofer
AU - Ofer, Dan
AU - Bachmat, Eitan
AU - Gazit, Sivan
AU - Linial, Michal
AU - Menes, Tehillah S.
N1 - Publisher Copyright: © 2025 The Authors
PY - 2025
Y1 - 2025
AB - Background: Despite progress in breast cancer screening, many women are diagnosed at an advanced stage. We sought to develop a short-term (one-year) prediction model for breast cancer risk, based on readily available data from electronic medical records (EMRs), to support decision-making. Methods: We conducted a retrospective cohort study using data from 1,039,212 members of a large healthcare organization between 1985 and 2021. During the study years, 18,959 people were diagnosed with breast cancer. Longitudinal personal medical information, including demographics, cancer-related family history, smoking habits, medical history, fertility treatments, surgeries, biopsies, medications, BMI, blood pressure, and lab tests, was used to predict the outcome: a breast cancer diagnosis one year from the recorded data. Prediction models were trained using the CatBoost decision tree methodology. SHapley Additive exPlanations (SHAP) values were used to estimate the marginal impact of each feature on model performance, given the other features. Results: The model includes numerous features available from the EMR that are not used in existing breast cancer risk models (e.g., medications, systolic blood pressure, and TSH levels). Informative features, ranked by SHAP values, include age, the number of surgical consultations, and the number of breast biopsies. The model achieved high performance, with an area under the ROC curve (AUC-ROC) of 0.85. Conclusions: Data readily available from the EMR can assist clinicians in assessing short-term breast cancer risk.
KW - Big data
KW - CatBoost
KW - Machine learning
KW - Risk model
UR - https://www.scopus.com/pages/publications/105014101810
U2 - 10.1016/j.clbc.2025.07.025
DO - 10.1016/j.clbc.2025.07.025
M3 - Article
C2 - 40849238
AN - SCOPUS:105014101810
SN - 1526-8209
JO - Clinical Breast Cancer
JF - Clinical Breast Cancer
ER -