Abstract
Background: Despite progress in breast cancer screening, many women are still diagnosed at an advanced stage. We sought to develop a short-term (one-year) prediction model for breast cancer risk, based on readily available data from electronic medical records (EMRs), to support decision-making.

Methods: We conducted a retrospective cohort study using data from 1,039,212 members of a large healthcare organization, covering the years 1985 to 2021. During the study period, 18,959 people were diagnosed with breast cancer. Longitudinal personal medical information, including demographics, cancer-related family history, smoking habits, medical history, fertility treatments, surgeries, biopsies, medications, BMI, blood pressure, and lab tests, was used to predict the outcome: a breast cancer diagnosis one year from the recorded data. Prediction models were trained using CatBoost, a gradient-boosted decision tree method. SHapley Additive exPlanations (SHAP) values were used to estimate the marginal impact of each feature on the model performance, considering the other features.

Results: The model includes numerous features, available from the EMR, that are not used in existing breast cancer risk models (e.g., medications, systolic blood pressure, and TSH levels). The most informative features, ranked by SHAP values, include age, the number of surgical consultations, and the number of breast biopsies. The model achieved high performance, with an area under the ROC curve (AUC-ROC) of 0.85.

Conclusions: Data readily available from the EMR can assist clinicians in assessing short-term breast cancer risk.
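
The two evaluation ideas named in the abstract, AUC-ROC and Shapley-style feature attribution, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the study used CatBoost with SHAP on EMR data, whereas the functions `auc_roc` and `shapley_values`, the toy scoring function `toy_risk`, and all numbers below are hypothetical and chosen only to make the arithmetic easy to follow. The Shapley computation here is the exact, exponential-time definition over a hand-written set function, not the tree-based approximation used in practice.

```python
from itertools import combinations
from math import factorial


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset of indices).

    Each feature's value is its marginal contribution, averaged over
    all subsets of the other features with the standard Shapley weights."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(present):
    """Hypothetical risk score: features 0 and 1 each contribute on their
    own and also interact; feature 2 contributes nothing."""
    score = 0.0
    if 0 in present:
        score += 0.1
    if 1 in present:
        score += 0.2
    if 0 in present and 1 in present:
        score += 0.4
    return score


# Made-up labels and scores: three of the four positive/negative pairs
# are ranked correctly.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# The interaction term is split evenly between features 0 and 1,
# giving approximately [0.3, 0.4, 0.0].
print([round(v, 3) for v in shapley_values(toy_risk, 3)])
```

Read this way, an AUC-ROC of 0.85 means that a randomly chosen diagnosed case is scored above a randomly chosen non-case about 85% of the time, and a feature's Shapley value reflects its contribution after accounting for the other features.
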
| Original language | English |
|---|---|
| Pages (from-to) | 774-781.e4 |
| Journal | Clinical Breast Cancer |
| Volume | 25 |
| Issue number | 8 |
| DOIs | |
| State | Published - Dec 2025 |
Bibliographical note
Publisher Copyright: © 2025 The Authors
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- Big data
- CatBoost
- Machine learning
- Risk model
Fingerprint
Research topics of 'Short-Term Prediction Model for Breast Cancer Risk Based on One Million Medical Records'.