(245) Evaluation of Handling Degrees of Missingness in the Features of Machine Learning Algorithms to Predict Overall Survival Using Real-World Lung Cancer Data
Director Daiichi Sankyo Basking Ridge, United States
Background: Machine learning (ML) is an analytic method for real-world data, which often has missing values to varying degrees. Little information is available on how best to handle different levels of missingness among features for ML, for example in the prediction of overall survival (OS) in advanced lung cancer (mLC).
Objectives: To evaluate approaches for handling varying degrees of missingness among features for ML in the prediction of OS in mLC.
Methods: ML algorithms were trained and validated for predicting 90-day mortality in a retrospective cohort of adults, followed from the first recorded diagnosis of mLC in the large nationwide IQVIA oncology EMR US database from 2015 to 2021. Seventy-nine baseline features were assessed, including demographics, vital signs, stage, TNM, histology, biomarkers (e.g., EGFR, HER2, KRAS, BRAF, and cMET), chemotherapy, targeted therapy and immunotherapy, and functional labs. For the basic scenario, the following steps were conducted: i) Data were cleaned by removing extreme outlier values; ii) Single imputation was applied, using the median for continuous features and a new "missing" category for categorical features; iii) From the full cohort (Cfull), three additional analytic cohorts (C25, C50 and C75) were created by keeping only features with missingness proportions < 25%, < 50% and < 75%, respectively, for continuous variables; iv) Each cohort was split 70/30 into training and testing sets; v) ML algorithms including Random Forest (RF) and eXtreme Gradient Boosted Tree (XGBoost) were used. Metrics used to evaluate the performance of the ML models included Area Under the Curve (AUC), accuracy, logloss, RMSE, KS and F1 score.
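Steps ii) to iv) can be illustrated with a minimal sketch in plain Python. This is not the study's code: the toy data, the feature names (`age`, `albumin`, `pdl1`, `ecog`, `histology`), the label name `died_90d`, and the helper functions are all hypothetical. The sketch assumes that missingness is measured on the raw data before imputation, that the thresholds apply only to continuous features, and that a continuous feature with no observed values is filled with 0.0.

```python
import random
from statistics import median

# Hypothetical cohort names mapped to missingness cut-offs; Cfull keeps every feature.
THRESHOLDS = {"Cfull": float("inf"), "C75": 0.75, "C50": 0.50, "C25": 0.25}


def missing_fraction(rows, feature):
    """Share of rows in which `feature` is missing (None)."""
    return sum(r[feature] is None for r in rows) / len(rows)


def build_cohort(rows, continuous, categorical, label, threshold):
    """Keep continuous features with missingness < threshold, then impute."""
    kept = [f for f in continuous if missing_fraction(rows, f) < threshold]
    medians = {}
    for f in kept:
        observed = [r[f] for r in rows if r[f] is not None]
        # Assumption: fall back to 0.0 when a feature has no observed values.
        medians[f] = median(observed) if observed else 0.0
    cohort = []
    for r in rows:
        new = {label: r[label]}
        for f in kept:  # single imputation with the median
            new[f] = medians[f] if r[f] is None else r[f]
        for f in categorical:  # explicit category for missing values
            new[f] = "Missing" if r[f] is None else r[f]
        cohort.append(new)
    return kept, cohort


def split_70_30(rows, seed=0):
    """Shuffle reproducibly and split into 70% training / 30% testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def toy_rows():
    """Ten made-up patients with 0%, 30%, 60% and 80% missing continuous features."""
    return [
        {
            "age": 60 + i,
            "albumin": None if i < 3 else 30 + i,
            "pdl1": None if i < 6 else 10 * i,
            "ecog": None if i < 8 else 1.0,
            "histology": None if i % 5 == 0 else "adeno",
            "died_90d": 1 if i == 0 else 0,
        }
        for i in range(10)
    ]


rows = toy_rows()
CONTINUOUS = ["age", "albumin", "pdl1", "ecog"]
CATEGORICAL = ["histology"]

cohorts = {}
for name, threshold in THRESHOLDS.items():
    kept, data = build_cohort(rows, CONTINUOUS, CATEGORICAL, "died_90d", threshold)
    train_rows, test_rows = split_70_30(data)
    cohorts[name] = {"features": kept, "train": train_rows, "test": test_rows}
    print(name, kept, len(train_rows), len(test_rows))
```

In a real analysis the medians would more safely be estimated on the training split only and then applied to the testing split; the sketch imputes before splitting because that is the order in which the steps are listed.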
Results: Among 19,751 adults with mLC included in the full study cohort, 31.2% were 75+ years old (median=69 and IQR=62-76 years) and 52.2% were male. Stages IIIB, IIIC and IV were recorded in 12.3%, 1.4% and 86.3% of patients, respectively. Nine percent of patients died within the 90-day follow-up. The number of features dropped from 79 in the full cohort to 58, 52 and 43 for the C75, C50 and C25 cohorts, respectively. AUCs of 0.79, 0.77, 0.75 and 0.74 from XGBoost and of 0.73, 0.71, 0.70 and 0.58 from RF were observed for the Cfull, C75, C50 and C25 cohorts, respectively. Similar AUC trends across the four cohorts were found with other ML models, including AVG blender, LightGBM and residual neural networks, as well as with models using various parameters. Detailed results for different scenarios and ML models will be discussed.
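For readers unfamiliar with the metrics behind these comparisons, the following plain-Python sketch shows one common way to compute them for a binary outcome. It is illustrative only and not the study's implementation: the function names, the toy labels and scores, and the 0.5 classification cut-off are assumptions, and RMSE is taken here to mean the root mean squared difference between predicted probability and observed outcome.

```python
import math


def _split_scores(y_true, scores):
    """Separate scores of positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return pos, neg


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos, neg = _split_scores(y_true, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between the score distributions of negatives and positives."""
    pos, neg = _split_scores(y_true, scores)
    return max(
        abs(sum(p <= t for p in pos) / len(pos) - sum(n <= t for n in neg) / len(neg))
        for t in set(scores)
    )


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def rmse(y_true, probs):
    """Root mean squared error between outcome and predicted probability."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, probs)) / len(y_true))


def f1_and_accuracy(y_true, probs, cutoff=0.5):
    """F1 score and accuracy after thresholding probabilities at `cutoff`."""
    pred = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    correct = sum(1 for y, q in zip(y_true, pred) if y == q)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return f1, correct / len(y_true)


# Made-up labels (1 = died within 90 days) and predicted probabilities.
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

f1, accuracy = f1_and_accuracy(y_true, probs)
print("AUC:", auc(y_true, probs))
print("KS:", ks_statistic(y_true, probs))
print("logloss:", log_loss(y_true, probs))
print("RMSE:", rmse(y_true, probs))
print("F1:", f1, "accuracy:", accuracy)
```

AUC and KS depend only on how the model ranks patients, whereas logloss and RMSE also penalize poorly calibrated probabilities, and F1 and accuracy depend on the chosen cut-off.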
Conclusions: Based on AUC, the proposed approach of removing extreme outliers, applying single imputation with the median and including all features in the ML models regardless of their level of missingness performed best in predicting mortality among patients with advanced lung cancer in this study. Further research may be needed to confirm this finding for other missing-data imputation methods, diseases and databases.