Actuarial Analyst Milliman, Inc. Windsor, United States
Background: Integrating healthcare claims with beneficiaries’ perspectives and their social determinants of health (SDoH) is increasingly important, but it is challenging when the data sources are disparate.
Objectives: To explore the Medicare Current Beneficiary Survey (MCBS) dataset and demonstrate the feasibility and value of survey data to predict a diagnosis of depression.
Methods: We created end-to-end predictive models using survey responses from Medicare beneficiaries linked to their Medicare healthcare claims data in the MCBS Limited Dataset for 2018 to predict if a member reports a depression diagnosis in 2019.
We performed the following data preparation steps: (1) categorical variables were dummy-encoded or one-hot encoded; (2) variables for which more than 50% of observations were null (primarily due to skip logic) were removed; (3) missing values for the remaining variables were imputed with the mean for quantitative variables and the median for categorical variables; and (4) variables that might contribute to data leakage or were directly correlated with a depression diagnosis in 2018, such as taking depression medication, were removed.
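The four preparation steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The column names (`age`, `region`, `skip_q`, `on_antidepressant`), the toy rows, and the `prepare` helper are hypothetical. The sketch assumes categorical answers are stored as numeric codes, and it uses the low median so that an imputed category is always an observed code. The order in which the steps are applied is also an assumption.

```python
from statistics import mean, median_low


def prepare(rows, numeric_cols, categorical_cols, leakage_cols, max_null_frac=0.5):
    """Apply the four preparation steps to a list of dicts (None = missing)."""
    n = len(rows)

    # Step (4): drop variables flagged as potential data leakage.
    candidates = [c for c in numeric_cols + categorical_cols if c not in leakage_cols]

    # Step (2): drop variables with more than max_null_frac missing values.
    kept = [
        c for c in candidates
        if sum(r.get(c) is None for r in rows) / n <= max_null_frac
    ]

    # Step (3): impute the mean for quantitative variables and the
    # (low) median for categorical codes.
    fill = {}
    for c in kept:
        observed = [r[c] for r in rows if r.get(c) is not None]
        fill[c] = mean(observed) if c in numeric_cols else median_low(observed)
    filled = [
        {c: (r.get(c) if r.get(c) is not None else fill[c]) for c in kept}
        for r in rows
    ]

    # Step (1): one-hot encode each remaining categorical variable.
    levels = {c: sorted({r[c] for r in filled}) for c in kept if c in categorical_cols}
    encoded = []
    for r in filled:
        out = {}
        for c in kept:
            if c in levels:
                for level in levels[c]:
                    out[f"{c}_{level}"] = 1 if r[c] == level else 0
            else:
                out[c] = r[c]
        encoded.append(out)
    return encoded


# Hypothetical toy data: four respondents, made up for illustration.
rows = [
    {"age": 70, "region": 1, "skip_q": None, "on_antidepressant": 1},
    {"age": None, "region": 2, "skip_q": None, "on_antidepressant": 0},
    {"age": 80, "region": None, "skip_q": 3, "on_antidepressant": 0},
    {"age": 75, "region": 2, "skip_q": None, "on_antidepressant": 1},
]
encoded = prepare(
    rows,
    numeric_cols=["age"],
    categorical_cols=["region", "skip_q", "on_antidepressant"],
    leakage_cols=["on_antidepressant"],
)
```

In this toy example, `skip_q` is dropped because three of its four values are null, `on_antidepressant` is dropped as a leakage variable, and `region` is expanded into one indicator column per observed code.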
We evaluated multiple types of machine learning models for binary classification, where a prediction of class 1 indicates a reported diagnosis of depression. The models included logistic regression, random forests, and gradient boosting. We assessed the efficacy of each model using AUC-ROC, accuracy, precision, recall, and F1 score.
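For readers less familiar with the evaluation metrics named above, the following sketch computes them from first principles in plain Python. It is illustrative only. The `classification_metrics` helper, the 0.5 decision threshold, and the toy labels and scores are assumptions, not part of the study. AUC-ROC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, precision, recall, F1, and AUC-ROC for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))

    # Confusion-matrix counts, with class 1 as the positive class.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # AUC-ROC via pairwise comparison of positive and negative scores.
    # Assumes both classes are present in y_true.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    auc = wins / (len(pos) * len(neg))

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": auc,
    }


# Hypothetical labels and predicted probabilities, made up for illustration.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]
metrics = classification_metrics(y_true, y_score)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason to report them together.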
Results: We found that the gradient boosting and random forest models had more predictive power overall than logistic regression. The most influential features in the models included beneficiary age, whether someone was told by their doctor to lose weight, and whether a person was diagnosed with certain non-depressive mental conditions or physical illnesses.
Conclusions: The MCBS is a vast and relatively untapped source of data with unique variables that can help to further understand drivers of healthcare and predict outcomes.
With data of this size and variety (2,000 variables, of which the majority are categorical), there are many alternative approaches to data selection, processing, feature engineering, and model tuning that could be applied. Approaching any of these differently would probably result in different predictive performance.
This analysis has been prepared for the specific purpose of exploring MCBS and demonstrating the possibilities of its use for predictive modeling. This information may not be appropriate, and should not be used, for any other purpose. Milliman does not intend to benefit or create a legal duty to any third party recipient of its work even if we permit the distribution of our work product to such third party.