(275) Scalable natural language processing of electronic health records to supplement large-scale covariate adjustment in pharmacoepidemiologic studies
Assistant Professor, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, United States
Background: Unstructured electronic health record (EHR) data remain underutilized for large-scale covariate adjustment in pharmacoepidemiologic studies. Natural language processing (NLP) tools can turn free-text EHR notes into data features. However, current applications are difficult to scale.
Objectives: To investigate the impact of supplementing claims data with unsupervised NLP-generated features for improved large-scale confounding control in 3 pharmacoepidemiologic studies.
Methods: We linked Medicare claims with EHR data to generate 3 cohort studies: Study 1) high- vs. low-dose proton pump inhibitors on the risk of gastrointestinal bleeding; Study 2) high- vs. low-intensity statins on the risk of major cardiovascular events; and Study 3) opioids vs. NSAIDs on the risk of renal failure. Based on domain knowledge, strong confounding was expected in crude analyses of each study, and adjustment was expected to attenuate the effect estimates toward the null. We used a 'bag-of-words' approach to generate features for the 20,000 most prevalent terms in free-text notes, to be contrasted with 71 researcher-specified variables and with empirical claims codes selected by Lasso regression from thousands of candidate codes. We estimated the propensity score using Lasso regression with different sets of candidate predictors: Set 1 (71 researcher-specified variables), Set 2 (Set 1 + Lasso-selected claims codes), and Set 3 (Set 2 + Lasso-selected NLP-generated features). We then estimated the hazard ratio (HR) after adjustment using matching weights.
Results: In Study 1, the unadjusted HR was 1.81 (1.54, 2.11). Adjustment using only researcher-specified variables resulted in an HR of 1.42 (1.19, 1.69). Adjusting for Lasso-selected claims codes in addition to researcher-specified variables further attenuated the estimated HR to 1.29 (1.07, 1.58). Supplementing Set 2 with additional adjustment for NLP-generated features from free-text notes made little difference, with an HR of 1.28 (1.05, 1.56). Similar trends, in terms of attenuation of estimated effects, were found across the other 2 cohort studies.
Conclusions: In 3 empirical studies, supplementing researcher-specified variables with large-scale covariate adjustment using Lasso-selected claims codes moved the estimated treatment effect in a direction more consistent with expectations based on domain knowledge. However, additional adjustment for NLP-generated features had no impact beyond adjustment for researcher-specified variables and Lasso-selected claims codes. Future research will address limitations of this work, including exploring more sophisticated NLP approaches for large-scale feature generation and large-scale covariate selection methods beyond Lasso regression.