Head of Biostatistics & Analytics, Genesis Research, LLC, United States
Background: Real-world evidence (RWE) research typically relies on commercially acquired real-world data sources in which missing data are common. The reasons for missingness can be unrelated to patient characteristics; for example, payer or provider preferences may influence what data are captured in an electronic health record. This type of missingness can be viewed as missing completely at random (MCAR).
Objectives: This research evaluated which commonly used imputation methods available in R perform best for linear regression when data are considered MCAR. Using simulation, we varied the percentage of missing data, the sample size, and the correlation among variables.
Methods: In a simulation study, six imputation methods available in R (MICE, Amelia, MissForest, Hmisc, mi, and Distribution Based Imputation [DBI]) were compared with complete case and listwise deletion. Data were simulated from a multivariable linear regression with one dependent variable and four independent variables. Correlations among variables were varied from 0.2 to 0.8. Root mean squared error, as a measure of overall accuracy, and percent bias were assessed.
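The simulation design described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study, and it covers only the listwise-deletion comparator, not the six imputation packages. The true coefficients, the equal pairwise correlation structure, the unit error variance, and the choice to impose missingness on the predictors only are all assumptions made for the example; the function names are hypothetical.

```python
import numpy as np

def simulate(n, rho, beta, rng):
    """Draw n rows: equicorrelated standard-normal predictors and a linear outcome."""
    p = len(beta) - 1
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)  # 1 on diagonal, rho elsewhere
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = beta[0] + X @ beta[1:] + rng.normal(0.0, 1.0, size=n)
    return X, y

def impose_mcar(X, frac, rng):
    """Set each predictor cell to NaN with probability frac, independent of the data."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < frac] = np.nan
    return X_miss

def fit_listwise(X, y):
    """Ordinary least squares on rows with no missing predictor (listwise deletion)."""
    keep = ~np.isnan(X).any(axis=1)
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    coef, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return coef

def percent_bias(estimates, truth):
    """100 * (mean estimate - truth) / truth, per coefficient."""
    return 100.0 * (estimates.mean(axis=0) - truth) / truth

def rmse(estimates, truth):
    """Root mean squared error across replicates, per coefficient."""
    return np.sqrt(((estimates - truth) ** 2).mean(axis=0))

def run(n=500, rho=0.2, frac=0.2, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 0.5, 0.5, 0.5, 0.5])  # assumed true intercept and slopes
    est = np.empty((reps, beta.size))
    for r in range(reps):
        X, y = simulate(n, rho, beta, rng)
        est[r] = fit_listwise(impose_mcar(X, frac, rng), y)
    return percent_bias(est, beta), rmse(est, beta)

if __name__ == "__main__":
    bias, err = run()
    print("percent bias:", np.round(bias, 2))
    print("RMSE:", np.round(err, 3))
```

Replacing fit_listwise with an impute-then-fit step, and looping over rho, frac, and n, would give the kind of grid the study describes.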
Results: The best imputation method depended on which estimand of the linear model the researcher is interested in. When correlation was low, the least bias was observed for Hmisc (0.6%) while the worst was DBI (34.1%); for high correlations, the least bias was Hmisc (4.4%) and the most biased was DBI (47.9%). When correlation was low, the least bias for β was observed for Amelia (1.7%), while the most bias was seen in MICE (36.1%); for high correlations, the least bias was Amelia (3.8%) and the worst was MICE (49.6%). Increasing amounts of missingness led to greater bias in estimates regardless of method.
Conclusions: When missingness is above 40% and correlation is low or moderate, imputation is not recommended. For high correlations or large sample sizes, imputation can be used when missingness is below 55%. When samples are large, the data are highly correlated, and missingness is below 15%, no imputation is needed. In RWE research, consideration should be given to the mechanism of missingness and the amount of missing data before choosing an imputation method. If missingness results from a random process (MCAR) rather than from observable characteristics (missing at random), the appropriate choice may differ.
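To make the distinction between the two mechanisms concrete, the sketch below generates one missingness indicator that ignores the data (MCAR) and one whose probability depends on a fully observed covariate (missing at random). It is an illustration only and was not part of the study; the logistic form, the slope of 1.5, and the function names are assumptions.

```python
import numpy as np

def mcar_mask(n, frac, rng):
    """MCAR: every record has the same probability of being missing."""
    return rng.random(n) < frac

def mar_mask(observed, frac, rng, slope=1.5):
    """MAR: missingness probability rises with a fully observed covariate.

    The logistic link and slope are assumed for illustration; the realized
    missing fraction is only roughly equal to frac.
    """
    z = (observed - observed.mean()) / observed.std()
    logit = np.log(frac / (1.0 - frac)) + slope * z
    prob = 1.0 / (1.0 + np.exp(-logit))
    return rng.random(observed.size) < prob

def mean_gap(observed, mask):
    """Difference in the covariate mean between missing and non-missing records."""
    return observed[mask].mean() - observed[~mask].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    age = rng.normal(50.0, 10.0, size=20000)  # hypothetical observed covariate
    print("MCAR gap:", mean_gap(age, mcar_mask(age.size, 0.3, rng)))
    print("MAR gap:", mean_gap(age, mar_mask(age, 0.3, rng)))
```

Under MCAR the covariate looks the same in records with and without missing values, whereas under missing at random the two groups differ, which is why a method that performs well under one mechanism may not under the other.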