Background: Missing covariate data is a pernicious problem in epidemiologic research, particularly in studies that rely on routinely collected healthcare data such as electronic health records. Single-arm trials with external comparators commonly use such data to create external control arms (ECAs), so they frequently face situations where covariate data are missing in the ECA but not in the trial arm. Past research on the validity of complete case analysis (CCA) for missing covariate data has focused on estimating the average treatment effect (ATE) in the whole population. When conducting a study with an ECA, however, researchers are often focused on estimating the average treatment effect in the trial participants (ATTrial). The conditions under which CCA is valid for the ATTrial are unknown.
Objectives: Evaluate the validity of CCA in a simulation where only ECA participants are missing covariate data.
Methods: We simulated 500 replicates of a 20,000-person study of the effect of a binary treatment X on a binary outcome Y, with 3 confounders (C1-C3) that were also modifiers of the risk difference (RD). We induced missingness in C1 only among those with X=0 (i.e., in the ECA) under 4 scenarios: A) completely at random (MCAR); B) randomly based on C2 and C3 (MAR); C) randomly based on C1 (MNAR); or D) randomly based on Y (MAR). Under each scenario, we estimated the ATE (i.e., the effect in the combined population of trial participants and ECA patients) using inverse probability of treatment weights and the ATTrial (i.e., the effect within the trial participants alone) using odds weights. We compared the average of the resulting estimates with the true ATE and ATTrial estimated without any missing data on C1.
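The design described above can be illustrated with a small exact calculation. The sketch below is not the simulation used in this study: it assumes binary confounders, hypothetical parameter values, and a saturated (stratum-specific) propensity score, and it works with expected cell proportions instead of 500 random replicates. All function names and coefficients are illustrative assumptions.

```python
from itertools import product

def p_treat(c1, c2, c3):
    """P(X=1 | C): probability of being a trial participant (hypothetical values)."""
    return 0.2 + 0.2 * c1 + 0.1 * c2 + 0.1 * c3

def p_outcome(x, c1, c2, c3):
    """P(Y=1 | X, C): the risk difference varies with C1-C3 (hypothetical values)."""
    baseline = 0.10 + 0.10 * c1 + 0.05 * c2 + 0.05 * c3
    rd = 0.10 + 0.10 * c1 + 0.05 * c2 + 0.05 * c3
    return baseline + x * rd

# P(C1 observed | X=0, C, Y) in each scenario (assumed values).
# Trial participants (X=1) are always complete.
SCENARIOS = {
    "none": lambda c1, c2, c3, y: 1.0,                    # no missing data
    "A": lambda c1, c2, c3, y: 0.5,                       # MCAR
    "B": lambda c1, c2, c3, y: 0.9 - 0.3 * c2 - 0.2 * c3, # depends on C2, C3
    "C": lambda c1, c2, c3, y: 0.8 - 0.4 * c1,            # depends on C1 itself
    "D": lambda c1, c2, c3, y: 0.4 if y == 1 else 0.8,    # depends on Y
}

def complete_case_cells(p_observed):
    """Expected population share of each (c, x, y) cell among complete cases."""
    cells = {}
    for c in product((0, 1), repeat=3):  # C1-C3 independent Bernoulli(0.5)
        for x in (0, 1):
            for y in (0, 1):
                px = p_treat(*c) if x == 1 else 1 - p_treat(*c)
                py = p_outcome(x, *c) if y == 1 else 1 - p_outcome(x, *c)
                observed = 1.0 if x == 1 else p_observed(*c, y)
                cells[(c, x, y)] = 0.125 * px * py * observed
    return cells

def weighted_rd(cells, estimand):
    """Weighted risk difference using a propensity score fit in complete cases."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for (c, x, y), mass in cells.items():
        treated = cells[(c, 1, 0)] + cells[(c, 1, 1)]
        untreated = cells[(c, 0, 0)] + cells[(c, 0, 1)]
        ps = treated / (treated + untreated)  # saturated complete-case PS
        if estimand == "ATE":
            # inverse probability of treatment weights
            w = 1 / ps if x == 1 else 1 / (1 - ps)
        else:
            # odds weights: trial participants get 1, ECA patients get the PS odds
            w = 1.0 if x == 1 else ps / (1 - ps)
        num[x] += w * mass * y
        den[x] += w * mass
    return num[1] / den[1] - num[0] / den[0]

results = {}
for name, p_observed in SCENARIOS.items():
    cells = complete_case_cells(p_observed)
    results[name] = {
        "ATE": weighted_rd(cells, "ATE"),
        "ATTrial": weighted_rd(cells, "ATTrial"),
    }

for name, est in results.items():
    print(f"{name:>4}: ATE RD = {est['ATE']:.4f}, ATTrial RD = {est['ATTrial']:.4f}")
```

Under these assumed parameters, the ATTrial values for scenarios A to C match the no-missingness value up to floating-point error, whereas the scenario D ATTrial and the scenario A ATE do not. The exact numbers depend entirely on the hypothetical coefficients and are not comparable to the results reported in this abstract.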
Results: The true ATE RD was 0.170 and the true ATTrial RD was 0.206. For the ATE, CCA was always biased (Scenario A RD: 0.175; B: 0.164; C: 0.163; D: 0.256). In contrast, for the ATTrial, CCA was unbiased except when Y was associated with missingness (Scenario D RD: 0.218). Even though the complete cases of the ECA were not a random sample of the ECA with respect to C1 in Scenarios B and C, the odds weights successfully standardized their distribution of C1 to that of the trial participants.
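One way to see why the odds weights can recover the ATTrial in complete cases is the following sketch. It assumes discrete confounders C = (C1, C2, C3), a correctly specified propensity score fit among complete cases, and an indicator R (notation introduced here, not in the study) equal to 1 when C1 is observed, with R = 1 for every trial participant.

```latex
\begin{align*}
w(c) &= \frac{\Pr(X=1 \mid C=c, R=1)}{\Pr(X=0 \mid C=c, R=1)}, \\
\mathrm{E}_w[Y \mid X=0, R=1] &= \sum_c \Pr(C=c \mid X=1)\, \mathrm{E}[Y \mid X=0, C=c, R=1], \\
\mathrm{E}[Y \mid X=0, C=c, R=1] &= \mathrm{E}[Y \mid X=0, C=c]
  \quad \text{if } R \perp Y \mid X=0,\, C.
\end{align*}
```

The second line holds because weighting the complete-case ECA by w(c) gives it the confounder distribution of the treated complete cases, which is that of all trial participants. The condition in the third line is plausible when missingness depends only on covariates (Scenarios A to C) and fails when it depends on Y (Scenario D).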
Conclusions: Complete case analysis can be a valid approach to missing covariate data in an ECA when estimating the ATTrial, even when the probability of missing covariate data depends on other covariates or directly on the value of the covariate itself. Researchers should be wary of casually extending conclusions about missing data or other types of information bias that were developed in the context of the ATE to the ATTrial.