Background: Inverse probability of treatment weighting using propensity scores (IPTW-PS) is a popular method to adjust for confounding when subjects have not been randomized to treatment. Results of IPTW-PS comparisons can be sensitive to extreme weights, which occur when treated subjects have a low propensity score or untreated subjects have a high propensity score. Truncating or “trimming” weights can yield a more stable estimate of the treatment effect, but may introduce bias. In the clinical trial literature, the Fragility Index is an established method for assessing robustness: it is the minimum number of subjects whose event status would need to change for a statistically significant result to become non-significant. Applying a similar approach in an IPTW-PS setting has the potential to identify non-robust estimates.
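To make concrete why a low propensity score in a treated subject (or a high one in an untreated subject) produces an extreme weight, the following minimal sketch computes weights as the inverse of the probability of the treatment actually received. The function names `iptw_weight` and `truncate_weights`, the example propensity scores, and the cap of 10 are illustrative assumptions and are not taken from the study.

```python
def iptw_weight(treated, ps):
    """Inverse probability of the treatment actually received.

    treated: 1 if the subject was treated, 0 otherwise.
    ps: estimated propensity score, P(treated = 1 | covariates).
    """
    if not 0.0 < ps < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    return 1.0 / ps if treated else 1.0 / (1.0 - ps)


def truncate_weights(weights, cap):
    """Cap each weight at `cap` (one simple form of weight truncation)."""
    return [min(w, cap) for w in weights]


# Hypothetical subjects: (treated, propensity score).
subjects = [(1, 0.50), (1, 0.02), (0, 0.50), (0, 0.98)]
weights = [iptw_weight(z, ps) for z, ps in subjects]
capped = truncate_weights(weights, cap=10.0)  # the cap value is arbitrary
```

Here the treated subject with a propensity score of 0.02 and the untreated subject with a score of 0.98 each receive a weight of about 50, against 2 for the subjects with a score of 0.50, so two subjects carry most of the total weight.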
Objectives: We propose several practical sensitivity analyses to evaluate the robustness of IPTW-PS estimates to the influence of the subjects with the largest estimated weights.
Methods: We simulated time-to-event data according to various specified hazard ratios (HR), patterns of censoring, and event rates, along with continuous prognostic covariates whose distributions differed between the treatment and control groups. The HR was estimated with a Cox proportional hazards model weighted by the IPTW-PS weights. We then conducted sequential sensitivity analyses on the ten subjects with the largest weights, proceeding from the largest weight to the smallest and re-estimating the HR at each step, under three alternatives: (1) the subject is dropped; (2) the subject is replaced with a copy of an existing subject with median weight; (3) the subject is censored at their event time (if an event occurred). We then characterized changes in the estimated HR relative to the true underlying treatment effect using graphics and summaries.
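The sequential procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study. The helper names `weighted_cox_hr` and `sequential_sensitivity` are hypothetical. The Cox fit is a bare-bones Newton-Raphson for a single binary treatment covariate with Breslow handling of ties. Weights are assumed to be held fixed rather than re-estimated after each step, and the replacement donor is assumed to be the subject whose weight is closest to the overall median, regardless of treatment arm.

```python
import math
from statistics import median


def weighted_cox_hr(time, event, treat, weight, iters=50, tol=1e-10):
    """HR for one binary covariate from a weighted Cox partial likelihood
    (Breslow ties), fitted by Newton-Raphson. Illustrative only."""
    beta = 0.0
    n = len(time)
    for _ in range(iters):
        score = info = 0.0
        for i in range(n):
            if not event[i]:
                continue
            # Weighted sums over the risk set at subject i's event time.
            s0 = s1 = 0.0
            for j in range(n):
                if time[j] >= time[i]:
                    r = weight[j] * math.exp(beta * treat[j])
                    s0 += r
                    s1 += r * treat[j]
            p = s1 / s0  # weighted share of treated subjects at risk
            score += weight[i] * (treat[i] - p)
            info += weight[i] * p * (1.0 - p)
        if info == 0.0:  # no information about beta; stop iterating
            break
        step = max(-5.0, min(5.0, score / info))  # clamp to avoid overflow
        beta += step
        if abs(step) < tol:
            break
    return math.exp(beta)


def sequential_sensitivity(time, event, treat, weight, k=10, mode="drop"):
    """Return [HR on full data, HR after step 1, ..., HR after step k].

    mode: "drop", "replace" (copy of a median-weight subject), or
    "censor" (event indicator set to 0 at the observed time).
    Assumes k is smaller than the number of subjects.
    """
    if mode not in ("drop", "replace", "censor"):
        raise ValueError("mode must be 'drop', 'replace' or 'censor'")
    rows = [list(r) for r in zip(time, event, treat, weight)]
    # Indices of the k largest weights, from largest to smallest.
    top = sorted(range(len(rows)), key=lambda i: rows[i][3], reverse=True)[:k]
    med = median(r[3] for r in rows)
    donor = list(min(rows, key=lambda r: abs(r[3] - med)))
    hrs = [weighted_cox_hr(*zip(*rows))]
    dropped = set()
    for idx in top:
        if mode == "drop":
            dropped.add(idx)
        elif mode == "replace":
            rows[idx] = list(donor)
        else:
            rows[idx][1] = 0  # censor at the observed time
        current = [r for i, r in enumerate(rows) if i not in dropped]
        hrs.append(weighted_cox_hr(*zip(*current)))
    return hrs
```

Calling `sequential_sensitivity(time, event, treat, weight, k=10, mode="replace")` returns eleven HR estimates that can be plotted against the step number and compared with the true HR used in a simulation. In practice an established weighted Cox implementation with robust variance estimation would normally replace the hand-written fit.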
Results: We demonstrate that HR estimates can be changed dramatically by a few observations with large weights. In our scenarios, removing or replacing one or two of the observations with the largest weights tends to yield estimated HRs that are much closer to the true HR. We find that HR estimates are more stable after removing or replacing the subjects with the five largest weights.
Conclusions: Observations with large IPTW-PS weights tend to destabilize estimates of the treatment effect. In practice, we recommend assessing the robustness of IPTW-PS results by sequential removal or replacement of the subjects with the largest weights. The three sensitivity analyses proposed here are practical to implement and can complement existing methods for handling extreme IPTW-PS weights, such as trimming.