Director, Data Strategy & Partnerships, Merck & Co., Inc., Boston, United States
Background: Synthetic data, generated using complex algorithms, is an emerging method for preserving patient privacy while retaining the statistical properties of real-world databases. While synthetic data holds great promise, its applicability to robust epidemiology research requires stringent validation and demonstration projects.
Objectives: To assess the validity of synthetic EHR data by comparing the estimates of COVID-19 vaccination effectiveness in synthetic vs. original datasets.
Methods: A published retrospective cohort study on the real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same database, and the results from the synthetic real-world data (RWD) were compared with those from the original RWD. The original cohort included 1.2 million members vaccinated with BNT162b2. The endpoints were COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection; they were also assessed in several subgroups defined by demographic and clinical characteristics. To compare synthetic and original data estimates, several metrics were used across analyses: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval (CI) overlap, and the Wald test. Synthetic data were generated 5 times to assess the stability of results.
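The comparison metrics named above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas used in the study, so the definitions below are assumptions based on common usage in the synthetic data literature: SMD for a binary characteristic, estimate agreement as the synthetic point estimate falling inside the original CI, decision agreement as both CIs leading to the same conclusion relative to a null value, and CI overlap as the average fraction of each interval covered by their intersection. All function names are hypothetical.

```python
import math


def smd_proportion(p_orig, p_synth):
    """Absolute standardized mean difference for a binary characteristic.

    Assumed formula: |p1 - p2| / sqrt((p1(1-p1) + p2(1-p2)) / 2).
    """
    pooled_var = (p_orig * (1 - p_orig) + p_synth * (1 - p_synth)) / 2
    if pooled_var == 0:
        return 0.0
    return abs(p_orig - p_synth) / math.sqrt(pooled_var)


def ci_overlap(ci_a, ci_b):
    """Average share of each interval covered by the intersection (0 to 1)."""
    lo = max(ci_a[0], ci_b[0])
    hi = min(ci_a[1], ci_b[1])
    if hi <= lo:
        return 0.0  # intervals do not overlap
    width = hi - lo
    return 0.5 * (width / (ci_a[1] - ci_a[0]) + width / (ci_b[1] - ci_b[0]))


def estimate_agreement(synth_estimate, orig_ci):
    """True if the synthetic point estimate lies within the original CI."""
    return orig_ci[0] <= synth_estimate <= orig_ci[1]


def decision_agreement(orig_ci, synth_ci, null_value=1.0):
    """True if both CIs give the same decision relative to the null value.

    For ratio measures (hazard ratio, odds ratio) the null value is 1.
    """
    def decision(ci):
        if ci[1] < null_value:
            return -1  # significant, below the null
        if ci[0] > null_value:
            return 1   # significant, above the null
        return 0       # not significant

    return decision(orig_ci) == decision(synth_ci)
```

Under these assumed definitions, each metric would be computed once per synthetic replicate and subgroup, and the reported percentages would be the share of replicates in which agreement holds.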
Results: The size of the synthetic cohort ranged from 1,178,081 to 1,178,085 across the five replicates, compared with 1,178,597 in the original data. The distribution of demographic characteristics in the synthetic data closely resembled that of the original data, with all SMDs < 0.01, indicating a very small difference. Comparing the replicates to the original data, the hazard ratio estimates for hospitalization had 100% estimate agreement in all groups, 100% decision agreement in 5 of 9 risk subgroups, and at least 70% CI overlap. The odds ratio estimates for COVID-19 infection showed 100% estimate agreement in all but 3 subgroups, 100% decision agreement in all but 2 subgroups, and varying levels of CI overlap (37% to 99.7%). Vaccine effectiveness estimates for COVID-19 infection and symptomatic COVID-19 infection showed 100% estimate and decision agreement across all risk subgroups, with at least 82.7% CI overlap for COVID-19 infection and at least 66% CI overlap for symptomatic COVID-19 infection.
Conclusions: Our analysis demonstrated good validity and stability of synthetic data for conducting a study of moderate complexity, and its potential as a reliable alternative to inaccessible EHR data. While using synthetic data may lead to the same conclusions as the original data, there are important considerations for sampling strategy and interpretation of results.