Senior Principal Data Scientist F. Hoffmann-La Roche Ltd, Switzerland
Background: Synthetic real-world data (sRWD) may facilitate data sharing and thus accelerate collaboration across institutions.
Objectives: To generate sRWD from existing RWD and to describe the distributions of baseline characteristics, and the measures of association between those characteristics and real-world progression-free survival (rwPFS), in the source and sRWD datasets.
Methods: We conducted a retrospective cohort study using the nationwide US electronic health record-derived de-identified Flatiron Health database. The study population included patients (pts) diagnosed with metastatic breast cancer (MBC) from Jan 1, 2011 through Feb 30, 2021 who initiated any antineoplastic treatment. More than one hundred baseline characteristics, as well as rwPFS, were derived from the original real-world dataset and stored as the source cohort (SC). Three synthetic datasets were generated from the SC using conditional generative adversarial networks (CTG) with different levels of privacy control (medium: CTGm; low: CTGl) or classification and regression trees (CART). In each dataset, hazard ratios (HRs) with their 95% CIs measured the association between baseline characteristics and rwPFS. Distributions of baseline characteristics, the ratio of HRs (rHR), and the proportion of overlap in HR confidence intervals (POCI) were used to describe the quality of the synthetic datasets. POCI was defined for each HR as the width of the overlap between the 95% CI from the sRWD and the corresponding 95% CI in the SC, divided by the average width of the two 95% CIs, multiplied by 100.
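As an illustration only, the POCI definition above can be sketched in a few lines of Python. The function names (`poci`, `ratio_of_hr`) are hypothetical and not taken from the study. The sketch assumes that CI widths are measured on the natural HR scale rather than the log scale, that non-overlapping intervals score 0, and that rHR is the synthetic HR divided by the source HR; the abstract does not state any of these choices.

```python
def poci(ci_synthetic, ci_source):
    """Percentage overlap of two confidence intervals.

    Each interval is a (lower, upper) tuple. Assumption: widths are taken
    on the natural HR scale; the abstract does not specify the scale.
    """
    lo_a, hi_a = ci_synthetic
    lo_b, hi_b = ci_source
    # Width of the shared region; zero when the intervals do not overlap.
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    # Average width of the two intervals.
    mean_width = ((hi_a - lo_a) + (hi_b - lo_b)) / 2
    return 100.0 * overlap / mean_width


def ratio_of_hr(hr_synthetic, hr_source):
    """Ratio of hazard ratios (assumed direction: synthetic / source)."""
    return hr_synthetic / hr_source


# Made-up intervals for illustration, not study results.
print(poci((0.5, 1.5), (1.0, 2.0)))
print(ratio_of_hr(1.5, 1.0))
```

Under this reading, identical intervals give a POCI of 100 and disjoint intervals give 0, so values near 100 indicate that the synthetic cohort reproduces both the location and the precision of the source estimate.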
Results: A total of 9,770 pts with MBC were included in the SC, and as many synthetic pts were generated in each sRWD cohort. The mean (SD) age was 62.6 (16.6), 62.3 (12.4), 61.3 (14.3) and 62.2 (12.5) years in the SC, CTGm, CTGl and CART cohorts, respectively. The other baseline characteristics were also largely replicated. For CTGm, CTGl and CART, respectively, the median rHR [IQR] was 1.0 [0.2], 1.1 [0.3] and 1.0 [0.1]; the proportion of HRs with POCI >0% was 62.9%, 62.5% and 98.0%; and the mean (SD) POCI was 33.9 (33.5), 37.3 (36.2) and 62.8 (25.5).
Conclusions: The statistical relationships between baseline characteristics and outcomes in the synthetic cohorts appeared similar to those observed in the SC. CART appeared to outperform CTG, possibly at the cost of lower privacy. Despite the numerous applications of sRWD for facilitating collaboration, using sRWD remains challenging in practice owing to limited guidance on how to demonstrate adequate privacy protection and because sRWD is seldom mentioned in data use agreements.