Director, RWE Rare Disease Scientist Pfizer, United States
Background: Tokenization of health data is a relatively new process for linking real-world data (RWD) from disparate sources; it allows patient engagement to be tracked across the healthcare system without breaching patient privacy.1 However, few published studies describe its accuracy and effectiveness in analyzing specified populations.
Objectives: We assess the accuracy of Datavant’s tokenization process to match patients in a well-characterized subset of patients with hemophilia B (PwHB) between a medical record (MR) dataset and a large claims database.
Methods: We linked 93 PwHB in PicnicHealth’s (PH) Hemophilia B (HB) RWD Cohort to the Komodo Health (KH) claims database, the largest aggregate claims dataset in the US, using data from January 2015 to April 2022, to assess the potential accuracy of tokenization. Both datasets contain tokens created via Datavant from first name, last name (matched phonetically via the Soundex algorithm), date of birth (DOB) and gender. We evaluated the potential accuracy of each match by determining whether the matched patient in KH appeared to be a true PwHB (i.e., had hemophilia-related medication, labs routinely used in the management of PwHB, or an HB diagnosis) or had ZIP codes for sites of healthcare encounters similar to those in PH.
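To illustrate how a phonetic token of this kind might be built, the following is a minimal sketch only. It assumes standard American Soundex applied to both names, and it uses a salted SHA-256 hash as a stand-in for the vendor's proprietary tokenization, which is not described here. The names `soundex`, `make_token`, `salt` and `first_initial_only` are hypothetical and chosen for illustration.

```python
import hashlib

# Standard American Soundex digit groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name (empty string if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return ("".join(out) + "000")[:4]


def make_token(first, last, dob, gender, salt="demo-salt", first_initial_only=False):
    """Hypothetical token: a salted hash of normalized identifiers (assumption)."""
    if first_initial_only:
        # Assumed shape of the non-phonetic variation: first initial + exact last name.
        parts = [first.strip().upper()[:1], last.strip().upper()]
    else:
        parts = [soundex(first), soundex(last)]
    parts += [dob, gender.strip().upper()]
    return hashlib.sha256((salt + "|" + "|".join(parts)).encode("utf-8")).hexdigest()


# Spelling variants that sound alike share a phonetic token but not an exact-name token.
same_phonetic = make_token("Jon", "Smith", "1980-01-02", "M") == make_token("John", "Smyth", "1980-01-02", "M")
same_exact = make_token("Jon", "Smith", "1980-01-02", "M", first_initial_only=True) == make_token(
    "John", "Smyth", "1980-01-02", "M", first_initial_only=True
)
print(same_phonetic, same_exact)
```

Under these assumptions, a phonetic token tolerates spelling variation in names, while an exact-name token does not, so a single person can end up with more than one token when their name is recorded differently across sources.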
Results: After one data refresh due to technical errors, PH produced tokens for 93 patients, which mapped to 114 tokens in KH. Fourteen patients had duplicate tokens in KH: 10 patients had two tokens, three patients had three tokens and one patient had six tokens. PH did not produce duplicate tokens. Initially, 97% (91 patients) matched; after the refresh and review of PH’s tokens, all 93 patients (100%) appeared to match a KH token.
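The token counts above can be reconciled with a short calculation. This is only an arithmetic check on the reported figures; the variable names are illustrative.

```python
ph_patients = 93

# Reported duplicates: KH tokens per patient -> number of patients.
duplicates = {2: 10, 3: 3, 6: 1}

patients_with_duplicates = sum(duplicates.values())
tokens_from_duplicates = sum(tokens * n for tokens, n in duplicates.items())

# Every remaining patient contributes exactly one KH token.
patients_with_one_token = ph_patients - patients_with_duplicates
total_kh_tokens = patients_with_one_token + tokens_from_duplicates

initial_match_rate = 91 / ph_patients

print(patients_with_duplicates)      # 14
print(total_kh_tokens)               # 114
print(f"{initial_match_rate:.1%}")   # 97.8%
```

The 14 patients with duplicates account for 35 KH tokens, and the other 79 patients account for one each, giving the reported 114. The initial match of 91 of 93 patients works out to about 97.8%.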
In the KH dataset, seven of the 93 patients (7.5%) had no hemophilia-related medication, labs of interest or HB diagnosis, suggesting a potential mismatch, an ill-defined token or a lack of information in KH. Additionally, two patients (2.2%) did not have a matching ZIP code in KH, which further suggests a potential token mismatch. No patients in KH or PH were missing ZIP codes.
Conclusions: Tokenization appears to be a useful tool for linking disparate datasets. A better understanding of how tokens are created, to avoid technical errors and increase matching accuracy, together with more complete PII, could enhance the quality of linkage. Duplicate tokens suggest that variations in the spelling or pronunciation of first and last names may affect the ability to determine the accuracy of matches. Furthermore, KH uses only two token variations, the additional variation being first initial, last name, DOB and gender, which can limit linkage accuracy.