(241) Unlocking the patient journey: comparing machine learning (ML) based natural language processing (NLP) and expert abstraction in understanding treatment patterns
Quantitative Scientist Flatiron Health, Inc New York, United States
Background: Understanding the utilization and effectiveness of oral anticancer therapies at scale in the electronic health record (EHR) is critical for identifying unmet needs and informing drug development. While clinical experts can manually abstract this information from unstructured data, the process is slow and resource-intensive. We explored the impact of data curation method [expert-abstraction vs. ML-based NLP (ML-extraction)] on the ability to measure patient characteristics and real-world treatment patterns.
Objectives: To determine whether using ML-extraction in place of expert-abstraction leads to similar patient cohorts and therapy utilization patterns in patients with advanced non-small cell lung cancer (aNSCLC) receiving first-line (1L) osimertinib or alectinib.
Methods: Using a sample of 186,313 patients with a lung cancer ICD code from Flatiron Health’s nationwide (US-based) EHR-derived de-identified database, we extracted diagnosis dates, clinical characteristics (e.g., group stage), ALK/EGFR/PD-L1 status and oral drug usage from text documents using models trained on expert-abstracted data. Two patient populations with an aNSCLC diagnosis (2011–2022) were selected: an osimertinib cohort, with an EGFR mutation and 1L osimertinib treatment, and an alectinib cohort, with an ALK rearrangement and 1L alectinib treatment. Comparison groups were defined using either: 1) expert-abstracted variables or 2) ML-extracted variables. Patient characteristics and real-world treatment patterns were compared between data curation methods using absolute standardized mean differences (aSMD).
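The aSMD comparison described above can be sketched as follows. This is a minimal illustration, assuming the commonly used definitions (difference in means or proportions divided by the square root of the averaged group variances); the abstract does not state which formula was applied, and the function names `asmd_continuous` and `asmd_binary` are hypothetical.

```python
import math
from statistics import mean, variance


def asmd_continuous(group_a, group_b):
    """Absolute standardized mean difference for a continuous variable.

    Assumed definition: |mean_a - mean_b| / sqrt((var_a + var_b) / 2),
    using sample variances. Each group needs at least two values.
    """
    diff = abs(mean(group_a) - mean(group_b))
    pooled_sd = math.sqrt((variance(group_a) + variance(group_b)) / 2)
    if pooled_sd == 0:
        # No spread in either group: identical means give 0, otherwise undefined.
        return 0.0 if diff == 0 else math.inf
    return diff / pooled_sd


def asmd_binary(prop_a, prop_b):
    """Absolute standardized mean difference for a binary variable.

    Assumed definition: |p_a - p_b| / sqrt((p_a(1 - p_a) + p_b(1 - p_b)) / 2),
    where p_a and p_b are the proportions with the characteristic.
    """
    diff = abs(prop_a - prop_b)
    pooled_sd = math.sqrt((prop_a * (1 - prop_a) + prop_b * (1 - prop_b)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else math.inf
    return diff / pooled_sd


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    ages_abstracted = [61, 66, 70, 58, 73]
    ages_extracted = [62, 65, 71, 59, 72]
    print(f"age aSMD: {asmd_continuous(ages_abstracted, ages_extracted):.3f}")
    print(f"binary aSMD: {asmd_binary(0.50, 0.52):.3f}")
```

Under this definition, a characteristic would be computed once per cohort (expert-abstracted vs. ML-extracted) and the resulting aSMD compared against a chosen balance threshold such as 0.1 or 0.2.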
Results: The abstracted (osimertinib, n=1734; alectinib, n=306) and ML-extracted (osimertinib, n=1672; alectinib, n=296) cohorts had an aSMD < 0.1 for gender, race/ethnicity, age, practice type, smoking status, socioeconomic status, ECOG Performance Status, PD-L1 status, and year of advanced diagnosis, and an aSMD < 0.2 for group stage and follow-up time from advanced diagnosis to 1L start in the alectinib cohort. Subsequent lines of therapy had an aSMD = 0.22 for 2L osimertinib and an aSMD < 0.2 for 2L alectinib and for 3L in both cohorts. The most common therapies following 1L in both cohorts were subsequent EGFR or ALK inhibitors, respectively, followed by PD-L1 therapies and platinum-based chemotherapies.
Conclusions: Deducing lines of therapy from unstructured text in the EHR relies on correctly combining multiple variables, such as oral drug identity, start and end dates, and advanced diagnosis date. Variables extracted by ML models trained on expert-abstracted oncology data can yield downstream analysis results similar to those obtained with abstracted variables, unlocking the ability to understand drug utilization and patient treatment profiles at scale.