Associate Principal Scientist, Merck & Co., Inc., United States
Background: The American Association for Cancer Research (AACR) Project Genomics Evidence Neoplasia Information Exchange (GENIE) Biopharmaceutical Collaborative (BPC), a 5-year collaboration of biopharmaceutical companies and participating cancer centers of the AACR Project GENIE, aims to advance precision oncology and enhance clinical decision-making by obtaining in-depth clinical and genomic data from 50,000 de-identified cancer patients in North America and Europe. In Phase I, the project generated datasets on 6 selected cancers, including non-small cell lung cancer (NSCLC).
Objectives: To conduct a general feasibility assessment of the AACR GENIE BPC NSCLC dataset for typical clinical and epidemiological analyses.
Methods: The AACR GENIE BPC NSCLC v1.1-consortium dataset was generated using the PRISSMM data model. For clinical and pathological data, we evaluated the availability and completeness of data elements. First, we compiled a list of data elements relevant to clinical and epidemiological analyses of NSCLC. Second, we determined whether these data elements were directly available as raw variables or could be derived. Last, we assessed the completeness of available data elements by calculating the percentage of missingness at the patient level. For genomic profiling data, we described the datasets and generated high-level summaries for individual biomarkers.
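The patient-level missingness calculation described above can be illustrated with a minimal sketch. It is not the authors' code: the field names (`patient_id`, `stage`, `pdl1_result`), the placeholder strings treated as missing, and the rule that a patient counts as missing only when none of their rows carries a usable value are all assumptions made for illustration, and the toy records are invented.

```python
from collections import defaultdict

# Hypothetical placeholders treated as missing; the real dataset's coding may differ.
MISSING_TOKENS = {"", "unknown", "not collected", "na"}


def is_missing(value):
    """Return True if a value is None or a recognised missing placeholder."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def patient_level_missingness(rows, elements, id_field="patient_id"):
    """Percentage of patients with no usable value for each data element.

    Assumption: a patient counts as missing for an element only if every
    one of their rows lacks a usable value for it.
    """
    patients = set()
    observed = defaultdict(set)  # element -> patient IDs with at least one value
    for row in rows:
        pid = row[id_field]
        patients.add(pid)
        for element in elements:
            if not is_missing(row.get(element)):
                observed[element].add(pid)
    if not patients:
        return {element: 0.0 for element in elements}
    total = len(patients)
    return {
        element: round(100.0 * (total - len(observed[element])) / total, 1)
        for element in elements
    }


# Toy records, invented for illustration (not taken from the dataset).
rows = [
    {"patient_id": "P1", "stage": "IV", "pdl1_result": None},
    {"patient_id": "P1", "stage": "", "pdl1_result": "Unknown"},
    {"patient_id": "P2", "stage": "II", "pdl1_result": "positive"},
    {"patient_id": "P3", "stage": None, "pdl1_result": ""},
    {"patient_id": "P4", "stage": "I", "pdl1_result": "negative"},
]
result = patient_level_missingness(rows, ["stage", "pdl1_result"])
print(result)
```

Counting at the patient level rather than the row level matters when a patient contributes several records: a blank in one record should not inflate missingness if another record for the same patient supplies the value.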
Results: A total of 1875 patients with NSCLC, primarily diagnosed between 2007 and 2017, were included in the AACR GENIE BPC dataset. Most key baseline demographics and patient and disease characteristics were available with minimal missingness, except for notable missingness in PD-L1 testing (~48%) due to the early diagnosis period of the cohort. Data on patient follow-up (e.g., vital status, dates of death and recurrence/progression), systemic treatment (e.g., date of initiation, line of therapy, regimen), and clinical outcomes (e.g., overall survival, progression-free survival) were also available with a high degree of completeness. Information on ECOG performance score, comorbidities, surgery, and radiotherapy was not available. Genomic profiling data comprised genetic mutation, gene copy number, and gene fusion/rearrangement datasets, each with a different data structure. A total of 2033 tumor samples were tested, of which 1226 (60.3%) were primary tumors and 788 (38.8%) were metastatic lesions.
Conclusions: The AACR GENIE BPC dataset provides robust clinico-genomic data that allow in-depth analyses of oncogenic drivers and clinical outcomes in NSCLC. However, considerable data cleaning and manipulation are needed to generate analysis-ready datasets.