Background: Identifying study cohorts in observational healthcare databases can be challenging, especially when the diseases of interest are clinically complex and rare. Pulmonary arterial hypertension [PAH] is a rare, etiologically diverse subgroup of pulmonary hypertension [PH]. Case validation is needed to determine which combinations of administrative medical codes ('algorithms') identify PAH patients most accurately.
Objectives: To demonstrate how case validation can be done through linkage of a coded healthcare database with disease-specific clinical data, focusing on a use case in PAH.
Methods: We linked patient-level Electronic Health Record [EHR] data from Stanford University Medical Center with clinical records from the Vera Moulton Wall Center PH database. We applied over a dozen algorithms consisting of combinations of disease diagnosis, procedure, and/or drug codes to identify PAH patient cohorts in the EHR database. Using the clinical PH database as the reference, we assessed each patient’s cohort assignment against their true diagnosis: PAH versus other types of PH ('non-PAH').
Results: Linked EHR and clinical PH database records of 720 PH patients (558 true PAH and 162 true non-PAH patients) were used for case validation. As algorithm complexity increased sequentially from a single diagnosis code, sensitivity decreased (from 100% to 7%) and negative predictive value [NPV] decreased (from 70% to 0%), whereas specificity and positive predictive value [PPV] increased (from 0% to 100% and from 78% to 100%, respectively). Using diagnosis codes alone over-classified PH patients as PAH patients (up to 162 false positives), while combinations of diagnosis, procedure, treatment, and exclusionary codes correctly classified all non-PAH patients but missed a considerable number of true PAH patients (up to 535 false negatives). Algorithms that balanced sensitivity, specificity, PPV, and NPV included diagnosis, procedure, and drug codes, but no exclusionary codes. The characteristics of the PAH patient cohorts identified by the different algorithms were largely similar, with only minor variations.
Conclusions: Linkage of EHR data with clinical data allows precise patient-level case validation. Generalizability of results, however, depends on representativeness of the databases and coding practices. Our case study in PAH highlights a trade-off between different performance metrics, emphasizing the importance of aligning algorithm choice with the research objective.
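The patient-level case validation described in Methods and Results can be illustrated with a minimal sketch. It assumes each algorithm yields a set of flagged patient IDs and that the clinical database supplies the set of true PAH patients. The function names, the integer IDs, and the choice to return None for a metric with a zero denominator are assumptions made for illustration, not the authors' implementation. The example reproduces only the simplest case reported above, in which a single diagnosis code flags all 720 linked PH patients.

```python
from typing import Dict, Optional, Set


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the metric is undefined."""
    return numerator / denominator if denominator else None


def validate_algorithm(
    flagged: Set[int], true_pah: Set[int], linked: Set[int]
) -> Dict[str, Optional[float]]:
    """Compare one algorithm's PAH cohort with the clinical reference diagnosis.

    flagged  -- patient IDs the algorithm assigns to the PAH cohort (hypothetical input)
    true_pah -- patient IDs with a true PAH diagnosis in the clinical database
    linked   -- all linked PH patients (PAH and non-PAH) eligible for validation
    """
    # Restrict both sets to linked patients so every count refers to the same population.
    flagged = flagged & linked
    true_pah = true_pah & linked

    tp = len(flagged & true_pah)           # true PAH, flagged as PAH
    fp = len(flagged - true_pah)           # non-PAH, flagged as PAH
    fn = len(true_pah - flagged)           # true PAH, not flagged
    tn = len(linked - flagged - true_pah)  # non-PAH, not flagged

    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "tn": tn,
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
    }


# Illustrative IDs only: 720 linked PH patients, the first 558 being true PAH.
linked = set(range(720))
true_pah = set(range(558))

# Simplest case: a single diagnosis code that flags every linked patient as PAH.
flagged_all = set(linked)
result = validate_algorithm(flagged_all, true_pah, linked)

for name, value in result.items():
    print(name, value)
```

In this case the sketch gives 558 true positives and 162 false positives, hence a sensitivity of 1.0, a specificity of 0.0, and a PPV of 558/720 = 0.775, in line with the 100%, 0%, and 78% reported for the single-code algorithm. Because no patient is classified as non-PAH, the NPV has a zero denominator and the sketch returns None.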