Background: Patients may interact differently with the healthcare system depending on their race and ethnicity and the type of disease they experience. Thus, the degree to which the distribution of race and ethnicity in real-world data is representative of the U.S. population may vary by disease.
Objectives: To evaluate the representativeness, with respect to race and ethnicity, of patients aged ≥ 18 years with acute, chronic, and cancer-related diseases in three real-world datasets.
Methods: This descriptive study used data from two electronic medical record (EMR) databases (TriNetX and IQVIA ambulatory EMR) and the National Health Interview Survey (NHIS). Acute (COVID-19, myocardial infarction, stroke), chronic (asthma, COPD, type 2 diabetes, hypertension), and cancer (colon, lung, ovarian, pancreatic, prostate) diseases were selected for comparison based on availability in all three datasets. Patients with each disease in 2021 were identified using ICD-10 codes (TriNetX and IQVIA EMR) or disease name (NHIS) and classified into mutually exclusive race and ethnicity groups (non-Hispanic white, non-Hispanic black, non-Hispanic Asian, Hispanic, and non-Hispanic other). Using chi-square tests, the distribution of race and ethnicity for each disease was compared with the 2021 U.S. distribution derived from the 2020 census. The difference between the proportion of patients in each race and ethnicity group and the corresponding proportion of the U.S. population was estimated for each disease (e.g., asthma) and then averaged across disease type (e.g., chronic) and dataset. Analyses using NHIS data incorporated the sampling weights.
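As an illustration only, the comparison described in the Methods could be sketched as follows. This is a minimal sketch under stated assumptions, not the study's actual code: the patient counts are hypothetical, the function names are invented for this example, and NHIS sampling weights are not handled. The census percentages are the ones reported in the Results; because they do not sum to exactly 100 as reported, the sketch rescales them to proportions summing to 1, which is an assumption of this example. The p-value uses the closed-form chi-square survival function that holds for even degrees of freedom (five groups give 4 degrees of freedom).

```python
import math

# Census reference percentages as reported in the Results (U.S. adults, 2021).
CENSUS_PCT = {
    "white": 59.3,
    "black": 13.6,
    "asian": 6.1,
    "hispanic": 18.9,
    "other": 4.5,
}


def normalize(pcts):
    """Rescale percentages to proportions summing to 1 (assumption of this sketch)."""
    total = sum(pcts.values())
    return {group: value / total for group, value in pcts.items()}


def chi_square_gof(observed, expected_props):
    """Chi-square goodness-of-fit statistic and p-value for observed counts.

    The closed-form survival function below is valid only for even
    degrees of freedom (five groups give df = 4).
    """
    n = sum(observed.values())
    stat = sum(
        (observed[g] - n * expected_props[g]) ** 2 / (n * expected_props[g])
        for g in expected_props
    )
    df = len(expected_props) - 1
    if df % 2 != 0:
        raise ValueError("closed-form p-value requires even degrees of freedom")
    half = stat / 2.0
    p_value = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )
    return stat, p_value


def differences_in_points(observed, expected_props):
    """Observed minus census share for each group, in percentage points."""
    n = sum(observed.values())
    return {g: 100.0 * (observed[g] / n - expected_props[g]) for g in expected_props}


def average_differences(diffs_by_disease):
    """Unweighted mean of per-disease differences within one disease type."""
    groups = list(next(iter(diffs_by_disease.values())))
    k = len(diffs_by_disease)
    return {g: sum(d[g] for d in diffs_by_disease.values()) / k for g in groups}


# Hypothetical patient counts for two chronic diseases in one dataset.
hypothetical_counts = {
    "asthma": {"white": 7000, "black": 1500, "asian": 300, "hispanic": 900, "other": 300},
    "COPD": {"white": 7800, "black": 1200, "asian": 150, "hispanic": 550, "other": 300},
}

census = normalize(CENSUS_PCT)
diffs = {}
for disease, counts in hypothetical_counts.items():
    stat, p = chi_square_gof(counts, census)
    diffs[disease] = differences_in_points(counts, census)
    print(f"{disease}: chi2 = {stat:.1f}, p = {p:.3g}")

chronic_avg = average_differences(diffs)
print({g: round(v, 1) for g, v in chronic_avg.items()})
```

A positive value in `chronic_avg` indicates overrepresentation relative to the census share and a negative value indicates underrepresentation. Because both the observed and reference proportions sum to 1, the differences across groups sum to zero.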
Results: In 2021, the U.S. adult population was 59.3% white, 13.6% black, 6.1% Asian, 18.9% Hispanic, and 4.5% other race and ethnicity. NHIS data were more closely aligned with the U.S. population than the EMR data, but the distribution of race and ethnicity differed significantly from the U.S. population for all datasets and diseases (all p < 0.0001). White patients were overrepresented for all diseases, with the greatest overrepresentation for cancer (17.9 percentage points above census). Hispanic patients were underrepresented across all diseases, with the greatest underrepresentation for cancer (12.8 percentage points below census). Black patients were overrepresented for chronic (2.9 percentage points) and acute (2.0 percentage points) diseases but underrepresented for cancer (-1.2 percentage points).
Conclusions: The extent to which the distribution of race and ethnicity in real-world data is representative of the U.S. population varies by dataset and disease. This may reflect differences in data collection, in disease prevalence due to factors such as biology, or in care-seeking behaviors influenced by socioeconomic status that affect diagnosis and treatment pathways.