Senior Client Partner, RWE Cerner Enviza, an Oracle company, Norwalk, United States
Background: Missing or uninterpretable data affect the quality and usability of Electronic Health Records (EHRs). For social determinants of health (SDOH), incomplete data can introduce bias into study interpretation and impact vulnerable populations.
Objectives: To examine missingness of patient-level SDOH according to healthcare system-level characteristics in the US Cerner Real-World Data (CRWD).
Methods: Data were examined for patients in the US CRWD, a cloud-based, de-identified, and Health Insurance Portability and Accountability Act-compliant dataset (extract 10/2022). Patient-level data for year of birth (YOB), gender, race, ethnicity, state (residence), marital status, and language (spoken) were examined. Null values, responses of refused/declined to answer, and uninterpretable data were considered uninformative (herein missing). Healthcare system-level characteristics included size (number of beds), segment, specialty, and region (1-digit zip code). Descriptive statistics were used to examine data quality of the CRWD in the past 5 years (53.2 million patients; 136 healthcare systems) and overall (103.1 million patients; 136 healthcare systems). This study received IRB exemption status.
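As an illustration of the missingness definition above, the following is a minimal sketch (not the study's actual code) of how null, refused/declined, and uninterpretable values might be flagged as missing and summarized per healthcare system characteristic. The `UNINFORMATIVE` value list, the field names, and the helper names `is_missing` and `percent_missing` are all hypothetical; the CRWD schema and coding rules are not described in this abstract.

```python
from collections import defaultdict

# Hypothetical set of response strings treated as uninformative;
# the actual CRWD coding rules are not given in the abstract.
UNINFORMATIVE = {"", "unknown", "refused", "declined to answer", "not reported"}


def is_missing(value):
    """Return True for nulls and for refused/uninterpretable responses."""
    if value is None:
        return True
    return str(value).strip().lower() in UNINFORMATIVE


def percent_missing(records, field, group_by=None):
    """Percent of records whose `field` is uninformative.

    If `group_by` is given (e.g. a bed-size category), one percentage is
    returned per group; otherwise a single value under the key "all".
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        key = rec.get(group_by) if group_by else "all"
        totals[key] += 1
        if is_missing(rec.get(field)):
            missing[key] += 1
    return {key: 100.0 * missing[key] / totals[key] for key in totals}


# Toy, made-up records for demonstration only (not CRWD data).
toy_records = [
    {"race": "White", "bed_size": "<100"},
    {"race": None, "bed_size": "<100"},
    {"race": "Refused", "bed_size": "500-999"},
    {"race": "Asian", "bed_size": "500-999"},
]
print(percent_missing(toy_records, "race", group_by="bed_size"))
```

Grouping by a system-level field in this way mirrors the descriptive comparison by size, segment, specialty, and region; at the scale of the CRWD, a dataframe or SQL aggregation would likely be used instead.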
Results: In the past 5 years, patient-level data were missing for YOB (0.17%), gender (0.16%), race (15.35%), ethnicity (23.26%), state (11.60%), marital status (17.00%), and language (11.97%). Missingness varied by healthcare system characteristic. For example, healthcare systems with < 100 beds had substantially less missing data than those with 500-999 beds for YOB, gender, and language (0.03% vs 0.39%; 0.06% vs 0.24%; 3.91% vs 14.46%, respectively). By segment, for example, missing data for race ranged from 7.08% for critical access to 22.90% for children’s healthcare systems. By specialty, medical practices had consistently more missing data for race, ethnicity, and marital status than critical access hospitals, acute/short-term hospitals, or IDN/regional or state health systems. By healthcare system location, missingness was greater than in the other regions in region 3 (AL, FL, GA, MS, and TN) for YOB (0.58% vs 0.03% to 0.33%), in region 5 (IA, MN, MT, ND, SD, and WI) for race (25.47% vs 9.50% to 17.38%) and marital status (34.70% vs 8.14% to 24.62%), and in region 7 (AR, LA, OK, and TX) for state (34.57% vs 3.07% to 23.69%). The percentage of missing data showed a relative decrease of 10.04%-70.18% in the past 5 years compared with the overall CRWD.
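The relative decrease reported above can be read as the drop in percent missing between the overall data and the past 5 years, expressed as a share of the overall value. The sketch below assumes that definition; the abstract does not state the exact formula, and the function name `relative_decrease` and the input values are hypothetical, since overall percentages per variable are not reported here.

```python
def relative_decrease(overall_pct, recent_pct):
    """Relative decrease (%) in percent missing, recent vs overall.

    Assumed definition: 100 * (overall - recent) / overall.
    """
    if overall_pct <= 0:
        raise ValueError("overall_pct must be positive")
    return 100.0 * (overall_pct - recent_pct) / overall_pct


# Hypothetical inputs for illustration only (not values from the study):
# 20% missing overall and 15% missing in the past 5 years.
print(relative_decrease(20.0, 15.0))
```

Under this reading, a relative decrease is distinct from the absolute difference in percentage points, which for the hypothetical inputs above would be 5 points.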
Conclusions: Missingness of SDOH varies substantially by healthcare system characteristics but is declining over time. Data quality assessment at the healthcare system level is a necessary methodological consideration when designing and conducting real-world evidence studies.