Background: Scoring systems are often used to assess the severity of common dermatologic diseases and, in turn, may inform subsequent treatment strategies. Many of these scores are recorded in the clinical notes of electronic health records (EHRs) as unstructured free text. Extracting severity scores from EHR data would be useful for studies that evaluate the effectiveness of pharmacologic therapies in real-world settings.
Objectives: To use and validate a natural language processing (NLP) question-answering (QA) pipeline to extract disease severity from clinical notes of dermatology patients.
Methods: Unstructured clinical notes from 2017-2021 available in patient EHRs from 5 specialty dermatology networks in the OMNY Health Database were accessed and deidentified. Sentences were paired with prompts inquiring about severity for various dermatologic conditions (including psoriasis, atopic dermatitis, hidradenitis suppurativa, alopecia areata, bullous pemphigoid, and pruritus) if the structured EHR data also contained a diagnosis code for the applicable condition. These question-context pairs were analyzed using a pipeline that featured a pretrained transformer-based NLP QA model together with various string-based quality and relevance checks. A random sample of the output (5%) was manually reviewed for accuracy. Accuracy was grouped by severity measure, by ordinal versus continuous severity scores, and by clinical assessment versus qualitative severity scores.
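The string-based quality checks described above could be sketched as a post-processing step applied to each QA model answer. The function, measure names, and plausible score ranges below are illustrative assumptions for exposition only; they are not the study's actual rules or code.

```python
import re
from typing import Optional

# Hypothetical plausible ranges per severity measure (illustrative assumptions,
# not the study's actual validation rules).
PLAUSIBLE_RANGES = {
    "psoriasis_bsa": (0.0, 100.0),   # body surface area, percent
    "hs_hurley_stage": (1.0, 3.0),   # Hurley stage I-III
    "itch_nrs": (0.0, 10.0),         # itch numeric rating scale, 0-10
}

def extract_score(answer_text: str, measure: str) -> Optional[float]:
    """Return a numeric severity score from a QA answer if it passes
    basic string-based quality checks; otherwise return None."""
    match = re.search(r"\d+(?:\.\d+)?", answer_text)
    if match is None:
        return None  # answer contains no number -> discard
    value = float(match.group())
    low, high = PLAUSIBLE_RANGES[measure]
    # Relevance check: value must fall in the measure's plausible range.
    return value if low <= value <= high else None
```

In such a design, answers that fail the checks are dropped rather than corrected, trading recall for precision in the extracted scores.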
Results: Approximately 448M note entries (sentences or paragraphs) across 4.1M patients and 16.0M encounters were accessed. After analyzing applicable question-context pairs using the pretrained NLP QA model, 5,456 dermatologic severity scores were detected. Overall, 146/273 (53%) manually reviewed scores were accurate. Accuracy varied by severity measure. The highest accuracies were observed for the psoriasis body surface area (66/74; 89%) and hidradenitis suppurativa Hurley Stage (32/36; 89%) measures. The lowest accuracy was observed for the pruritus itch numeric rating scale (3/18; 17%) measure. Accuracy was 56% (54/97) for ordinal scores, 52% (92/176) for continuous scores, 68% (114/168) for clinical assessment scores, and 31% (32/105) for qualitative severity scores.
Conclusions: An NLP QA pipeline could be a valuable tool for extracting a variety of dermatologic severity scores from unstructured clinical notes, especially given that many EHR data collection systems may not have dedicated structured fields to capture severity scores during routine clinical care. Certain types of scores may perform better than others, with clinical assessment scores appearing the most amenable to accurate extraction. Further research is needed to refine NLP models and increase accuracy.