Research Assistant Professor, Vanderbilt University Medical Center, United States
Background: Mortality is one of the most important outcomes in medical product outcome assessments. However, obtaining accurate and timely data on the date and cause of death can be challenging due to limitations in vital statistics reporting in some routinely collected healthcare data sources. Several publicly available data sources that report deaths of individuals, such as obituaries and social media, could help overcome some of these limitations. However, linking public and clinical data records for the same patients, and extracting information from narrative text, remain challenging.
Objectives: We designed and implemented a natural language processing (NLP) pipeline for identifying and extracting the date of death and clinical information supporting potential contributions to death from multiple publicly available sources.
Methods: We extracted 56,000 and 252,000 public posts from the Ever-loved and Tribute archive websites, respectively. We then created annotation guidelines defining information related to the death event. Three clinician annotators were trained on these guidelines and, after reaching an acceptable agreement rate, independently annotated 1050 posts, identifying Decedent Name, Bereaved Name, Cause of Death, Date of Birth, Date of Death, and any irrelevant dates, for a total of 64,490 individual annotations. We divided the annotated posts into training, validation, and test datasets (70%, 20%, and 10%, respectively). We then experimented with deep learning transformer-based language models (BERT, RoBERTa, ALBERT, and BERTweet) to build the NLP pipeline. We used sensitivity, positive predictive value (PPV), and F1-score (the harmonic mean of sensitivity and PPV) to measure the performance of the developed model.
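The entity-level evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual evaluation code: the span representation (entity label, start offset, end offset) and the exact-match criterion are assumptions for the example.

```python
# Sketch of entity-level evaluation: sensitivity (recall), positive
# predictive value (precision), and F1-score from exact span matches
# between gold and predicted annotations. The (label, start, end) span
# format is an illustrative assumption, not the study's data structure.

def span_metrics(gold, predicted):
    """Return (sensitivity, ppv, f1) over sets of (label, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    sensitivity = true_pos / len(gold) if gold else 0.0
    ppv = true_pos / len(predicted) if predicted else 0.0
    f1 = (2 * sensitivity * ppv / (sensitivity + ppv)
          if sensitivity + ppv else 0.0)
    return sensitivity, ppv, f1

# Hypothetical post: the model finds one of two gold entities plus one
# spurious span, giving sensitivity 0.5, PPV 0.5, and F1 0.5.
gold = [("DecedentName", 0, 8), ("DateOfDeath", 30, 42)]
pred = [("DecedentName", 0, 8), ("CauseOfDeath", 50, 60)]
print(span_metrics(gold, pred))  # (0.5, 0.5, 0.5)
```

In practice, entity-level scoring of this kind is commonly computed per entity type and then micro-averaged across types.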
Results: With a relatively small annotated dataset (approximately 700 posts for training), the NLP model, fine-tuned using the RoBERTa architecture, demonstrated reasonable performance, with an F1-score, sensitivity, and PPV of 66%, 71%, and 61%, respectively. This achievement is noteworthy given the complexity of certain tasks, such as extracting clinical contributors to death, which presented challenges even for the annotators.
Conclusions: The NLP model trained on a preliminary annotated dataset demonstrated the potential for effectively extracting entities related to death events from publicly available data. Future work will expand the annotation corpus and refine the algorithm. In summary, the developed NLP model can support the detection of dates and contributing causes of death from publicly available data sources in near real time. Ascertainment of mortality from routinely collected health data can be augmented through probabilistic linkage of this information with patient health records.