A domain-specific natural language processing (NLP) pipeline showed strong performance in extracting clinically meaningful information from diverse clinical documents of patients with non-small cell lung cancer
Electronic Medical Records (EMRs) represent a rich but heterogeneous data source for monitoring cancer trajectory and refining treatment strategies according to tumour evolution. However, their full potential remains constrained by current technologies’ limited ability to interpret unstructured data. Natural language processing (NLP) has emerged as a promising approach to extracting clinical information from oncologists’ narrative-based documentation, as confirmed by a study published in ESMO Real World Data and Digital Oncology (ESMO Real World Data and Digital Oncology, 2026; Vol 11, 100660).
At the University Hospital of Toulouse, France, a domain-specific NLP pipeline was developed by combining rule-based and machine learning algorithms to identify key variables while preserving contextual attributes, including temporality, data granularity, and clinical certainty (for example, whether an event is hypothesised or uncertain). The NLP solution was used to analyse 1,028 clinical documents (discharge summaries and external consultation letters) from 120 patients with non-small cell lung cancer treated with oral targeted therapy. The system’s performance was then assessed by comparing the automatically extracted facts with expert-curated annotations, which served as the reference standard.
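Agreement between automatically extracted facts and a reference standard is typically summarised by precision, recall, and their harmonic mean, the F1-score. The Python sketch below illustrates that calculation on a toy example. It is not the study's evaluation code: the `Fact` schema, the exact-match criterion, and the sample data are all hypothetical, and the article does not describe how matches were actually defined.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted or annotated fact (hypothetical schema, not from the study)."""
    doc_id: str
    concept: str    # e.g. "progression", "stable disease"
    certainty: str  # e.g. "affirmed", "hypothetical"


def precision_recall_f1(predicted: set, reference: set) -> tuple:
    """Score predicted facts against reference facts using exact matching.

    Assumption: a prediction counts as correct only if every field matches
    a reference fact exactly. Real evaluations may use looser criteria.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented purely for illustration.
reference = {
    Fact("doc1", "progression", "affirmed"),
    Fact("doc1", "partial response", "affirmed"),
    Fact("doc2", "progression", "hypothetical"),
    Fact("doc3", "stable disease", "affirmed"),
}
predicted = {
    Fact("doc1", "progression", "affirmed"),
    Fact("doc2", "progression", "affirmed"),  # certainty differs from the reference
    Fact("doc3", "stable disease", "affirmed"),
}

precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case, two of the three predictions match the reference and two of the four reference facts are found, so a fact extracted with the wrong certainty attribute counts against both precision and recall.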
The domain-adapted NLP solution achieved an F1-score of 79.7% for tumour evolution concept extraction and 62.0% for temporality alignment. Commenting on these findings, Dr Rodrigo Dienstmann of Oncoclínicas & Co, Brazil, and Vall d’Hebron Institute of Oncology, Spain, Editor-in-Chief of the peer-reviewed journal ESMO Real World Data and Digital Oncology, highlights that NLP-based strategies have the potential to support real-world endpoint reconstruction, which is critical for optimising cancer care.
“Progress in real-world oncology analytics depends on bridging documentation and data,” he notes. “NLP must not only read what clinicians write but also reconstruct longitudinal trajectories in a way that is scalable, interoperable, and sufficiently robust to inform outcomes research.”