From bottleneck to breakthrough: rethinking clinical data collection in the age of AI


Every day, clinicians generate thousands of reports containing critical data that will never be analysed. Not because it lacks value — but because extracting it remains a bottleneck.

Author: Kevin Zarca, MD, Lifen
The content of this article was provided by Lifen and reflects the views of the company. ESMO does not endorse the content of this publication and does not take any responsibility for the accuracy, completeness or reliability of the information contained in this article.

Medical records contain a wealth of unstructured clinical data with significant potential to accelerate scientific discovery and to inform clinical decision-making. Yet this potential remains largely untapped. Harnessing real-world data for oncology research could transform the field, but a persistent challenge stands in the way: extracting structured, usable information from unstructured medical reports across multiple centers, in a form that is reliable, reproducible, and scalable.
Traditionally, this process has relied on manual abstraction by trained clinical research associates. While widely used and considered a reference standard, manual extraction is time-consuming, costly, and subject to meaningful variability, both within and across institutions. These limitations represent a recognized barrier to the scalability of multicenter research, and have driven growing interest in AI-assisted approaches as potential alternatives.

Evidence from the LUCC Initiative
A recent study published in Annals of Oncology (1) contributes to this body of evidence. Conducted within the LUCC (Large & Unified Cancer Cohort) initiative, it compared manual extraction, AI-only automation, and a hybrid Human-AI model across 10 public and private centers involving 311 lung cancer patients and 31 clinical variables.
AI alone was associated with a lower error rate than manual extraction (7.0% vs. 14.2%) and reduced inter-center variability by a factor of three. The hybrid model, in which clinical research associates reviewed only the 30% of cases flagged as uncertain by the AI, achieved the lowest error rate overall (4.4%) while maintaining a processing speed approximately four times faster than manual review.
These results are encouraging, though the study was conducted in a specific setting, and generalizability to other cancer types, broader variable sets, or different healthcare systems remains an open question.
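To make the hybrid workflow concrete, the routing step can be pictured as a simple triage: the automated extraction assigns each case a confidence score, and only the least-confident share goes to a clinical research associate. The sketch below is illustrative only. The `ExtractedCase` structure, the `confidence` field, and the `split_for_review` function are hypothetical names, and the article does not describe how the study's system actually estimated uncertainty; the 30% default simply mirrors the review share reported above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedCase:
    """One patient record after automated extraction (hypothetical structure)."""
    case_id: str
    variables: dict      # e.g. {"histological_type": "adenocarcinoma"}
    confidence: float    # assumed model confidence in [0, 1]; not taken from the study


def split_for_review(cases, review_percent=30):
    """Send the least-confident share of cases to human review.

    Returns (auto_accepted, needs_review). The share is rounded up, so a
    non-zero percentage flags at least one case whenever cases exist.
    """
    if not 0 <= review_percent <= 100:
        raise ValueError("review_percent must be between 0 and 100")
    ranked = sorted(cases, key=lambda c: c.confidence)      # least confident first
    n_review = (review_percent * len(ranked) + 99) // 100   # integer ceiling
    return ranked[n_review:], ranked[:n_review]


if __name__ == "__main__":
    # Ten synthetic cases with evenly spread, made-up confidence scores.
    demo = [ExtractedCase(f"case-{i}", {}, confidence=i / 10) for i in range(10)]
    accepted, flagged = split_for_review(demo)
    print(len(accepted), "auto-accepted;", len(flagged), "sent to review")
```

Ranking by confidence and reviewing a fixed share keeps the human workload predictable. A fixed confidence threshold would be an equally plausible design, but the review volume would then vary from one batch to the next.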

Broader Implications: Access and Equity in Research
One of the motivations for automating data extraction is its potential to lower the resource barrier to participation in multicenter research. Manual abstraction requires dedicated staffing and infrastructure that may not be available at smaller or less-resourced institutions, contributing to selection bias in research cohorts and limiting the diversity of the evidence base.
AI-assisted approaches, if successfully implemented and validated across diverse settings, could help broaden participation and enrich databases with more representative patient profiles.

Looking Ahead
The integration of AI into clinical data collection is part of a wider shift toward more scalable and standardized real-world evidence generation. Questions around data governance, model transparency, and long-term system maintenance remain active areas for further interdisciplinary work, alongside the need for prospective validation in broader patient populations.
What the current evidence suggests is that thoughtful, hybrid combinations of automated processing and human review may offer a viable path toward higher-quality, more consistent data — an important building block for the next generation of multicenter clinical registries and real-world studies in oncology.


Figure. AI-powered structured data extraction from oncology medical records. The system automatically identifies and structures key clinical variables, including mutated genes, TNM staging, histological type, metastatic status, and PD-L1 expression, directly from unstructured medical documents such as discharge letters, consultation reports, and tumor board meeting reports.

References

(1) Aldea, M. et al. Next-generation multicenter studies: using artificial intelligence to automatically process unstructured health records of patients with lung cancer across multiple institutions. Annals of Oncology 37, 490–502 (2026).

Lifen is a European healthtech company that uses AI to build large-scale, continuously updated patient cohorts from unstructured hospital records. Lifen extracts clinical variables from medical documents with full traceability to the source text, enabling multicenter real-world evidence studies across various therapeutic areas. Lifen currently operates across 800+ hospitals.
