Structured Data Meets Large Language Models (LLMs)

By ci2 Team

Beck Olson, a computational physicist and scientific software developer at the University of California, San Francisco Center for Intelligent Imaging (UCSF ci2), recently presented his work on an LLM-powered data assistant at the ci2 SRG Pillar meeting. His presentation, "Towards an LLM-Powered Data Assistant for Structured Data Sources," showcased a collaboration with Dr. Duygu Tosun-Turgut and Adam Diaz.

The project used data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), a multi-center study aimed at validating Alzheimer's biomarkers. The dataset included CSV files, text, and imaging data. As Olson described, "It was very clear that we're dealing with a very complex data set… you can have temporal discrepancies, you can have site discrepancies…all these challenges lead to reduction in accessibility."

To address these issues, the team built a privacy-aware system around LLMs. Rather than sending data to an external LLM, the system generates a local SQLite database from the CSV files. "We want to make sure the data stays where we want it to," Olson emphasized.
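As a rough illustration of that first step, the sketch below loads a CSV file into a local SQLite table using only the Python standard library. It is not the team's code: the function name `load_csv_into_sqlite`, the table name `mri_exams`, and the sample columns and rows are hypothetical, and every column is stored as TEXT for simplicity.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, table_name, csv_file):
    """Create table_name from a CSV file object and insert every row as TEXT."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first CSV line supplies the column names
    # Quote identifiers so column names containing spaces still work.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    conn.commit()


# Hypothetical sample data; a real export would be opened with open(path, newline="").
sample_csv = io.StringIO(
    "patient_id,visit,field_strength\n"
    "P001,bl,3T\n"
    "P001,m12,3T\n"
    "P002,bl,1.5T\n"
)

conn = sqlite3.connect(":memory:")  # a file path here would persist the database
load_csv_into_sqlite(conn, "mri_exams", sample_csv)
row_count = conn.execute('SELECT COUNT(*) FROM "mri_exams"').fetchone()[0]
print(row_count)
```

Because the database lives on the local machine, nothing in this step requires a network connection.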

The system's core is an "LLM-powered query engine" that interprets natural language questions and generates SQL queries, which are then executed locally. The model therefore never accesses the raw data, yet it can still explain the logic behind each query.
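One way to picture this separation is sketched below, under stated assumptions: only the table definitions are placed in the prompt, a stand-in function plays the role of the external model, and the returned SQL runs against the local database. The names `describe_schema`, `build_prompt`, `fake_llm`, and `run_locally`, the `mri_exams` table, and the canned reply are all hypothetical; the article does not describe the actual prompt or model API.

```python
import sqlite3


def describe_schema(conn):
    """Return the CREATE TABLE statements only; no row values are read."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(sql for (sql,) in rows)


def build_prompt(schema, question):
    """Combine the schema and the user's question into the text sent to the model."""
    return (
        "You translate questions into SQLite SELECT statements.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Reply with one SQL query and a short explanation."
    )


def fake_llm(prompt):
    """Stand-in for the external model call (hypothetical canned reply)."""
    return {
        "sql": (
            "SELECT COUNT(*) FROM ("
            "SELECT patient_id FROM mri_exams "
            "WHERE field_strength = '3T' "
            "GROUP BY patient_id HAVING COUNT(*) > 1)"
        ),
        "explanation": "Counts patients with more than one 3T exam row.",
    }


def run_locally(conn, sql):
    """Execute the generated SQL on the local database, allowing SELECT only."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are executed")
    return conn.execute(sql).fetchall()


# Hypothetical local data; these values never appear in the prompt.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mri_exams (patient_id TEXT, visit TEXT, field_strength TEXT)"
)
conn.executemany(
    "INSERT INTO mri_exams VALUES (?, ?, ?)",
    [("P001", "bl", "3T"), ("P001", "m12", "3T"),
     ("P002", "bl", "1.5T"), ("P003", "bl", "3T")],
)

prompt = build_prompt(
    describe_schema(conn), "How many patients have more than one MRI, 3T exam?"
)
reply = fake_llm(prompt)
result = run_locally(conn, reply["sql"])
print(result)
```

The point of the design is that the prompt carries column names and types but no patient rows, so the only thing that crosses the boundary to the model is the schema and the question.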

A web-based UI lets users ask questions like "How many patients have more than one MRI, 3T exam?" and receive JSON-formatted responses. Each response includes the generated query, an explanation, and a validation step.
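The article does not specify the response schema or what the validation step checks, so the following is only a guess at what such a payload might look like. The field names `query`, `explanation`, and `validation` are hypothetical, and the validation shown here is an assumption: it asks SQLite to compile the statement with `EXPLAIN`, which catches syntax errors and unknown tables without running the query itself.

```python
import json
import sqlite3


def validate_sql(conn, sql):
    """Assumed validation: have SQLite compile the query without running it."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return {"valid": True, "error": None}
    except sqlite3.Error as exc:
        return {"valid": False, "error": str(exc)}


def build_response(conn, sql, explanation):
    """Bundle query, explanation, and validation result as a JSON string."""
    return json.dumps(
        {
            "query": sql,
            "explanation": explanation,
            "validation": validate_sql(conn, sql),
        },
        indent=2,
    )


# Hypothetical local table so the validation has a schema to check against.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mri_exams (patient_id TEXT, visit TEXT, field_strength TEXT)"
)

good_response = build_response(
    conn,
    "SELECT COUNT(DISTINCT patient_id) FROM mri_exams WHERE field_strength = '3T'",
    "Counts distinct patients with at least one 3T exam.",
)
bad_response = build_response(
    conn,
    "SELECT * FROM missing_table",
    "References a table that does not exist.",
)
print(good_response)
```

Returning the query alongside the answer lets a user inspect exactly what was run, which fits the transparency Olson highlights.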

To test the tool, the team used actual user-submitted questions from ADNI's "Ask the Experts" portal. One query about tracking patients from early to late mild cognitive impairment (MCI) was accurately addressed through schema interpretation.

Olson cited the tool's speed, usability, and transparent output as major strengths. However, he noted its reliance on high-quality documentation and its limited ability to infer unspecified values. Planned improvements include support for more data types, better prompt engineering, and integration with other resources like UCSF's Information Commons. "Something like this could do that," Olson noted.

This project demonstrates a practical use of LLMs in navigating complex datasets securely, making structured health data more accessible through natural language.

To learn more about the upcoming SRG Pillar meetings, visit the ci2 events page.