Dear colleagues,
We would like to share our recent medRxiv preprint describing THESEUS (Text-guided Health-study Estimation and Specification Engine Using Strategus), a framework designed to translate natural-language study descriptions into executable OHDSI analysis code.
In observational comparative effectiveness research (CER), translating study designs into executable analytic code remains a major technical barrier. THESEUS addresses this by using large language models (LLMs) in a two-step pipeline:
- Standardization – converting free-text study descriptions into structured JSON specifications aligned with the OHDSI research framework
- Code generation – transforming these specifications into executable Strategus R scripts, with a self-auditing step to detect and correct execution errors
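As a rough illustration of the second step, the generate, execute, and repair loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the THESEUS implementation: the function names (`self_audit`, `toy_generate`, `toy_execute`), the specification fields (`target`, `comparator`, `outcome`), and the retry limit are all hypothetical, and the toy generator and executor stand in for an LLM call and an R runtime.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AuditResult:
    script: str                # last script that was produced
    attempts: int              # number of execution attempts made
    succeeded: bool            # True if some attempt ran without error
    last_error: Optional[str]  # error from the final attempt, if any


def self_audit(
    spec: dict,
    generate: Callable[[dict, Optional[str], Optional[str]], str],
    execute: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> AuditResult:
    """Generate a script, run it, and feed any error back to the generator."""
    script = generate(spec, None, None)
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        error = execute(script)  # None means the run succeeded
        if error is None:
            return AuditResult(script, attempt, True, None)
        if attempt < max_attempts:
            # Ask the generator for a repaired script, given the error.
            script = generate(spec, script, error)
    return AuditResult(script, max_attempts, False, error)


# Hypothetical stand-ins: a real system would call an LLM and an R session.
def toy_generate(spec: dict, previous: Optional[str], error: Optional[str]) -> str:
    header = f"# {spec['target']} vs {spec['comparator']} on {spec['outcome']}\n"
    body = "# (placeholder for the generated analysis code)\n"
    if error is None:
        return header + body  # first draft forgets to load the package
    return "library(Strategus)\n" + header + body


def toy_execute(script: str) -> Optional[str]:
    if "library(Strategus)" in script:
        return None
    return "toy error: Strategus is not loaded"


# Hypothetical structured specification, as step one might produce it.
spec = {"target": "drug A", "comparator": "drug B", "outcome": "event X"}
result = self_audit(spec, toy_generate, toy_execute)
print(result.succeeded, result.attempts)
```

With these toy stand-ins, the first draft fails and the repaired draft passes on the second attempt. Passing the generator and executor in as plain callables keeps the loop itself independent of any particular LLM client or execution environment.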
We evaluated the system on 15 published OHDSI-based CER studies and 5 non-OHDSI studies. The generated scripts were highly executable, and execution success was near-complete after the self-auditing step.
We also developed a prototype interface inspired by ATLAS, allowing users to input free-text study descriptions, review the structured specifications, and generate Strategus scripts.
Prototype:
We hope this approach can help lower the technical barrier for conducting reproducible observational CER within the OHDSI ecosystem, and we would greatly appreciate feedback from the community.
Best regards,
Chan, Hanjae, and Minseong