
Using ChatGPT to compose queries against the OMOP CDM

Moving this topic to its own thread (out of FOSS thread)

@Thomas_White


Is anyone having success with ChatGPT and OMOP?
I was impressed with the quality of code for questions such as:

Write a SQL query against the OHDSI CDM to build a cohort of patients with new onset diabetes

Write a SQL query against the OHDSI CDM to calculate the numerator and denominator for the HEDIS measure for Comprehensive Diabetes Care

In general, the SQL was completely accurate or only slightly off. The codes, on the other hand, were spotty. For the first one, it knew to use a code for HbA1c. For the others, it guessed wrong. This is still an accelerator, because it is easy to use Atlas to find the right set of concept codes and paste them in.

I expect that with custom ChatGPT/LLM training and/or an integration that calls the Phoebe API to get a more appropriate set of codes, it may be possible to prototype some queries via LLM prompts.

Just being a little more specific with the prompt produces more detailed results:

Write a SQL query against the OHDSI CDM to build a cohort of patients with new onset diabetes. Identify and include relevant clinical concepts and their OMOP concept IDs, include them in the query, and document the concepts with their names in a comment

-- Cohort definition: Patients with new onset diabetes

SELECT DISTINCT
  c.person_id
FROM
  -- Include all condition occurrences of diabetes
  condition_occurrence c
WHERE
  -- Limit to diabetes diagnosis
  c.condition_concept_id IN (
    -- Relevant clinical concepts for diabetes diagnosis:
    31967,  -- Diabetes mellitus type 1
    36684827,  -- Diabetes mellitus type 1 in remission
    31968,  -- Diabetes mellitus type 2
    4304378,  -- New onset diabetes mellitus
    45766052,  -- Drug-induced diabetes mellitus
    4248645,  -- Secondary diabetes mellitus
    433013,  -- Diabetes insipidus
    4304377,  -- Unspecified diabetes mellitus
    435311,  -- Hyperosmolar hyperglycemic nonketotic syndrome
    443452,  -- Diabetic ketoacidosis
    45757462,  -- Diabetic foot ulcer
    45766622,  -- Diabetic retinopathy
    45766623,  -- Diabetic nephropathy
    45757464,  -- Diabetic neuropathy
    45766625   -- Diabetes mellitus unspecified complication
  )
  AND NOT EXISTS (
    -- Exclude patients with any diabetes diagnosis prior to a 365-day lookback period
    SELECT
      1
    FROM
      condition_occurrence c2
    WHERE
      c2.person_id = c.person_id
      AND c2.condition_concept_id IN (
        31967,  -- Diabetes mellitus type 1
        36684827,  -- Diabetes mellitus type 1 in remission
        31968,  -- Diabetes mellitus type 2
        4304378,  -- New onset diabetes mellitus
        45766052,  -- Drug-induced diabetes mellitus
        4248645,  -- Secondary diabetes mellitus
        433013,  -- Diabetes insipidus
        4304377,  -- Unspecified diabetes mellitus
        435311,  -- Hyperosmolar hyperglycemic nonketotic syndrome
        443452,  -- Diabetic ketoacidosis
        45757462,  -- Diabetic foot ulcer
        45766622,  -- Diabetic retinopathy
        45766623,  -- Diabetic nephropathy
        45757464,  -- Diabetic neuropathy
        45766625   -- Diabetes mellitus unspecified complication
      )
      AND c2.condition_start_date < c.condition_start_date - INTERVAL '365' DAY
  );
2 Likes

The general model already works well, with some edits needed here and there.

I am currently looking into tuning ChatGPT to write queries and analyses/visuals, drawing inspiration from Clark C. Evans and Kyrylo Simonov's semantic approach with FunSQL. I think the harder task will come after data pulls and descriptives, when inferential and other statistics require subject-domain knowledge of the literature.

The semantic approach has been intuitive for those without a SQL background when describing the provenance of data, which is something I would prioritize along with reproducibility when using LLMs to generate code.

Hello. I am curious how you are using FunSQL with ChatGPT.

Are you mapping your databases to FunSQL library concepts and then using ChatGPT to write those FunSQL queries?

My background is in physician informatics, with an understanding of infrastructure but no direct coding experience. I have explored using ChatGPT to create SPARQL queries for a semantic knowledge graph I developed, and I am curious about your approach with OMOP.

1 Like

Through the application programming interface (API) offered by OpenAI, it's possible to tune their model by submitting (submission, response) pairs. I keep each in a markdown file with some tagging.

My approach has been taking three forms of the same query:

  1. Natural language in the form of a question.
  2. FunSQL Julia code.
  3. T-SQL

For each form, I know what the result should be and have tested that each returns roughly the same information from the OMOP model.

I tag the Julia/FunSQL and T-SQL versions as Quarto-style source code blocks; the natural-language version is not tagged.

I then have a Python script that pulls that query set, builds all six pairings (the 3 forms yield 6 ordered submission/response pairs), and submits them to the training API.
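Roughly, the pairing step looks like the sketch below (illustrative only: the file layout, the example query content, and the chat-format JSONL are assumptions, not my exact pipeline):

# Illustrative sketch: builds the six ordered (submission, response) pairs
# from the three forms of one query and writes them as chat-format JSONL
# ready for upload to the fine-tuning endpoint. The example strings are
# placeholders, not real pipeline content.
import itertools
import json

# One query set: the same question in its three forms.
query_set = {
    "natural": "Which patients have new onset diabetes?",
    "funsql": "From(condition_occurrence) |> Where(Get.condition_concept_id .== 201826)",
    "tsql": "SELECT DISTINCT person_id FROM condition_occurrence WHERE condition_concept_id = 201826;",
}

with open("training_pairs.jsonl", "w", encoding="utf-8") as f:
    # 3 forms -> 6 ordered (submission, response) pairs.
    for src, dst in itertools.permutations(query_set, 2):
        record = {
            "messages": [
                {"role": "system", "content": f"Translate {src} to {dst} for the OMOP CDM."},
                {"role": "user", "content": query_set[src]},
                {"role": "assistant", "content": query_set[dst]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# training_pairs.jsonl is then uploaded to the fine-tuning API.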

I have about 10 queries written and confirmed. From some initial review, tuning starts to become worthwhile above roughly 100 submissions. Each full query set takes some time to package and confirm.

2 Likes

TL;DR: You can use PYOMOP for LLM-based querying of the OMOP CDM using this simple pattern.

I've added experimental support for LLM-based querying of OMOP CDM databases to pyomop. You can use any LLM vendor, including Vertex, Azure, or OpenAI (use any available chat model). To use this feature, install from the develop branch as shown in the example. This feature is not yet in production.

If there is interest, I would be happy to add GPT4All integration as well. Any feedback would be great, please use GitHub Issues. If you plan to build complex chains for concept ID mapping, please consider using MEDPrompt.

Update: For those who reached out regarding the technical implementation: PyOMOP injects the OMOP table schema and foreign-key relationships into the prompt, using LlamaIndex's Retriever for in-context learning along with the query. You can pass the output through MEDPrompt for a generative response. Feel free to experiment! Pull requests are welcome!
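For anyone who wants to see the general shape of that pattern, here is a generic sketch using LlamaIndex's text-to-SQL query engine. This is not PyOMOP's actual implementation; module paths reflect a recent llama-index release and may differ, the SQLite file is a placeholder, and it assumes a default LLM (e.g., an OpenAI key) is configured.

# Generic sketch of the schema-injection pattern described above, not PyOMOP's
# actual code. Assumes a recent llama-index release and a configured LLM.
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Reflect the OMOP tables of interest; their schemas (including foreign keys)
# are serialized into the prompt as context for the LLM.
engine = create_engine("sqlite:///cdm.sqlite")  # placeholder CDM database
sql_database = SQLDatabase(
    engine, include_tables=["person", "condition_occurrence", "concept"]
)

query_engine = NLSQLTableQueryEngine(sql_database=sql_database)
response = query_engine.query("How many persons have a type 2 diabetes condition record?")
print(response)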

I believe this is an exciting area of research. However, I also think there's a gap between toy examples and actual clinical implementation that needs to be bridged before it becomes useful. I highly recommend reading @Juan_Banda's contribution on this topic.

1 Like

I have been experimenting with the new GPTs in ChatGPT and came up with an interesting idea: building a "Multilingual Athena". This could enable people to search Athena for medical terms in their native language and find the corresponding standard concepts. I am not certain about its usefulness, but I wanted to share it with the community. Here's the link: ChatGPT - Multilingual Athena.

1 Like

Interesting! @Alvaro_Andres_Alvare: Does it use ChatGPT plugins?

1 Like

@beapen, yes, it actually works with Actions to connect to athena.ohdsi.org. Just to give you a practical example, check out this search done in Chinese for 'Pneumonia': ChatGPT
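For context, the Action boils down to a simple HTTP search against Athena. The endpoint path and parameters below are assumptions inferred from how the Athena web UI searches, not a documented public API, so treat this only as a rough illustration:

# Rough illustration of the kind of call a GPT Action could make against
# athena.ohdsi.org. The endpoint path and parameters are assumptions, not a
# documented public API, and may change.
import requests

resp = requests.get(
    "https://athena.ohdsi.org/api/v1/concepts",
    params={"query": "pneumonia", "pageSize": 5},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # inspect the returned concept list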

1 Like

I have been slowly working on an API via FastAPI/OpenAPI for a GPT to reference. I wonder if PYOMOP could be used in some way for this? I'd be happy to cover some of the server and other fees out of pocket for testing.
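To make the idea concrete, here is a minimal sketch of such a service (not the actual API under development; the concept lookup is a hard-coded stub standing in for a real CDM vocabulary query). FastAPI serves the generated OpenAPI schema at /openapi.json, which a custom GPT Action can import.

# Minimal sketch of the kind of FastAPI service a custom GPT could reference
# via its OpenAPI schema. The concept lookup is a stub, not a real CDM query.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="OMOP helper API", version="0.1.0")

class Concept(BaseModel):
    concept_id: int
    concept_name: str

# Stub vocabulary; a real implementation would query the CDM vocabulary tables.
_FAKE_VOCAB = {
    "type 2 diabetes mellitus": Concept(concept_id=201826, concept_name="Type 2 diabetes mellitus"),
}

@app.get("/concepts/search", response_model=list[Concept])
def search_concepts(q: str) -> list[Concept]:
    """Return concepts whose name contains the query string."""
    return [c for name, c in _FAKE_VOCAB.items() if q.lower() in name]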

I agree with @schuemie that there is a large bridge here that needs to be built, and I think part of that is in the API development space.

The main issue I have had so far is controlling hallucinations, particularly with R and SQL code and concept mappings.

2 Likes

One idea to combat hallucinations could be a "tree-structured expert" setup. Picture specialized agents (GPTs) as branches, each handling a different task such as R or SQL. A "master" agent acts as the trunk, choosing the right expert based on the input to ensure reliable and accurate responses. For complex tasks, grow more specialized sub-agents for better performance.
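As a toy illustration of the routing idea (the keyword check below just stands in for a real master agent or classifier, and the prompts are placeholders):

# Toy illustration of the trunk-and-branch routing: a "master" step decides
# which expert (here just a system prompt) handles the request.
EXPERTS = {
    "sql": "You write OHDSI/OMOP CDM SQL. Only emit SQL with verified concept IDs.",
    "r": "You write R code using OHDSI HADES packages such as DatabaseConnector.",
}

def route(task: str) -> str:
    """Pick the expert system prompt for a task (trunk -> branch)."""
    lowered = task.lower()
    if "sql" in lowered or "query" in lowered or "cohort" in lowered:
        return EXPERTS["sql"]
    return EXPERTS["r"]

if __name__ == "__main__":
    print(route("Write a SQL query for a new-onset diabetes cohort"))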

1 Like

Hi, this is Chester Guan, and I am a data scientist currently working with the OMOP CDM. I am very interested in this topic and would like to contribute if needed.

I saw that some good work has already been done:

  1. Criteria2Query: https://www.ohdsi.org/web/wiki/lib/exe/fetch.php?media=resources:chi_criteria2query_ohdsi_symposium.pdf
  2. Nostos: GitHub - OHDSI/Nostos: Navigate OMOP-structured data via text-to-SQL

What are the key questions or problems we are trying to solve at the moment?
