Resources on SQL Query Standards for OMOP Compliance

Hi everyone,

I am currently working on a project focused on improving the explainability of AI-generated SQL queries. One approach we are exploring is the development of a sanitiser/validator that reviews LLM-generated SQL and checks it against a set of rules to ensure the query complies with OMOP conventions.

For example, the sanitiser might validate that:

  • tables referenced belong to the OMOP CDM schema.
  • clinical concepts are identified using concept_id rather than string matching.
  • standard concepts are used, or that non-standard concepts are appropriately mapped to standard concepts.
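
As a rough illustration of the first rule, here is a deliberately naive Python sketch of a schema-level table check. The table list is a hand-picked subset of CDM v5.4 used purely for illustration; a real validator would load the full schema from the published DDLs and use a proper SQL parser rather than a regex:

```python
import re

# Illustrative subset of OMOP CDM v5.4 table names; a real validator
# would load the complete list from the published DDL/CSV files.
OMOP_TABLES = {
    "person", "observation_period", "visit_occurrence", "condition_occurrence",
    "drug_exposure", "procedure_occurrence", "measurement", "observation",
    "death", "concept", "concept_ancestor", "concept_relationship",
    "condition_era", "drug_era",
}

def referenced_tables(sql: str) -> set[str]:
    """Naively extract identifiers that follow FROM or JOIN.
    Limitations: CTE names defined in WITH clauses would be flagged as
    unknown unless whitelisted first, and subqueries are not handled."""
    return {
        m.lower()
        for m in re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE)
    }

def unknown_tables(sql: str) -> set[str]:
    """Return referenced tables that are not part of the CDM schema."""
    return referenced_tables(sql) - OMOP_TABLES

# 'patients' is not an OMOP CDM table, so it gets flagged.
print(unknown_tables("SELECT * FROM patients p JOIN drug_exposure de ON p.id = de.person_id"))
```

A production version would replace the regex with a real SQL parser so that aliases, subqueries, and CTE definitions are resolved correctly.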

I am looking for documentation or resources that describe what makes a SQL query compliant with OMOP standards, particularly anything that formalises best practice or defines rules that could be used in an automated validator.

Does anyone know of existing resources or have suggestions for systematically defining compliance rules for OMOP SQL queries?

Many thanks in advance.
Shihao

@sshenzha:

I don’t think there is such a thing as a valid SQL query. Apart from the fact that OMOP does not specify any database flavor or version, whether or not your query is valid depends on your use case. However, on GitHub we have a repo with generally useful OMOP queries.

Also, your vibe coder should be able to take our OMOP CDM DDLs and the model description, and create valid SQL. I don’t know if anybody has tried that. I haven’t. But I can imagine that the problem is that many of the queries rely on the inner logic of the OHDSI Standardized Vocabularies.

Let us know what you find out. This is cool stuff.

Hi @Christian_Reich

Thank you for your response. I completely agree that there isn’t a universally “valid” OMOP SQL since many queries are purpose-driven, and validity is inherently contextual.

The GitHub repository of useful OMOP queries is indeed an excellent reference point. However, the practical challenge we are encountering arises in a slightly different setting.

In our case, we are deploying a locally hosted LLM, which is significantly less capable than enterprise-grade models. In addition, we face context window limitations, meaning we cannot reliably provide the full OMOP CDM DDLs, vocabulary schema, and model documentation in every prompt. As a result, the model occasionally:

  1. References tables or fields incorrectly
  2. Fails to respect standard vs non-standard concept logic
  3. Ignores vocabulary hierarchy rules

This is why we are exploring a post-generation validator layer. The goal is not to define a universally “valid” OMOP SQL query, but rather to enforce a structured set of rules, such as schema-level validation, domain-level validation, observation-period constraints when required, and avoidance of string-based filtering on clinical meaning, etc.
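
To make the last rule concrete, a check against string-based filtering on clinical meaning might look like the hedged sketch below. The column patterns are illustrative assumptions, not an official rule set:

```python
import re

# Patterns that usually indicate string matching on clinical meaning,
# which OMOP convention discourages in favour of concept_id filters.
# These column patterns are illustrative, not exhaustive.
STRING_FILTER_PATTERNS = [
    r"\b\w*_source_value\s+(?:NOT\s+)?LIKE\b",
    r"\bconcept_name\s+(?:NOT\s+)?LIKE\b",
    r"\b\w*_source_value\s*=\s*'",
]

def string_filter_violations(sql: str) -> list[str]:
    """Return the patterns that matched; an empty list means the rule passed."""
    return [p for p in STRING_FILTER_PATTERNS if re.search(p, sql, re.IGNORECASE)]

ok = "SELECT person_id FROM condition_occurrence WHERE condition_concept_id = 201826"
bad = "SELECT person_id FROM condition_occurrence WHERE condition_source_value LIKE '%diabetes%'"
print(string_filter_violations(ok))   # []
print(string_filter_violations(bad))  # one matching pattern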

If you have any suggestions for systematically extracting rule constraints from the CDM DDL and vocabulary documentation, I would be very interested.

Thank you again for your insight!

Best regards,
Shihao

Oh! That changes the conversation entirely, @sshenzha. :slight_smile:

Yes, you want expert tuning of your model, otherwise it will hallucinate a lot of nonsense.

Maybe instead of feeding it “valid” standard queries you iterate with it? Let it create a query, critique it, and keep doing it until it got sufficient context?

Reason is: We really don’t actively and explicitly know the entire context ourselves, and not all areas of the context are relevant, even if they are available.

Also, your vibe coder should be able to take our OMOP CDM DDLs and the model description, and create valid SQL.

I am looking for what is the authoritative / single source of truth for the model description. Is it that HTML or is it the CSV files in CommonDataModel?

Hi @Christian_Reich

  Yes, you want expert tuning of your model, otherwise it will hallucinate a lot of nonsense.
  Maybe instead of feeding it “valid” standard queries you iterate with it? Let it create a query, critique it, and keep doing it until it got sufficient context?

Yes, this is exactly the direction we are aiming for: a module that iteratively critiques LLM-generated queries until they satisfy all the “valid” conventions.
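
The loop we have in mind could be skeletonised roughly as follows. The `generate` and `critique` callables are hypothetical placeholders for the LLM call and the rule engine, respectively:

```python
from typing import Callable

def critique_loop(
    generate: Callable[[str], str],        # hypothetical: prompt -> SQL (the LLM call)
    critique: Callable[[str], list[str]],  # returns rule violations; empty list = pass
    task: str,
    max_rounds: int = 5,
) -> tuple[str, list[str]]:
    """Generate, validate, and re-prompt with the violations until the
    query passes every rule or the round budget is exhausted."""
    prompt = task
    sql = generate(prompt)
    for _ in range(max_rounds):
        violations = critique(sql)
        if not violations:
            return sql, []
        # Feed the violations back so the model can self-correct.
        prompt = f"{task}\nYour previous query violated these rules:\n" + "\n".join(violations)
        sql = generate(prompt)
    return sql, critique(sql)

# Tiny demo with stub callables (standing in for a real LLM and rule engine).
calls = {"n": 0}
def stub_generate(prompt: str) -> str:
    calls["n"] += 1
    return "SELECT bad" if calls["n"] == 1 else "SELECT good"
def stub_critique(sql: str) -> list[str]:
    return ["string filter on clinical meaning"] if "bad" in sql else []

final_sql, remaining = critique_loop(stub_generate, stub_critique, "count diabetics")
print(final_sql, remaining)  # the second attempt passes
```

In practice the interesting design question is what the `critique` function contains, which is exactly the rule set we are trying to source here.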

One of the key challenges we’ve identified is that many of the heuristics experts use to determine whether an OMOP SQL query is correct are implicit rather than formally documented. That was one of the main motivations behind this post.

We have been working on this ourselves, but we feel the need to collaborate with experienced OMOP CDM and SQL experts to help surface and formalise these implicit rules, so they can be translated into explicit validation logic.

From your perspective, who would be the best group or individuals to approach for this?

Hi @sshenzha ,

Experts determine whether an SQL query seems correct by combining semantic concepts from healthcare and SQL syntax with symbolic logic. There will be heuristics to get to the SQL, but knowledge and experience play a role as well. In some cases, experimentation will also be involved (trying the query on a small dataset, or checking the first rows of output, to see if it “looks right”). In other cases, ambiguities about what is intended will be present. Experts will then seek clarification, that is, enter a context-clarification loop (see Christian’s reply above).

My initial thought is that translating all this into symbolic logic (a formal expert system of validation rules) can get very complex. Scoping, boundaries, or other ways to express limitations (even if they cannot be completely enforced with an LLM) will be needed.

However, one thing I have noticed is that leaning on the parts of the CDM that are more semantic model than data model can be closer to the way LLMs function. Derived tables such as the *_era tables are one such example. Getting the LLM to prioritize them, and more generally to manipulate higher levels of abstraction, could lead to fewer opportunities for mistakes by the LLM.

Hi @lg-ohdsi,

Thanks for the comment. I completely agree with you about the complexity involved in capturing all the symbolic logic.

We are definitely not aiming to translate the full spectrum of expert reasoning into a rigid symbolic expert system. Our current thinking is more modest: for instance, identifying a subset of recurring structural and semantic patterns that could be formalised, such as common join constraints, concept usage rules, and standard-concept enforcement, while leaving room for ambiguity.

Let me share an example to illustrate what we mean. In the example below, both queries execute successfully. However, the first one does not properly follow OMOP semantic rules because it retrieves descendants from concept_ancestor without validating that they are standard concepts, belong to the correct domain, and are not invalid.

WRONG QUERY

WITH
risk_cond_concepts AS (
  SELECT descendant_concept_id AS concept_id
  FROM  concept_ancestor
  WHERE ancestor_concept_id IN (320128,40481919,3654996,201820,436940,433736)
),
anthracycline_drug_concepts AS (
  SELECT descendant_concept_id AS concept_id
  FROM  concept_ancestor
  WHERE ancestor_concept_id = 1338512
),
radiation_cond_concepts AS (
  SELECT descendant_concept_id AS concept_id
  FROM  concept_ancestor
  WHERE ancestor_concept_id = 4326962
),
structural_cond_concepts AS (
  SELECT descendant_concept_id AS concept_id
  FROM  concept_ancestor
  WHERE ancestor_concept_id IN (319835,3023670,3023680,312912,313217,4145279)
),
biomarker_measure_concepts AS (
  SELECT UNNEST(ARRAY[3022246,3022275,3016762]) AS concept_id
),
risk_patients AS (
  SELECT DISTINCT person_id FROM  condition_occurrence co
  WHERE co.condition_concept_id IN (SELECT concept_id FROM risk_cond_concepts)
  UNION
  SELECT DISTINCT person_id FROM  drug_exposure de
  WHERE de.drug_concept_id IN (SELECT concept_id FROM anthracycline_drug_concepts)
  UNION
  SELECT DISTINCT person_id FROM  condition_occurrence co2
  WHERE co2.condition_concept_id IN (SELECT concept_id FROM radiation_cond_concepts)
),
exclude_structural AS (
  SELECT DISTINCT person_id FROM  condition_occurrence co
  WHERE co.condition_concept_id IN (SELECT concept_id FROM structural_cond_concepts)
),
exclude_biomarker AS (
  SELECT DISTINCT person_id FROM  measurement m
  WHERE m.measurement_concept_id IN (SELECT concept_id FROM biomarker_measure_concepts)
)
SELECT COUNT(DISTINCT rp.person_id) AS patient_count
FROM risk_patients rp
LEFT JOIN exclude_structural es ON rp.person_id = es.person_id
LEFT JOIN exclude_biomarker eb ON rp.person_id = eb.person_id
WHERE es.person_id IS NULL AND eb.person_id IS NULL;

CORRECTED QUERY

WITH
risk_cond_concepts AS (
  SELECT ca.descendant_concept_id AS concept_id
  FROM  concept_ancestor ca
  JOIN  concept c 
    ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id IN (320128,40481919,3654996,201820,436940,433736)
    AND c.standard_concept = 'S'
    AND c.domain_id = 'Condition'
    AND c.invalid_reason IS NULL
),
anthracycline_drug_concepts AS (
  SELECT ca.descendant_concept_id AS concept_id
  FROM  concept_ancestor ca
  JOIN  concept c 
    ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id = 1338512
    AND c.standard_concept = 'S'
    AND c.domain_id = 'Drug'
    AND c.invalid_reason IS NULL
),
radiation_cond_concepts AS (
  SELECT ca.descendant_concept_id AS concept_id
  FROM  concept_ancestor ca
  JOIN  concept c 
    ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id = 4326962
    AND c.standard_concept = 'S'
    AND c.domain_id = 'Condition'
    AND c.invalid_reason IS NULL
),
structural_cond_concepts AS (
  SELECT ca.descendant_concept_id AS concept_id
  FROM  concept_ancestor ca
  JOIN  concept c 
    ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id IN (319835,3023670,3023680,312912,313217,4145279)
    AND c.standard_concept = 'S'
    AND c.domain_id = 'Condition'
    AND c.invalid_reason IS NULL
),
biomarker_measure_concepts AS (
  SELECT c.concept_id
  FROM  concept c
  WHERE c.concept_id IN (3022246,3022275,3016762)
    AND c.standard_concept = 'S'
    AND c.domain_id = 'Measurement'
    AND c.invalid_reason IS NULL
),
risk_patients AS (
  SELECT DISTINCT person_id FROM  condition_occurrence co
  WHERE co.condition_concept_id IN (SELECT concept_id FROM risk_cond_concepts)
  UNION
  SELECT DISTINCT person_id FROM  drug_exposure de
  WHERE de.drug_concept_id IN (SELECT concept_id FROM anthracycline_drug_concepts)
  UNION
  SELECT DISTINCT person_id FROM  condition_occurrence co2
  WHERE co2.condition_concept_id IN (SELECT concept_id FROM radiation_cond_concepts)
),
exclude_structural AS (
  SELECT DISTINCT person_id FROM  condition_occurrence co
  WHERE co.condition_concept_id IN (SELECT concept_id FROM structural_cond_concepts)
),
exclude_biomarker AS (
  SELECT DISTINCT person_id FROM  measurement m
  WHERE m.measurement_concept_id IN (SELECT concept_id FROM biomarker_measure_concepts)
)
SELECT COUNT(DISTINCT rp.person_id) AS patient_count
FROM risk_patients rp
LEFT JOIN exclude_structural es ON rp.person_id = es.person_id
LEFT JOIN exclude_biomarker eb ON rp.person_id = eb.person_id
WHERE es.person_id IS NULL AND eb.person_id IS NULL;

The corrected version explicitly enforces:

  • standard_concept = 'S'
  • appropriate domain_id
  • invalid_reason IS NULL

From a purely syntactic SQL perspective, both queries are valid. But from an OMOP semantic perspective, only one is correct.

This is the type of high-impact, recurring validation pattern we are hoping to formalise, not the entirety of expert reasoning, but specific guardrails that significantly reduce semantic errors.
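
For instance, the guardrail illustrated above could be approximated by a coarse lexical check. This is a deliberately simplified sketch; a real implementation would analyse each CTE separately rather than scanning the whole query text:

```python
def check_ancestor_expansion(sql: str) -> list[str]:
    """Heuristic guardrail: if a query expands concept sets via
    concept_ancestor, it should also join to concept and filter on
    standard_concept, domain_id and invalid_reason. This whole-query
    scan cannot tell WHICH concept-set CTE is missing the filters."""
    warnings = []
    s = sql.lower()
    if "concept_ancestor" in s:
        if "standard_concept" not in s:
            warnings.append("concept_ancestor used without a standard_concept = 'S' filter")
        if "domain_id" not in s:
            warnings.append("concept_ancestor used without a domain_id check")
        if "invalid_reason" not in s:
            warnings.append("concept_ancestor used without an invalid_reason IS NULL filter")
    return warnings

wrong = "SELECT descendant_concept_id FROM concept_ancestor WHERE ancestor_concept_id = 1338512"
print(check_ancestor_expansion(wrong))  # three warnings
```

Run against the two queries above, the first would raise all three warnings and the second none, which is exactly the kind of feedback we want to surface automatically.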

Perhaps a more realistic goal is not to fully encode expert reasoning, but to guide LLMs toward safer abstractions and enforce a limited set of high-impact validation constraints.

Hi @sshenzha,

It makes sense.

I recently experimented with a coding agent (Claude Code) for OMOP queries (here). Text and documentation about SQL queries for OMOP CDM data, and their possible pitfalls, seem to have already been learned by the generalist LLM, and the coding agent appears reasonably good at pattern-matched translation between natural-language requests and SQL, accounting for OMOP CDM specificity (both data model and semantic model). I understand that you are looking at smaller/local models, but this could provide a baseline.

However, I approached it as a dialogue with a human expert in the loop. I am not quite sure how systematic the checks would be without an expert in the loop.

In any case, I think that a set of high-impact checks as you mention would have value for a target audience of experts in healthcare / epidemiology with knowledge of SQL but new to OMOP (or appreciating automated comments about their queries). Something akin to a schema/semantic (“schemantic”?) linter for SQL. That crowd would understand that the system does not provide guarantees of correctness, and that responsibility remains with the humans, but that there is benefit in receiving feedback as questions (e.g., “the query implies XYZ, which is less common; is this what you mean?”) or suggested changes.