Hi @lg-ohdsi,
Thanks for the comment. I completely agree with you about the complexity involved in capturing all the symbolic logic.
We are definitely not aiming to translate the full spectrum of expert reasoning into a rigid symbolic expert system. Our current thinking is more modest, for instance, to identify a subset of recurring structural and semantic patterns that could be formalised, such as common join constraints, concept usage rules, and standard concept enforcement, while leaving room for ambiguity.
Let me share an example to illustrate what we mean. In the example below, both queries execute successfully. However, the first one does not properly follow OMOP semantic rules because it retrieves descendants from concept_ancestor without validating that they are standard concepts, belong to the correct domain, and are not invalid.
WRONG QUERY
WITH
risk_cond_concepts AS (
SELECT descendant_concept_id AS concept_id
FROM concept_ancestor
WHERE ancestor_concept_id IN (320128,40481919,3654996,201820,436940,433736)
),
anthracycline_drug_concepts AS (
SELECT descendant_concept_id AS concept_id
FROM concept_ancestor
WHERE ancestor_concept_id = 1338512
),
radiation_cond_concepts AS (
SELECT descendant_concept_id AS concept_id
FROM concept_ancestor
WHERE ancestor_concept_id = 4326962
),
structural_cond_concepts AS (
SELECT descendant_concept_id AS concept_id
FROM concept_ancestor
WHERE ancestor_concept_id IN (319835,3023670,3023680,312912,313217,4145279)
),
biomarker_measure_concepts AS (
SELECT UNNEST(ARRAY[3022246,3022275,3016762]) AS concept_id
),
risk_patients AS (
SELECT DISTINCT person_id FROM condition_occurrence co
WHERE co.condition_concept_id IN (SELECT concept_id FROM risk_cond_concepts)
UNION
SELECT DISTINCT person_id FROM drug_exposure de
WHERE de.drug_concept_id IN (SELECT concept_id FROM anthracycline_drug_concepts)
UNION
SELECT DISTINCT person_id FROM condition_occurrence co2
WHERE co2.condition_concept_id IN (SELECT concept_id FROM radiation_cond_concepts)
),
exclude_structural AS (
SELECT DISTINCT person_id FROM condition_occurrence co
WHERE co.condition_concept_id IN (SELECT concept_id FROM structural_cond_concepts)
),
exclude_biomarker AS (
SELECT DISTINCT person_id FROM measurement m
WHERE m.measurement_concept_id IN (SELECT concept_id FROM biomarker_measure_concepts)
)
SELECT COUNT(DISTINCT rp.person_id) AS patient_count
FROM risk_patients rp
LEFT JOIN exclude_structural es ON rp.person_id = es.person_id
LEFT JOIN exclude_biomarker eb ON rp.person_id = eb.person_id
WHERE es.person_id IS NULL AND eb.person_id IS NULL;
CORRECTED QUERY
WITH
risk_cond_concepts AS (
SELECT ca.descendant_concept_id AS concept_id
FROM concept_ancestor ca
JOIN concept c
ON c.concept_id = ca.descendant_concept_id
WHERE ca.ancestor_concept_id IN (320128,40481919,3654996,201820,436940,433736)
AND c.standard_concept = 'S'
AND c.domain_id = 'Condition'
AND c.invalid_reason IS NULL
),
anthracycline_drug_concepts AS (
SELECT ca.descendant_concept_id AS concept_id
FROM concept_ancestor ca
JOIN concept c
ON c.concept_id = ca.descendant_concept_id
WHERE ca.ancestor_concept_id = 1338512
AND c.standard_concept = 'S'
AND c.domain_id = 'Drug'
AND c.invalid_reason IS NULL
),
radiation_cond_concepts AS (
SELECT ca.descendant_concept_id AS concept_id
FROM concept_ancestor ca
JOIN concept c
ON c.concept_id = ca.descendant_concept_id
WHERE ca.ancestor_concept_id = 4326962
AND c.standard_concept = 'S'
AND c.domain_id = 'Condition'
AND c.invalid_reason IS NULL
),
structural_cond_concepts AS (
SELECT ca.descendant_concept_id AS concept_id
FROM concept_ancestor ca
JOIN concept c
ON c.concept_id = ca.descendant_concept_id
WHERE ca.ancestor_concept_id IN (319835,3023670,3023680,312912,313217,4145279)
AND c.standard_concept = 'S'
AND c.domain_id = 'Condition'
AND c.invalid_reason IS NULL
),
biomarker_measure_concepts AS (
SELECT c.concept_id
FROM concept c
WHERE c.concept_id IN (3022246,3022275,3016762)
AND c.standard_concept = 'S'
AND c.domain_id = 'Measurement'
AND c.invalid_reason IS NULL
),
risk_patients AS (
SELECT DISTINCT person_id FROM condition_occurrence co
WHERE co.condition_concept_id IN (SELECT concept_id FROM risk_cond_concepts)
UNION
SELECT DISTINCT person_id FROM drug_exposure de
WHERE de.drug_concept_id IN (SELECT concept_id FROM anthracycline_drug_concepts)
UNION
SELECT DISTINCT person_id FROM condition_occurrence co2
WHERE co2.condition_concept_id IN (SELECT concept_id FROM radiation_cond_concepts)
),
exclude_structural AS (
SELECT DISTINCT person_id FROM condition_occurrence co
WHERE co.condition_concept_id IN (SELECT concept_id FROM structural_cond_concepts)
),
exclude_biomarker AS (
SELECT DISTINCT person_id FROM measurement m
WHERE m.measurement_concept_id IN (SELECT concept_id FROM biomarker_measure_concepts)
)
SELECT COUNT(DISTINCT rp.person_id) AS patient_count
FROM risk_patients rp
LEFT JOIN exclude_structural es ON rp.person_id = es.person_id
LEFT JOIN exclude_biomarker eb ON rp.person_id = eb.person_id
WHERE es.person_id IS NULL AND eb.person_id IS NULL;
The corrected version explicitly enforces:
- standard_concept = ‘S’
- appropriate domain_id
- invalid_reason IS NULL
From a purely syntactic SQL perspective, both queries are valid. But from an OMOP semantic perspective, only one is correct.
This is the type of high-impact, recurring validation pattern we are hoping to formalise, not the entirety of expert reasoning, but specific guardrails that significantly reduce semantic errors.
Perhaps a more realistic goal is not to fully encode expert reasoning, but to guide LLMs toward safer abstractions and enforce a limited set of high-impact validation constraints.