Using OMOP to capture EHR data for both AI and observational research

OMOP was initially set up to support observational research and closed-world data, but many of us are trying to use it to make our local hospital data accessible for the development of diagnostic and predictive AI models, where we need to set up positive and negative cohorts for training. At the same time, we want to support as-yet-undefined observational research questions.
There are a few areas where a mismatch between these use cases seems to lead to different ways of doing ETL. In particular, I’d be interested to hear more opinions on some points related to AI and imaging.

  1. The use of negation: I understand why for observational studies it is only the presence of a condition or result that counts, but as has been pointed out in other posts, for areas such as pathology and radiology it is important to know that something was looked for but not found. Using question/answer pairs is a natural way of dealing with this, but Atlas and other analysis tools are not set up well to exploit this structure. Should we create duplicate records for the positive findings - one using the QA structure and the other using a conventional OMOP approach (see the sketch after this list)?
  2. A condition mentioned in a pathology or radiology report based on a single study may not carry the same degree of certainty as a condition entered into a registry, where information from many sources will have been used to arrive at a diagnosis. In AI studies, we may be interested in cases where the condition suggested in one modality is subsequently modified after additional information is collected. Should all diagnoses be treated as conditions, or is it possible to differentiate between confirmed and possible conditions (observations?)? How would we know the difference?
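To make point 1 concrete, here is a rough sketch of the two shapes I mean, using made-up concept IDs and heavily simplified rows (nothing below is a vetted standard concept): the question/answer form can carry negative answers, while the conventional form only materialises a condition row when the finding is present, which is what prompts the idea of duplicating the positives.

```python
# Made-up concept IDs and simplified rows, purely to illustrate the two shapes.
Q_LUNG_NODULE_PRESENT = 100001    # hypothetical "Lung nodule present?" question concept
ANSWER_YES, ANSWER_NO = 200001, 200002
CONDITION_LUNG_NODULE = 300001    # hypothetical "Lung nodule" condition concept

# Shape 1: question/answer pairs in OBSERVATION - this can hold negatives too.
qa_rows = [
    {"person_id": 1, "observation_concept_id": Q_LUNG_NODULE_PRESENT,
     "value_as_concept_id": ANSWER_YES, "observation_date": "2024-05-01"},
    {"person_id": 2, "observation_concept_id": Q_LUNG_NODULE_PRESENT,
     "value_as_concept_id": ANSWER_NO, "observation_date": "2024-05-01"},
]

# Shape 2: conventional OMOP - only the positive finding becomes a condition row,
# which is the form Atlas cohort definitions are built to query.
condition_rows = [
    {"person_id": 1, "condition_concept_id": CONDITION_LUNG_NODULE,
     "condition_start_date": "2024-05-01"},
]

# "Duplicate records for the positive findings" would mean the ETL emits both
# the qa_rows entry and the condition_rows entry for person 1.
```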

I realise that having specific use cases would be ideal, but in our case we can’t tailor our ETL pipelines for specific projects, and I lean towards a more complete representation, which may make it a little more involved to query. Is this going to render our OMOP implementation too non-standard for collaboration?

Hi @amartel:

This is indeed a question that comes up quite often. Which means we must re-interrogate our own assumptions for the OMOP model if we want to have integrity in our work. The current Closed World assumption we are making (everything that happens has a record; if there is no record it did not happen – as opposed to if there is no record we don’t know, only a negative fact is certain) is the basis of all the analytics we are doing. Everything is based on rates of things, and the rates are roughly calculated as “patients with some record”/“all patients” of a certain population. To abandon the Closed World assumption would kill the denominator in all characterization, PLE and PLP approaches. Doing such a thing would carry a gigantic price, and we would have to go back to the drawing board not just of OHDSI, but of pretty much the entire RWE discipline.

So, the question is why would we want to make such a drastic change? What is the use case? Traditionally, I can think of two use cases that people came up with:

  • The “attic use case” – where people have source data and are wary of just throwing them away. Just because, who knows, they could turn out to be useful. I don’t mean that in a condescending way.
  • The “check list use case” – which is what usually happens in things like pathology reports, clinical trials, registries and surveys. The negative facts in those don’t actually contribute to knowledge about the patient; they are more of a method to check on the reporter, to make sure he or she really considered all the possibilities and wasn’t just lazy when leaving something out. But pathologists and radiologists are trained to check all differential diagnoses and exclude them. They are grilled at the Board Certification exam for that. Which means we must assume they are doing the right thing, whether it ends up in the report or not.

Now you are adding another use case, the “let AI follow the cognitive process of how diagnoses are made” one. I have to admit I have no idea what the models need in order to learn from the data the right way. If you ask the public LLMs, they are rock-solid in knowing alternative possibilities (differential diagnoses) in clinical diagnosis and decision making. But how they learned that – beats me. Do you know more about that?

I can give a simple example based on breast cancer. I can train a simple logistic regression model to predict whether a woman will have a cancer recurrence. One feature this model might use is whether the woman is HER2 positive or negative. If I only record HER2-positive cases, then when I pull data from my cohort of women who had breast cancer I can’t tell whether a woman is HER2 negative, the data is just missing, or the test wasn’t run. This feature is no longer useful when training my model because it is too noisy. This is a very common scenario - we can only reliably assign a negative value to a feature if we know it was assessed, and in a real-world setting it is quite possible that a test isn’t carried out (of course the presence or absence of a test is not an independent variable, but that is a whole other issue!). A closed-world database might help us train one-class / anomaly-detection models, but not multi-class models. Radiology and pathology are full of examples where treatment planning and differential diagnoses depend on knowing for sure whether something was present or absent.
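Here is a minimal sketch of that effect on synthetic data (all prevalences, effect sizes and variable names are invented for illustration): when only positive HER2 findings are recorded, “tested negative” and “never tested” collapse into the same feature value, whereas recording the result together with a “was it assessed” flag lets the model separate them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Invented ground truth: true HER2 status and whether the test was actually run.
her2_positive = rng.random(n) < 0.2       # true biological status
test_performed = rng.random(n) < 0.6      # many women are simply never tested

# Recurrence depends on true HER2 status (made-up effect size).
recurrence = rng.random(n) < np.where(her2_positive, 0.45, 0.15)

# Closed-world feature: a record exists only if the test was run AND was positive,
# so "tested negative" and "never tested" both end up as 0.
her2_closed_world = (her2_positive & test_performed).astype(float).reshape(-1, 1)

# Open-world features: the recorded result plus an explicit "was it assessed" flag.
her2_result = np.where(test_performed, her2_positive, False).astype(float)
her2_tested = test_performed.astype(float)
open_world = np.column_stack([her2_result, her2_tested])

def auc(features):
    model = LogisticRegression().fit(features, recurrence)
    return roc_auc_score(recurrence, model.predict_proba(features)[:, 1])

print("closed-world HER2 feature AUC:", round(auc(her2_closed_world), 3))
print("open-world HER2 features AUC: ", round(auc(open_world), 3))
```

On data like this the closed-world feature comes out weaker, because the truly positive but untested women are indistinguishable from the negatives, and the gap grows as the untested fraction grows.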

I don’t understand why adding negative results would make it impossible to keep using closed-world methods - can’t you just ignore the negatives? I’m not an epidemiologist or a statistician, so I apologise if this is a really naive question!

@amartel:

You are hitting the nail on the head. Yes, we have a Closed World assumption, but the data are not 100% following it. The bigger the impact something has on the life of the patient, the closer we come to meeting the assumption. E.g., “myocardial infarction” probably has close to 100% compliance, “itch on the back” close to 0%. The consequence is misclassification: we would count somebody with a myocardial infarction or an itch on the back erroneously only in the denominator, when in fact the patient should also be in the numerator. In other words, we are watering down the signal. If we went Open World and kicked out all the “unknowns” we would probably do better, except if the unknowns were highly biased in their capture one way or the other, and assuming that we get the negative data (which we usually don’t).
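A toy illustration of the dilution, with invented numbers:

```python
# Invented numbers, just to show how unrecorded positives water down a rate.
population = 10_000        # denominator under the Closed World assumption
true_cases = 1_000         # patients who really have the condition
recorded_cases = 700       # positives that actually made it into a record

true_rate = true_cases / population          # 0.10
observed_rate = recorded_cases / population  # 0.07 - the diluted signal

# The 300 unrecorded cases sit in the denominator as presumed non-cases.
# Dropping the "unknowns" (going Open World) only helps if we know who was assessed.
print(true_rate, observed_rate)
```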

Furthermore, should we not assume the providers check everything? If a pathology or genomic report doesn’t say a variant is negative – should we assume it was not tested? Or should we assume the reporter turned over every stone and, say, BRCA1 status was tested as a matter of course, and had it been positive it would be mentioned? In other words, the negative results don’t follow their own Closed World assumption, either. Estrogen, progesterone and HER2 receptors are atypical in this sense because their distribution is not heavily skewed in favor of negative, as it is for almost all other medical entities.

The other problem you have is that negatives have no time stamp in most modalities. Let’s say you want to exclude a panic attack in a patient with suspected myocardial infarction. When did the patient not have a panic attack? Every single day the MI diagnosis was recorded? Before? After? Today? Xmas day? After birth?

Finally, changing our approach as you proposed has a price: If we allowed negatives, any query would have to always check for negative and positive. Right now, they don’t. This would massively slow down the queries. But more importantly, there is a vast corpus of existing tools and analytics packages that would all become obsolete and would need refactoring and re-testing. Who is going to do that?

Summary:

  • Recording of negative facts is not reliable either; in fact, it is much more unreliable than recording of positive ones.
  • We need to assume alternatives get considered in the process of medical decision making, even though they are not recorded.
  • Negative facts have no time stamp
  • Changing the model would be prohibitively expensive

So, what do we do? You have a use case, and it needs to be attended to (or dismissed by the community as something OHDSI doesn’t get into, which is not the case here).

I could think of several approaches:

We carve out the facts that warrant an Open World assumption. There needs to be a clear justification. One could be the distribution: the Closed World assumption works so well because the entity is usually VASTLY less common than the negative. The three breast cancer receptors, for example, are different. We would also need to make sure the data support this, and it isn’t just theoretical.

Once we have those, we could

  • Create special pre-coordinated negative concepts and some kind of flag for these so they can be filtered out.
  • Add special negative_concept_id fields to each OMOP table, leaving the normal concept fields empty.
  • Come up with a convention for where to look for the negative facts, such as in special OBSERVATION records.

This would require substantial work and engagement of the community, but it could be done.
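To sketch what the third option might look like (a purely hypothetical convention; the concept IDs below are placeholders, not vetted standard concepts): negative findings go into OBSERVATION with the finding concept plus an “Absent” qualifier in value_as_concept_id, so existing closed-world queries can keep ignoring them while an AI-oriented extraction reads both.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder concept IDs - stand-ins only, not vetted OMOP standard concepts.
CONCEPT_PULMONARY_NODULE = 1111111   # hypothetical "Pulmonary nodule" finding
CONCEPT_PRESENT = 2222222            # hypothetical "Present" qualifier
CONCEPT_ABSENT = 3333333             # hypothetical "Absent" qualifier

@dataclass
class ObservationRow:
    person_id: int
    observation_concept_id: int
    observation_date: str
    value_as_concept_id: Optional[int] = None

def positive_finding(person_id: int, finding: int, date: str) -> ObservationRow:
    return ObservationRow(person_id, finding, date, value_as_concept_id=CONCEPT_PRESENT)

def negative_finding(person_id: int, finding: int, date: str) -> ObservationRow:
    """Record 'looked for but not found' under the proposed convention."""
    return ObservationRow(person_id, finding, date, value_as_concept_id=CONCEPT_ABSENT)

observations = [
    positive_finding(1, CONCEPT_PULMONARY_NODULE, "2024-03-01"),
    negative_finding(2, CONCEPT_PULMONARY_NODULE, "2024-03-02"),
]

# Conventional closed-world analyses keep working by filtering the negatives out...
closed_world_view = [o for o in observations if o.value_as_concept_id != CONCEPT_ABSENT]

# ...while an AI training extract keeps both rows and gets explicit negative labels,
# each carrying the date of the report that asserted them.
labels = {o.person_id: int(o.value_as_concept_id == CONCEPT_PRESENT) for o in observations}
print(len(closed_world_view), labels)   # 1 {1: 1, 2: 0}
```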

Thoughts?