
Negative information in OMOP CDM

We’ve recently come across an issue that has a potential to evolve into a big discussion, so we wanted to bring it up and ask for your advice.
I know well enough (and spread this knowledge) that we don’t store negative information in OMOP (e.g. absence of a disorder) and instead create our cohorts based on the absence of events. While that works for regular non-ideal datasets, it looks inappropriate for the well-structured and meticulous dataset we’re working with.
So, the issue:
We have records with yes/no/unknown answers, e.g. for metastasis to lung. We’d typically map the first reply to metastasis to lung and throw the other two away. In this case, it’s impossible to distinguish between the people who are more or less believed to be metastasis-free and the people we know nothing about. Moreover, we know the date when the organization started to collect this information.
So, imagine collection started in 2010 for lung metastasis. If we know nothing about the dataset, run a network study and include all patients that have no mention of metastases (mts) in our control cohort, we will in effect assume that nobody had mts before 2010 at all!
The possible solutions I could think of:

  1. Add a status/modifier/etc. to indicate the absence of a condition/procedure/measurement, and put it in the respective tables.
  • will be hard to track, as different tables have different columns to store info like this;
  2. Add a custom flag column to each table in the CDM.
  • will cause false-positive results in Atlas; won’t help unless you know about this feature;
  3. Create an Achilles Heel report that warns you if an event occurred for the first time after a long period of time.
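
To make the third option concrete, here is a rough sketch of such a check. It is not actual Achilles Heel code: the function name `late_starting_concepts`, the one-year threshold, the reduced table layout and the concept ids are all assumptions made for illustration, and the data live in an in-memory SQLite database.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, illustrative subsets of the CDM tables (not the full spec).
CREATE TABLE observation_period (
    person_id INTEGER, observation_period_start_date TEXT, observation_period_end_date TEXT);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
INSERT INTO observation_period VALUES
    (1, '2000-01-01', '2015-12-31'),
    (2, '2001-06-01', '2015-12-31');
-- 12345 and 67890 are made-up concept ids.
INSERT INTO condition_occurrence VALUES
    (1, 12345, '2010-03-01'),
    (2, 12345, '2011-07-15'),
    (1, 67890, '2000-02-10');
""")

def late_starting_concepts(conn, max_gap_days=365):
    """Return (concept_id, first_seen, gap_days) for concepts whose first record
    appears more than max_gap_days after the earliest observation period start."""
    rows = conn.execute("""
        SELECT co.condition_concept_id,
               MIN(co.condition_start_date),
               (SELECT MIN(observation_period_start_date) FROM observation_period)
        FROM condition_occurrence co
        GROUP BY co.condition_concept_id
    """).fetchall()
    flagged = []
    for concept_id, first_seen, db_start in rows:
        gap = (date.fromisoformat(first_seen) - date.fromisoformat(db_start)).days
        if gap > max_gap_days:
            flagged.append((concept_id, first_seen, gap))
    return flagged

for concept_id, first_seen, gap in late_starting_concepts(conn):
    print(f"Warning: concept {concept_id} first recorded {first_seen}, {gap} days after data start")
```

A warning like this only hints that collection may have started late; it cannot tell a late start apart from a condition that really was not seen earlier.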

And a broader question: should we correct our study design based on the features of a dataset?

Would love to hear your comments.


Very important point for OHDSI. Thank you @aostropolets for leading this.

The approximate prevalence of a concept matters, not just whether the concept is present or absent. I’d like to share the record counts for each concept_id within the community.
(Sharing the results from the Achilles would be the best, but it won’t be possible…)


On the first day of Medical Informatics school they already teach you that the recording of the absence of something is entirely different from the absence of the recording of something.

I would think that the best way to capture this in the CDM would be to add a flag column to each table. All OHDSI applications (like ATLAS) would need to take that field into account.

I’m sure there are reasons why this hasn’t been done yet, but I don’t know what they are. We certainly never had this information before in most of our databases. Administrative claims data only records things that are present, and even EHRs tend to just record things that are present (and therefore fail the first class of Medical Informatics :wink: ) So maybe we just never had a good use case before.


We could make it the same, @aostropolets. That’s not the problem.

The problem is that you will have two types of databases: With and without the negatives. And you will have to query them in different ways if you need, say, the incidence rate of something.

There is such a thing? :smile:

Very simple. That’s the general unspoken assumption or axiom of almost all databases. Absence of a record means absence of something. Why? Because most people are healthy. It makes no sense to enumerate the gazillions of possible diseases and declare their absence. Usually, you have a problem (or a small number of them) and see the doctor for it. Everything else is fine.

Except in some atypical situations like cancer, where there is e.g. the TNM system, which explicitly indicates absence.

I can’t think of a good solution. Both are bad:

  1. If we add a flag or so, the beautifully simple OMOP CDM queries like select * from table where concept_id=12345 go away. Instead, you will always have to check the flag.
  2. If we want to keep the information, we either
  • Add a flag into each event table, with the disadvantage above
  • Put it into OBSERVATION, with observation_concept_id=<Absence of> and value_as_concept_id=<the entity that is absent>

I think I like the last solution the most, because the negative records are so rare. Denmark is Denmark, but the rest of the world doesn’t have negatives. And in cancer we are working on a specific solution.
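
A minimal sketch of what the OBSERVATION option could look like in an ETL, using an in-memory SQLite database. The source rows, the reduced table layout and both concept ids are assumptions for illustration only: 12345 is the made-up id from the query above, and 4132135 is the "Absent" id quoted elsewhere in this thread, which should be checked against the vocabulary before use.

```python
import sqlite3

# Concept ids are placeholders; verify them against your vocabulary.
ABSENCE_CONCEPT_ID = 4132135    # "Absent", as quoted elsewhere in this thread (unverified)
LUNG_METS_CONCEPT_ID = 12345    # made-up id, borrowed from the example query above

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, illustrative subsets of the CDM tables (not the full spec).
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
CREATE TABLE observation (
    person_id INTEGER, observation_concept_id INTEGER,
    observation_date TEXT, value_as_concept_id INTEGER);
""")

# Hypothetical source rows: (person_id, answer_date, answer to "metastasis to lung?")
source_rows = [
    (1, "2012-05-01", "yes"),
    (2, "2012-05-03", "no"),
    (3, "2012-05-04", "unknown"),
]

for person_id, answer_date, answer in source_rows:
    if answer == "yes":
        # Positive answers become ordinary condition records.
        conn.execute("INSERT INTO condition_occurrence VALUES (?, ?, ?)",
                     (person_id, LUNG_METS_CONCEPT_ID, answer_date))
    elif answer == "no":
        # Negative answers go to OBSERVATION: concept = absence, value = what is absent.
        conn.execute("INSERT INTO observation VALUES (?, ?, ?, ?)",
                     (person_id, ABSENCE_CONCEPT_ID, answer_date, LUNG_METS_CONCEPT_ID))
    # "unknown" is skipped here only to keep the sketch short.

# The simple positive query is untouched by the negative records.
positives = [row[0] for row in conn.execute(
    "SELECT person_id FROM condition_occurrence WHERE condition_concept_id = ?",
    (LUNG_METS_CONCEPT_ID,))]

# Those who care about negatives look them up explicitly.
negatives = [row[0] for row in conn.execute(
    "SELECT person_id FROM observation "
    "WHERE observation_concept_id = ? AND value_as_concept_id = ?",
    (ABSENCE_CONCEPT_ID, LUNG_METS_CONCEPT_ID))]

print(positives, negatives)
```

The point of the design is that the plain query on CONDITION_OCCURRENCE needs no flag check, while the negative statements stay retrievable for anyone who knows where to look.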

My 2 cents:

When a database contains a survey about a patient’s medical history, and the provider/researcher reports (or the patient self-reports) ‘yes’, ‘no’, or ‘I don’t know’ to a series of diseases that they may have had in the past, then I would treat this artifact as a survey and capture all question/answer pairs in the OBSERVATION table, in keeping with our conventions for survey data. With the newly accepted survey table, the question set itself can even be preserved. That allows for the complete provenance of all questions and answers, no matter what the answers might be. This is not a Denmark-specific solution; it should hold for all registries, surveys, etc.

Now, beyond the capture of the question/answer pairs, there may be information within the responses that gives confidence about the PRESENCE of a particular condition. In that case, a CONDITION_OCCURRENCE record can also be created to record that the condition exists. We do not want to store the ABSENCE in the CONDITION_OCCURRENCE table, because this table is about what conditions DID occur, not what conditions were ‘asked’ about and did not occur.

There are different logical questions from an analytical use case perspective:

  1. Find people who have a condition: We want that to be primarily answered by ‘SELECT * FROM CONDITION_OCCURRENCE WHERE CONDITION_CONCEPT_ID IN {}’. Depending on the SENSITIVITY desired and time horizon, you can also look for ‘History of’ concepts that may be in the OBSERVATION table or consider if there are any survey responses that might shed some light on the disease. But we want CONDITION_OCCURRENCE to be the primary location to answer this question.

  2. Find people who DO NOT have a condition: Here, so long as we are within an OBSERVATION_PERIOD and have reason to believe we have confidence that conditions that are present should be observable, then we often make the assumption that ‘absence of evidence’ is suggestive of ‘evidence of absence’. So, again, we can rely on CONDITION_OCCURRENCE and find persons who never have a particular condition anytime in their history. Sometimes, there may be reasons why you want to be more SPECIFIC, in which case you may also assert that not only can a person not have a condition, but they can also not have any drugs indicated for that condition, no procedures typically administered for that condition, no measurements with values indicating the conditions, etc. Here too, if you want to increase SPECIFICITY, you could require a survey question/answer pair that positively confirmed ‘no medical history’. But note, that increase in specificity will come at a cost in sensitivity; at that point, we are back to the typical problem of phenotype evaluation and deciding how to handle the measurement error that exists for any cohort.
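
As a rough illustration of the second use case, the sketch below builds two "does not have the condition" cohorts on toy data: one relying on absence of evidence within an observation period, and a stricter one that additionally requires an explicit negative answer stored in OBSERVATION. The table subsets, the data, the concept ids and the way the negative answer is encoded are all assumptions for the example.

```python
import sqlite3

TARGET_CONCEPT_ID = 12345     # made-up condition concept id
ABSENCE_CONCEPT_ID = 4132135  # "Absent", as quoted elsewhere in this thread (unverified)

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, illustrative subsets of the CDM tables (not the full spec).
CREATE TABLE observation_period (
    person_id INTEGER, observation_period_start_date TEXT, observation_period_end_date TEXT);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
CREATE TABLE observation (
    person_id INTEGER, observation_concept_id INTEGER,
    observation_date TEXT, value_as_concept_id INTEGER);
INSERT INTO observation_period VALUES
    (1, '2005-01-01', '2015-12-31'),
    (2, '2005-01-01', '2015-12-31'),
    (3, '2005-01-01', '2015-12-31');
INSERT INTO condition_occurrence VALUES (1, 12345, '2011-02-01');  -- person 1 has the condition
INSERT INTO observation VALUES (3, 4132135, '2012-05-03', 12345);  -- person 3 answered "no"
""")

# Absence of evidence: observed persons with no record of the condition.
NO_RECORD_SQL = """
    SELECT DISTINCT op.person_id
    FROM observation_period op
    WHERE NOT EXISTS (
        SELECT 1 FROM condition_occurrence co
        WHERE co.person_id = op.person_id AND co.condition_concept_id = :target)
"""

# Higher specificity: additionally require an explicit negative statement.
CONFIRMED_NEGATIVE_SQL = NO_RECORD_SQL + """
    AND EXISTS (
        SELECT 1 FROM observation o
        WHERE o.person_id = op.person_id
          AND o.observation_concept_id = :absence
          AND o.value_as_concept_id = :target)
"""

no_record = sorted(r[0] for r in conn.execute(
    NO_RECORD_SQL, {"target": TARGET_CONCEPT_ID}))
confirmed_negative = sorted(r[0] for r in conn.execute(
    CONFIRMED_NEGATIVE_SQL, {"target": TARGET_CONCEPT_ID, "absence": ABSENCE_CONCEPT_ID}))

print(no_record, confirmed_negative)
```

On this toy data the stricter cohort loses person 2, who may well be free of the condition but was never asked, which is the sensitivity cost described above.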


In UK primary care databases (CPRD and THIN) absence of a condition can be recorded as an ‘Absence of condition’ observation, with the value being the concept that is negated. This ensures that it is not detected in simple searches. It is infrequently used; but there are structured data entry forms for particular sets of standardised information that might be collected (e.g. smoking status, diabetes check-up), which have additional data fields for yes/no answers. However this requires queries to search additional specific fields which are outside the scope of the base OMOP model.

This is a vexing problem. In presenting on this topic I’ll often say that the things we know positively (a surgical procedure, a drug dispense, a written prescription) may not even be real. That is, a person may not fill a written prescription, may not take a filled prescription, etc. In the past, things like diabetes and cancer were often coded (in the US claims system) to support reimbursement for a biopsy and fasting blood glucose. So a diabetes diagnosis did not mean someone had diabetes.

The absence of disease is even harder. In the EHR, we only see those things that come into that practice setting. My primary care physician may not receive information from an orthopedic physician. In the CPRD, it is not clear that once I see a consultant, that that information is accurately captured and coded. Hospital records may come back to the EHR as a PDF scan. In using any data set we need to know the sampling frame to know what is excluded and included. So we can’t say that someone doesn’t have a condition because it isn’t in the EHR. Similarly, there are conditions I may not want my employer to know, so I’ll pay out-of-pocket and they never make it into the claims system. There is no complete system.

The question at hand is a little different. How should we handle patient reports of disease absence? Wouldn’t it be analogous to using the laboratory observations in CPRD to say that someone didn’t have, say, diabetes based on their HbA1c? Should we then create an imputed field for all the possible negatives out of the observation data? Would this be prudent or useful?


Apart from the SURVEY table, which I am willing to give the benefit of the doubt, and the MEASUREMENT table, where presence and absence are canonical measurement categories, we should stay away from the negative. Because if we start this game, all our queries will have to be doubled. If you want to select for patients who have a myocardial infarction, you’ll have to find those with positive records (like what we do today) and in addition exclude those with a negative finding “Never had a heart attack”. Which will inevitably lead to contradictions (there will be that patient who has an MI record and a reported “never had an MI”). Such a system would add no value to what we have now. For incidence calculations the situation would be even worse. I wouldn’t even know how to count in those “never had…” records correctly.
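
To illustrate the doubling and the contradictions described here, a small sketch on toy data follows. The concept ids, the reduced tables and the idea of storing "never had an MI" as an absence observation are assumptions made for the example, not an established convention.

```python
import sqlite3

MI_CONCEPT_ID = 12345         # made-up id standing in for myocardial infarction
ABSENCE_CONCEPT_ID = 4132135  # "Absent", as quoted elsewhere in this thread (unverified)

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, illustrative subsets of the CDM tables (not the full spec).
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
CREATE TABLE observation (
    person_id INTEGER, observation_concept_id INTEGER,
    observation_date TEXT, value_as_concept_id INTEGER);
INSERT INTO condition_occurrence VALUES
    (1, 12345, '2010-04-01'),
    (3, 12345, '2013-09-12');
INSERT INTO observation VALUES
    (1, 4132135, '2012-01-15', 12345),   -- person 1 also reports "never had an MI"
    (2, 4132135, '2012-02-20', 12345);
""")

params = {"mi": MI_CONCEPT_ID, "absence": ABSENCE_CONCEPT_ID}

# Today's query: positive records only.
positive_only = sorted(r[0] for r in conn.execute(
    "SELECT DISTINCT person_id FROM condition_occurrence WHERE condition_concept_id = :mi",
    {"mi": MI_CONCEPT_ID}))

# The "doubled" query: positive records, minus anyone with a negative statement.
positive_not_negated = sorted(r[0] for r in conn.execute("""
    SELECT DISTINCT co.person_id
    FROM condition_occurrence co
    WHERE co.condition_concept_id = :mi
      AND NOT EXISTS (
          SELECT 1 FROM observation o
          WHERE o.person_id = co.person_id
            AND o.observation_concept_id = :absence
            AND o.value_as_concept_id = :mi)
""", params))

# Contradictions: persons with both a positive record and a negative statement.
contradictions = sorted(r[0] for r in conn.execute("""
    SELECT DISTINCT co.person_id
    FROM condition_occurrence co
    JOIN observation o
      ON o.person_id = co.person_id
     AND o.observation_concept_id = :absence
     AND o.value_as_concept_id = co.condition_concept_id
    WHERE co.condition_concept_id = :mi
""", params))

print(positive_only, positive_not_negated, contradictions)
```

Person 1 is the contradiction case: the two queries disagree about whether that person belongs in the cohort, and nothing in the data says which record to believe.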

No. This stuff is no good. We need to stick to the two axioms of observational data that folks expect to be true without explicitly calling them out:

Axiom 1: If there is a record it happened.
Axiom 2: If nothing happened there is no record (not a record of that negative fact)

Axiom 2 only applies inside an Observation Period.

So, as painful as it is, toss those negative records into the bin.

I’d suggest that absence of the event could be handled at ETL time: if you are receiving data that indicates that no occurrence was found, then use that information to remove any recorded instance of that event (within a time window) during ETL. That way, you get the axiomata @Christian_Reich calls for, while being able to use that valuable information to assert something about the condition of a patient (and possibly make your CDM more accurate). It’s up to the ETL’er to decide whom to trust: the record that states something did happen, or the record that says something didn’t.
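
One possible reading of this ETL rule, as a sketch: drop a positive source event when a negative assertion for the same person and concept falls within a window around it. The function name `drop_contradicted`, the 30-day symmetric window, the tuple layout and the choice to always trust the negative are assumptions for illustration; as said above, that trust decision is the ETL’er’s to make.

```python
from datetime import date, timedelta

def drop_contradicted(positives, negatives, window_days=30):
    """Keep positive events that have no negative assertion for the same
    person and concept within window_days before or after the event.

    Both inputs are lists of (person_id, concept_id, event_date) tuples.
    """
    window = timedelta(days=window_days)
    kept = []
    for person_id, concept_id, event_date in positives:
        contradicted = any(
            n_person == person_id
            and n_concept == concept_id
            and abs(n_date - event_date) <= window
            for n_person, n_concept, n_date in negatives
        )
        if not contradicted:
            kept.append((person_id, concept_id, event_date))
    return kept

# Hypothetical source data; 12345 is a made-up concept id.
positives = [
    (1, 12345, date(2012, 5, 1)),    # contradicted 9 days later, so dropped
    (1, 12345, date(2014, 1, 10)),   # far outside the window, so kept
    (2, 12345, date(2012, 6, 1)),    # different person, so kept
]
negatives = [(1, 12345, date(2012, 5, 10))]

kept = drop_contradicted(positives, negatives)
print(kept)
```

Note that this approach discards the negative assertions themselves once they have been used, so the CDM keeps only positive records.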

My 2¢ here:

We have (and I suspect a lot of other Epic shops might as well) an awful lot of this data - definitely not just the Danes.

It comes in to the history section of the note as “Pertinent negatives” and it’s recorded with the same interface terminology as the rest of our diagnoses - so we can map most of them directly to SNOMED, and those we can’t, we get ICD-9/10s for, so we can get to SNOMED via CONCEPT_RELATIONSHIP.

It isn’t there for everyone, true, but I don’t want to throw it out, either. I think there are a lot of use cases where it could come in handy, especially high-specificity, low-sensitivity queries of our CDM instance where someone only needs a few people (say, for trial accrual) and is willing to get rid of some patients who might actually meet their phenotype, as long as they don’t have to manually sort through hundreds to assess whether or not they have some condition that falls into their exclusion criteria. Just the other day we got a request like this.

It seems to me that the solution @Christian_Reich mentioned above is the best: put it into OBSERVATION. The observation_concept_id can be 4132135 Absent, then the value_as_concept_id can be the SNOMED code for the pertinent negative. In our case these will likely be 99% concept_ids from the condition domain. This way those who need it will have it (although we’ll end up having to explain to them where it is) and those who don’t care about it won’t be impacted by it. I don’t understand why one would need to go beyond this. At the crux of it, all we have is assertions of the absence of a given condition, so it seems best to me to just be faithful to the source, carry over the assertions, and leave it at that, rather than making global statements about the true underlying patient state.


I found this thread pretty good on how to think about negatives, but at the same time confusing, because I think it goes back and forth on what to do.

So if I wanted to store the information:
The patient did not receive respirator treatment

I would write a record to the OBSERVATION table where OBSERVATION_CONCEPT_ID = 3454241-Absence of with the VALUE_AS_CONCEPT_ID = 4203780-Respiratory therapy (or some other more applicable concept).

While a slightly different idea, here are the survey responses in the OBSERVATION table, as @Patrick_Ryan discussed earlier in the thread. It uses the OBSERVATION table in a similar way, and I think @clairblacketer does a nice job of illustrating that.

Applying the OMOP Common Data Model to Survey Data, Blacketer
