How to handle EAV variable/value pairs in MEASUREMENT or OBSERVATION - Call for input from the community

Christian_Reich · April 27, 2021, 8:08am

Friends:

The data in most of the OMOP CDM tables are statements of fact. But we have those EAV tables in which the information is organized in variable/value or question/answer pairs. The variable/question is always a concept, and the value/answer may or may not be a concept. I am talking about the cases where the value is a concept, and there are several sources of these:

1) Historic facts, such as Family history with explicit context combined with any Condition concept,
2) Tests in which the outcome is qualitative, for example the test SARS-CoV-2 (COVID-19) IgA+IgM [Presence] in Serum or Plasma by Immunoassay with the results Equivocal, Negative or Positive,
3) Published survey instruments, such as the PhenX Toolkit of complex diseases, phenotypic traits and environmental exposures or the Patient-Reported Outcomes Measurement Information System PROMIS or the American Physical Therapy Association Registry panel. LOINC sometimes indicates these in brackets,
4) Survey instruments created for projects, such as the UK Biobank or the All of Us Participant Provided Information (PPI),
5) Cancer staging and the assessment measures (Cancer Modifiers), which typically are published by Medical (Electronic Cancer Checklists from the College of American Pathologists) or Research Associations (Data Standards & Data Dictionary Volume II by NAACCR).

For 3)-5) and some of 2) the variables are explicitly paired to the values using relationships such as "Has answer". For 1) that is not the case.

Why do we have it that way? Why can’t we state them as fact like a Condition? For example, why not state “Positive SARS-CoV-2 (COVID-19) IgA+IgM [Presence] in Serum or Plasma by Immunoassay” and be done with it? Why do we need to split them up?

There are a few good reasons. First, the relevant time stamp may be on the answer rather than on the question (as in 1). Second, the number of values may be so high that pre-coordinating the two would amount to duplicating the vocabulary. And third, for lab tests the test and the test result are clearly different entities, and their dual nature is intuitively expected.

But in many other constellations it is not that useful, because simple statements of facts are arbitrarily split into EAV pairs:

Many pairs contain the key information in the question, for example the pair Do you have high blood pressure [PhenX] - Yes.
Others have the information in the answer: Primary health condition - Cardio/pulm: Pulmonary hypertension.
Some have it in both parts of the pair, such as Admitted from Facility - Hospice.

Really, the above concepts should be Conditions (Hypertensive disorder, Pulmonary hypertension) or Visits (Hospice), to be recorded into their respective CDM tables. And then you have all those Don’t know and No answers that are flavors of null and need no record at all.

The problem is that having information spread over variables, values or both makes it almost undiscoverable, unless you are explicitly working with a survey and know where to look. Nobody who is developing a cohort of patients with, say, Pulmonary Hypertension is going to look into the OBSERVATION or MEASUREMENT table and dig out some survey combo.
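To make the discoverability problem concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names follow the CDM (condition_occurrence, observation, value_as_concept_id), but the concept IDs are made-up placeholders and the two patients are invented for illustration only.

```python
import sqlite3

# Placeholder concept IDs for illustration; these are NOT real OMOP concept_ids.
PULM_HTN = 1        # Condition "Pulmonary hypertension"
Q_PRIMARY_COND = 2  # survey question "Primary health condition"
A_PULM_HTN = 3      # survey answer "Cardio/pulm: Pulmonary hypertension"

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
CREATE TABLE observation (person_id INTEGER, observation_concept_id INTEGER,
                          value_as_concept_id INTEGER);
""")
# Person 101 has the fact recorded as a Condition.
con.execute("INSERT INTO condition_occurrence VALUES (?, ?)", (101, PULM_HTN))
# Person 102 has the same fact only as a survey question/answer pair.
con.execute("INSERT INTO observation VALUES (?, ?, ?)",
            (102, Q_PRIMARY_COND, A_PULM_HTN))


def cohort_from_conditions(con):
    """What a typical cohort definition does: look in the Condition domain only."""
    rows = con.execute(
        "SELECT DISTINCT person_id FROM condition_occurrence "
        "WHERE condition_concept_id = ?", (PULM_HTN,))
    return sorted(r[0] for r in rows)


def cohort_including_survey(con):
    """Only works if the author already knows the survey pair exists."""
    rows = con.execute(
        """
        SELECT person_id FROM condition_occurrence WHERE condition_concept_id = ?
        UNION
        SELECT person_id FROM observation
        WHERE observation_concept_id = ? AND value_as_concept_id = ?
        """, (PULM_HTN, Q_PRIMARY_COND, A_PULM_HTN))
    return sorted(r[0] for r in rows)
```

In this toy data the first query misses person 102 entirely; the second finds both, but only because the question and answer concepts were known in advance.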

So, what should we do? How should we handle these?

1) We make all pairs non-standard. Then you can use them if you know they are there, but otherwise we don’t claim to be able to effectively utilize them.
2) We make them all standard (Observation or Measurement domain). Then we will get a lot of standard concepts of varying benefit, cluttering the tables.
3) We make them non-standard and map them to the single standard concepts that represent what they really are.
4) We make them non-standard and create new single pre-coordinated standard concepts combining the information. We are doing this in the Oncology WG.
5) We make the variable or question non-standard, and alter the concept_name of the values/answers in such a way that they contain a concatenated concept_name of variables/questions with the values/answers. The AllOfUs folks did this for their PPI concepts.
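As a rough illustration of the mapping option (pairs become non-standard and map to one standard concept) and the concatenation option, here is a small sketch. The pair map contains only the examples from this post; the Target type, the function names, the null-flavor set and the ": " separator are assumptions made for the sketch, not the actual vocabulary implementation or the format used by the AllOfUs team.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    domain: str        # CDM domain the fact really belongs to
    concept_name: str  # name of the single standard concept


# Illustrative map built from the examples in this post (names, not concept_ids).
PAIR_MAP = {
    ("Do you have high blood pressure [PhenX]", "Yes"):
        Target("Condition", "Hypertensive disorder"),
    ("Primary health condition", "Cardio/pulm: Pulmonary hypertension"):
        Target("Condition", "Pulmonary hypertension"),
    ("Admitted from Facility", "Hospice"):
        Target("Visit", "Hospice"),
}

# Assumed set of "flavor of null" answers; a real list would be longer.
NULL_FLAVORS = {"Don't know"}


def map_pair(question: str, answer: str) -> Optional[Target]:
    """Mapping option: return the standard target, or None if no record is needed."""
    if answer in NULL_FLAVORS:
        return None  # flavor of null: write nothing
    try:
        return PAIR_MAP[(question, answer)]
    except KeyError:
        # Unmapped pairs are surfaced rather than silently dropped.
        raise LookupError(f"no mapping for pair: {question!r} / {answer!r}")


def concatenated_name(question: str, answer: str) -> str:
    """Concatenation option: fold the question into the answer's concept_name."""
    return f"{question}: {answer}"  # separator is an assumption
```

For example, map_pair("Admitted from Facility", "Hospice") yields a Visit target, while a "Don't know" answer yields None and produces no record.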

Today, it’s a mess, and cleaning it up smells like work to me. We have more than 8,000 of them.

The other, also important question is: When are these concepts Observations, and when are they Measurements?

Thoughts?

MPhilofsky · May 5, 2021, 2:21pm

They are Observations when we don’t have an exact-ish start date (with few exceptions), or when the date represents the day the fact was recorded. These are facts you want to know happened for the inclusion/exclusion criteria of a study, but whose dates are too uncertain for the traditional cause/effect, before/after observational study. Examples: family history, most medical/surgical history, social history, survey data, etc.

aostropolets · May 6, 2021, 2:03am

The main issue as I see it is a conflict of interests: local data partners would like to preserve the source structure to some extent so that they can use it as they’re used to. On the other hand, such a model is completely non-transportable and non-transparent.

Hypertensive disorder is a great example, as nobody in a network study (or even folks 5 years later in the same institution) will look for it in the OBSERVATION table.

So an imperfect yet comforting solution (which has been partially adopted at sites) is to store the Q-A pair in OBSERVATION and the real record in the real domain table (a condition in CONDITION_OCCURRENCE, a drug in DRUG_EXPOSURE, etc.). That would arguably also allow storing negative data in OBSERVATION (like NO HYPERTENSION or NO ASPIRIN USE).
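A minimal sketch of that dual-write idea, assuming a hypothetical routing table keyed by (question, answer). The function name, the record layout and the routing entries are invented for illustration; only the CDM table names are real.

```python
# Hypothetical routing table: affirmative pairs -> (CDM table, standard concept name).
DOMAIN_ROUTES = {
    ("Do you have high blood pressure [PhenX]", "Yes"):
        ("CONDITION_OCCURRENCE", "Hypertensive disorder"),
}


def route(person_id, question, answer):
    """Return the records to write for one survey response.

    The question/answer pair is always kept in OBSERVATION, preserving the
    source structure. If the pair states a fact that belongs to another
    domain, a second record is emitted for that domain table.
    """
    records = [{
        "table": "OBSERVATION",
        "person_id": person_id,
        "question": question,
        "answer": answer,
    }]
    hit = DOMAIN_ROUTES.get((question, answer))
    if hit is not None:
        table, concept_name = hit
        records.append({
            "table": table,
            "person_id": person_id,
            "concept_name": concept_name,
        })
    return records
```

With this layout a "Yes" produces both an OBSERVATION row and a CONDITION_OCCURRENCE row, while a "No" stays in OBSERVATION only, which is where the negative data would live.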

aostropolets · May 6, 2021, 2:05am

I’d actually ping here the folks from the related thread to share their perspective: @MaximMoinat, @Qubit, @Andrew

Christian_Reich · May 6, 2021, 9:29pm

!!! The word of the day. Perfect oxymoron.

I agree. And let them. For the network and the community it’s close to irrelevant.

Right. That is solution 2) and 3).

Alexdavv · May 22, 2021, 10:51am

Unfortunately, sometimes people do (especially for historical/contextual/lifestyle facts), but right - we have to get rid of it.

The most painful part is that they’re individual concepts, not the pairs. In order to use the non-Standard combinations in studies, we need to change the model by adding the value_source_concept_id field.

You don’t say that. Remember the folks developing a cohort.

Fine. In line with 1.

And the reason is the absence of one good Standard vocabulary. It’s not the case for most of the EAV data sources. SNOMED is all fine, we just need to enrich it on demand.

Doesn’t make a lot of sense for us. In some systems (LOINC) questions share the same answers. It means that you would need to duplicate the answers using the question’s code or even a random placeholder. That’s the same as a concatenation of both.
In UKB answers are not even independent concepts. They live only in the context of questions / answer lists.
But we need a uniform approach.

And many thousands when it comes to combinations.

I think this should be derived from the source rather than from the target Domain.
In UKB we applied a source-based algorithm to split them up into Variables, Questions, Values, Answers, and to assign the respective Domains according to OMOP conventions.

Depends on the source. Sometimes registries capture it explicitly.

But agree, those you mentioned are Observations.

Is the Q-A Standard or not? If not, the records would normally be collapsed into one in the target Domain table.

Where NO is an event_concept_id, which is what we already do using the Clinical finding absent guys? Otherwise, we’re in the same position, where many of the answers are NO, Not done, Not performed, etc.

What’s the thread?

These contradict each other. The pairs cannot be Standard and non-Standard at the same time.

If we do so, we just discredit the nature of standardization. For now, they’re just another couple of thousand concepts that require standardization. Let’s not make them legal.