Friends:
The data in most of the OMOP CDM tables are statements of fact. But we also have the EAV (entity-attribute-value) tables, in which the information is organized in variable/value or question/answer pairs. The variable/question is always a concept, and the value/answer may or may not be a concept. I am talking about the cases where the value is a concept as well, and there are several sources of these:
1. Historic facts, such as Family history with explicit context combined with any Condition concept,
2. Tests in which the outcome is qualitative, for example the test SARS-CoV-2 (COVID-19) IgA+IgM [Presence] in Serum or Plasma by Immunoassay with the results Equivocal, Negative or Positive,
3. Published survey instruments, such as the PhenX Toolkit of complex diseases, phenotypic traits and environmental exposures, the Patient-Reported Outcomes Measurement Information System (PROMIS), or the American Physical Therapy Association Registry panel. LOINC sometimes indicates these in brackets,
4. Survey instruments created for projects, such as the UK Biobank or the All of Us Participant Provided Information (PPI),
5. Cancer staging and the assessment measures (Cancer Modifiers), which typically are published by medical associations (Electronic Cancer Checklists from the College of American Pathologists) or research associations (Data Standards & Data Dictionary Volume II by NAACCR).
For 3) to 5) and some of 2), the variables are explicitly paired with the values through relationships such as "Has answer". For 1) that is not the case.
Why do we have it that way? Why can’t we state them as fact like a Condition? For example, why not state “Positive SARS-CoV-2 (COVID-19) IgA+IgM [Presence] in Serum or Plasma by Immunoassay” and be done with it? Why do we need to split them up?
There are a few good reasons. First, the relevant time stamp may be on the answer rather than on the question (as in 1). Second, the number of values may be so high that pre-coordinating the two would amount to duplicating the vocabulary. Third, for lab tests the test and the test result are clearly different entities, and their dual nature is intuitively expected.
But in many other constellations it is not that useful, because simple statements of facts are arbitrarily split into EAV pairs:
- Many pairs contain the key information in the question, for example the pair Do you have high blood pressure [PhenX] - Yes.
- Others have the information in the answer: Primary health condition - Cardio/pulm: Pulmonary hypertension.
- Some have it in both parts of the pair, such as Admitted from Facility - Hospice.
Really, the above concepts should be Conditions (Hypertensive disorder, Pulmonary hypertension) or Visits (Hospice), to be recorded into their respective CDM tables. And then you have all those Don’t know and No answers that are flavors of null and need no record at all.
The problem is that having the information spread over variables, values or both makes it almost undiscoverable, unless you are explicitly working with a survey and know where to look. Nobody who is developing a cohort of patients with, say, Pulmonary Hypertension is going to look into the OBSERVATION or MEASUREMENT table and dig out some survey combo.
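To make the discoverability problem concrete, here is a minimal sketch in plain Python. The tables are toy lists of dicts, and every concept ID in it is a made-up placeholder rather than a real OMOP concept ID. It only illustrates that a cohort built from CONDITION_OCCURRENCE misses a patient whose Pulmonary Hypertension exists solely as a survey pair, unless the analyst already knows the pair.

```python
# All concept IDs are hypothetical placeholders, NOT real OMOP concept IDs.
PULMONARY_HYPERTENSION = 1001    # standard Condition concept (placeholder)
PRIMARY_HEALTH_CONDITION = 2001  # survey question concept (placeholder)
ANSWER_PULM_HYPERTENSION = 2002  # survey answer concept (placeholder)

# Toy stand-ins for two CDM tables.
condition_occurrence = [
    {"person_id": 1, "condition_concept_id": PULMONARY_HYPERTENSION},
]
observation = [
    {
        "person_id": 2,
        "observation_concept_id": PRIMARY_HEALTH_CONDITION,
        "value_as_concept_id": ANSWER_PULM_HYPERTENSION,
    },
]


def cohort_from_conditions(concept_id):
    """What a typical cohort definition does: look in CONDITION_OCCURRENCE only."""
    return {
        row["person_id"]
        for row in condition_occurrence
        if row["condition_concept_id"] == concept_id
    }


def cohort_including_survey_pairs(concept_id, known_pairs):
    """Also scan OBSERVATION, but only for pairs the analyst already knows about."""
    persons = cohort_from_conditions(concept_id)
    for row in observation:
        pair = (row["observation_concept_id"], row["value_as_concept_id"])
        if pair in known_pairs:
            persons.add(row["person_id"])
    return persons


naive = cohort_from_conditions(PULMONARY_HYPERTENSION)
informed = cohort_including_survey_pairs(
    PULMONARY_HYPERTENSION,
    {(PRIMARY_HEALTH_CONDITION, ANSWER_PULM_HYPERTENSION)},
)
print(naive)     # person 2 is missing
print(informed)  # person 2 is found only because the pair was supplied by hand
```

The second function works only because somebody handed it the right question/answer pair, which is exactly the knowledge a cohort developer usually does not have.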
So, what should we do? How should we handle these?
1. We make all pairs non-standard. Then you can use them if you know they are there, but otherwise we don’t claim to be able to effectively utilize them.
2. We make them all standard (Observation or Measurement domain). Then we will get a lot of standard concepts of varying benefit, cluttering the tables.
3. We make them non-standard and map them to the single standard concepts that represent what they really are.
4. We make them non-standard and create new single pre-coordinated standard concepts combining the information. We are doing this in the Oncology WG.
5. We make the variable or question non-standard, and alter the concept_name of the values/answers so that it concatenates the concept_name of the variable/question with that of the value/answer. The All of Us folks did this for their PPI concepts.
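As a rough illustration of the "map the pair to what it really is" option, here is a minimal sketch in plain Python. All concept IDs are made-up placeholders, and the lookup table, the `route_pair` helper, and the fallback of leaving unmapped pairs in OBSERVATION are assumptions for the sake of the example, not an existing OHDSI convention or tool.

```python
# All concept IDs are hypothetical placeholders, NOT real OMOP concept IDs.
Q_HIGH_BP, Q_PRIMARY_CONDITION, Q_ADMITTED_FROM = 2001, 2002, 2003  # questions
A_YES, A_PULM_HTN, A_HOSPICE, A_DONT_KNOW = 3001, 3002, 3003, 3004  # answers
HYPERTENSIVE_DISORDER, PULMONARY_HYPERTENSION, HOSPICE = 1001, 1002, 1003

# Assumed mapping: (question, answer) -> (target CDM table, standard concept).
PAIR_TO_STANDARD = {
    (Q_HIGH_BP, A_YES): ("condition_occurrence", HYPERTENSIVE_DISORDER),
    (Q_PRIMARY_CONDITION, A_PULM_HTN): ("condition_occurrence", PULMONARY_HYPERTENSION),
    (Q_ADMITTED_FROM, A_HOSPICE): ("visit_occurrence", HOSPICE),
}

# Answers treated as flavors of null, which produce no record at all.
FLAVORS_OF_NULL = {A_DONT_KNOW}


def route_pair(question_id, answer_id):
    """Return (table, concept_id) for a survey pair, or None if it needs no record."""
    if answer_id in FLAVORS_OF_NULL:
        return None
    # Assumption: pairs without a mapping stay in OBSERVATION under the question.
    return PAIR_TO_STANDARD.get((question_id, answer_id), ("observation", question_id))


print(route_pair(Q_HIGH_BP, A_YES))        # lands in condition_occurrence
print(route_pair(Q_ADMITTED_FROM, A_HOSPICE))  # lands in visit_occurrence
print(route_pair(Q_HIGH_BP, A_DONT_KNOW))  # None: no record written
```

The hard part is of course not the lookup but curating the mapping for thousands of pairs, which is where the work comes in.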
Today, it’s a mess, and cleaning it up smells like work to me. We have more than 8,000 of them.
The other important question is: when are these concepts Observations, and when are they Measurements?
Thoughts?