CDM Metadata Improvement

ericaVoss · July 1, 2016, 1:53pm

@Patrick_Ryan, @schuemie, @Ajit_Londhe and I were tossing around some ideas on an improvement to the CDM_SOURCE table in the current CDM. We are suggesting adding a column for each domain so that we could add additional information about each domain. For example, CDM_SOURCE.VISIT_OCCURRENCE could be used to have a few sentences about how the VISIT_OCCURRENCE logic works for that specific CDM implementation. Then on applications like ACHILLES we would be able to display that information at the point where the user is exploring the data. This is an improvement to having users go find the documentation on GitHub for the CDM. Of course we will bring this up to the working group but wanted to see if people on the forums had any ideas as well.

Here is an active example for some of the JMDC domains:

Database as a whole
JMDC database consists of data from 60 Society-Managed Health Insurances covering workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). The old people (particularly those aged 66 or older) are less representative as compared with whole population in the nation. When estimated among the people who are younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as the proportion in the whole nation. JMDC data includes data on membership status of the insured people and claims data provided by insurers under contract. Claims data are derived from monthly claims issued by clinics, hospitals and community pharmacies.

Person
JMDC covers workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). The old people (particularly those aged 66 or older) are less representative as compared with whole population in the nation. When estimated among the people who are younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as the proportion in the whole nation.
Only the year of birth is available, so not the day or month.

Observation_period
The observation period is defined as the time of enrollment in the health insurance. If the member is a dependent, the enrollment depends on the enrollment of the main beneficiary.

Care_site
Care sites in JMDC are institutions where care is provided, typically a department in a hospital.

etc. …

@Vojtech_Huser already had a metadata suggestion and we added on to it:
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm:metadata

Patrick_Ryan · July 1, 2016, 2:36pm

Thanks @ericaVoss for initiating this discussion. I DEFINITELY LOVE the
idea of capturing domain-specific metadata that can describe how/where/why
source data makes it into each of the CDM domains, and think the JMDC
examples are spot on. It’d be really awesome to see the meta-data exposed
throughout ACHILLES and ATLAS and wherever we are making a choice about
data, so the researcher can have the context for what they are doing. I am
comfortable with a simple solution of adding a new field to CDM_SOURCE
called DOMAIN_ID, whereby we would establish a convention that there would
be one record with DOMAIN_ID = NULL that contains the information for the
database as a whole, and then one record for each domain that provides a
table-level descriptive summary.

The minor technical points that I don’t have fully resolved in my own mind:

is ‘domain’ the right word to use here, or are we really talking about
just tables? Domain has a specific meaning that doesn’t perfectly align
with the tables, though its close (and perhaps the action is just to
auigment the vocabulary to make the alignment more explicit). Specific
examples; a) do we need to differentiate meta-data between DRUG_EXPOSURE
and DRUG_ERA (both of which cover domain of DRUG)? b) We certainly want to
have meta data for the logic associated with OBSERVATION_PERIOD, but
currently there is no domain id for that. c) there are domains for fields
rather than tables (e.g. gender/race/unit), but do we need/want/expect
meta-data at that level?
the fields in the CDM_SOURCE table were originally designed for housing
the information about the database as a whole. Are all of those fields
relevant or useful if you have records at the domain level? If not, an
alternative solution, which I’m definitely not advocating for but just
putting out there, would be to have a table specifically for the
domain-specific metadata descriptions, CDM_SOURCE_DOMAIN, DOMAIN_ID,
METADATA_DESCRIPTION.
Ideally, metadata descriptions would be complemented with more extensive
documentation (source data dictionaries and descriptions, ETL
specifications and source code, etc.). If we are serious about improving
the transparency and reproducibility of research, these types of artifacts
should be exposed publicly so that both users of the data and consumers of
the evidence can have adequate context. However, I sense that not
everyone is fully comfortable with sharing these materials, or worse yet,
these items may not even exist in some circumstances. How can/should we
enforce some conventions or minimal standards for metadata that the
community can adhere to?

schuemie · July 6, 2016, 8:11am

Taking a step back: metadata should help the user understand how the data came about. The main reason for this is to help users understand under which circumstances the data might not truly reflect what is happening to patients. For example, for a hospital database is would be good to note that only diagnoses made in a hospital setting are recorded, and no diagnoses in primary care are available.

As a general rule, the metadata should help users answer these questions:

Why was data collected? (e.g. for billing purposes)
Under what circumstances would health care events go unrecorded in the data? (e.g. not capturing diagnoses in primary care, not capturing drugs prescribed in hospitals)
Under what circumstances would data reflect something that isn’t there? (e.g. coding of myocardial infarctions at follow-up visits)
Under what circumstances would data just be inaccurate? (e.g. visits in the CDM never span more than one calendar month, even if real visits do)

I’m not sure my JMDC example cited by @ericaVoss above completely conveyed those ideas Specifically, I think this metadata is different from what would normally be in an ETL document, since the ETL document focuses on transforming the data from one format to another, not on the origin of the data itself.

In my opinion, domains are the right level, since tables are just a technical implementation of the domains.

Chris_Knoll · July 6, 2016, 2:26pm

Just a clarification on the CDM schema change @ericaVoss described above:

I don’t think we should alter the CDM_SOURCE table for a couple of reasons:

Adding those description columns to the single row that usually appears in CDM_SOURCE will run into max-rowlength limitations on certain platforms.
Specifying the specific DOMAINS the metadata is captured for makes it clear in the schema what data we are capturing, but means that if we want to capture a new domain we have to change the schema of CDM_SOURCE.

Therefore, I suggest a table: CDM_DOMAIN_META with the following structure:

CREATE TABLE CDM_DOMAIN_META (
  DOMAIN_ID varchar(20),
  DESCRIPTION varchar(4000)
)

This will allow us to have 1 row per domain_ID, we can store many domains without running into a max-rowlength issue, and if we ever want to capture a new domain_id meta later, the schema doesn’t change.

-Chris

ericaVoss · July 17, 2016, 6:21pm

@Chris_Knoll, @schuemie & @Patrick_Ryan,

Do we want to use concepts to define DOMAIN_ID or just free text? Given the above I’m assuming just free text (e.g. PERSON, OBSERVATION_PERIOD, etc).

I don’t think we have the concepts we need in the DOMAIN vocabulary anyway.

CONCEPT_ID	CONCEPT_DESCRIPTION	DOMAIN
8	Visit	VISIT_OCCURRENCE
10	Procedure	PROCEDURE_OCCURRENCE
13	Drug	DRUG_EXPOSURE
19	Condition	CONDITION_OCCURRENCE
21	Measurement	
27	Observation
55	Provider
56	Person
57	Care site

This doesn’t include OBSERVATION_PERIOD or DEATH.

Patrick_Ryan · July 17, 2016, 7:14pm

I would recommend, as part of our proposal to cdm workgroup to ratify the
change to the cdm that we also request that the vocabulary have a complete
set of domain concepts, one per table.

In the new meta data table, we should use the domain field, not its
corresponding conceptid.

ericaVoss · July 18, 2016, 1:40am

I don’t get this last statement. What do you mean by “domain field”?

Patrick_Ryan · July 18, 2016, 1:51am

In CONCEPT, there is a field DOMAIN_ID, which uses a varchar identifier to
uniquely specify the domain value. We should follow suit in the new
CDM_SOURCE_DOMAIN table to have a DOMAIN_ID field, and use the same
DOMAIN_ID values as observed in the CONCEPT table.

ericaVoss · July 18, 2016, 2:02am

@Patrick_Ryan - Just so I’m clear, do you see us leveraging VOCABULARY_ID = ‘DOMAIN’ and just adding what we need or generating a new Vocabulary called CDM Tables?

Patrick_Ryan · July 18, 2016, 2:18am

The CONCEPT_IDs that belong to VOCABULARY_ID = ‘DOMAIN’ is just to preserve
complete definition of all entities throughout the model, something @rimma
has been pushing for that I also strongly support. But as @Christian_Reich
has implemented, which I quite like, we have DOMAIN_ID field in the CONCEPT
table that uses a varchar ID, so that you can know what you are looking at
without having to join back to the CONCEPT table to do the lookup.

Currently, the CONCEPT_IDs in VOCABULARY_ID = ‘DOMAIN’ is not complete for
all domains in the CDM, but rather it only provides concepts for those
domains which have their own standard concepts assigned. However, I
wouldn’t be surprised if that changed at some point too.

So, I propose in the vocabulary that we add CONCEPT_IDs for the following
values:

CONCEPT_NAME = ‘Observation period’, DOMAIN_ID = ‘Metadata’
CONCEPT_NAME = ‘Death’, DOMAIN_ID = ‘Metadata’
CONCEPT_NAME = ‘Cost’, DOMAIN_ID = ‘Metadata’

And then, in the CDM_SOURCE_DOMAIN table, we will have a field called
DOMAIN_ID (datatype: varchar(20)), and we will follow the same convention
as is used in CONCEPT, with each row in the table using the varchar
identifier for each table…

Christian_Reich · July 18, 2016, 5:15am

Friends:

No problem adding those.

But are you really thinking of a generic yada yada yada field explaining something that won’t be machine readable, and in many cases human-readable either? Shouldn’t we try to model the various situations, even if we don’t get everything under control right away?

I am thinking:

Domain (as you describe above)
Sampling (random, biased)
Bias by (field, value) with some ability to do “between numbers” and “list”, “children of”

So, in th case of JMDB you’d get:

Person
Biased
Year of birth 18-65

In the case of a, say, Psoriasis survey it would bias it by a the concept_id for psoriasis and all it’s children.

Thinking out loud here. We should make that searchable. Let me know.

bailey · July 18, 2016, 3:46pm

Just to add another bit of prior art to the topic of overall metadata capture, here’s PCORnet’s current attempt to describe what they think are the key metadata elements for fact tables:

CREATE TABLE harvest (
	admit_date_mgmt VARCHAR(2), 
	birth_date_mgmt VARCHAR(2), 
	cdm_version NUMERIC(10, 2), 
	datamart_claims VARCHAR(2), 
	datamart_ehr VARCHAR(2), 
	datamart_name VARCHAR(20), 
	datamart_platform VARCHAR(2), 
	datamartid VARCHAR(10) NOT NULL, 
	discharge_date_mgmt VARCHAR(2), 
	dispense_date_mgmt VARCHAR(2), 
	enr_end_date_mgmt VARCHAR(2), 
	enr_start_date_mgmt VARCHAR(2), 
	lab_order_date_mgmt VARCHAR(2), 
	measure_date_mgmt VARCHAR(2), 
	network_name VARCHAR(20), 
	networkid VARCHAR(10) NOT NULL, 
	onset_date_mgmt VARCHAR(2), 
	pro_date_mgmt VARCHAR(2), 
	px_date_mgmt VARCHAR(2), 
	refresh_condition_date DATE, 
	refresh_death_cause_date DATE, 
	refresh_death_date DATE, 
	refresh_demographic_date DATE, 
	refresh_diagnosis_date DATE, 
	refresh_dispensing_date DATE, 
	refresh_encounter_date DATE, 
	refresh_enrollment_date DATE, 
	refresh_lab_result_cm_date DATE, 
	refresh_pcornet_trial_date DATE, 
	refresh_prescribing_date DATE, 
	refresh_pro_cm_date DATE, 
	refresh_procedures_date DATE, 
	refresh_vital_date DATE, 
	report_date_mgmt VARCHAR(2), 
	resolve_date_mgmt VARCHAR(2), 
	result_date_mgmt VARCHAR(2), 
	rx_end_date_mgmt VARCHAR(2), 
	rx_order_date_mgmt VARCHAR(2), 
	rx_start_date_mgmt VARCHAR(2), 
	specimen_date_mgmt VARCHAR(2), 
	CONSTRAINT xpk_harvest PRIMARY KEY (networkid, datamartid)
);

(The _date_mgmt columns capture info about date shifting.)

I’m not particularly a fan of the wide table – this seems like the right case for a return to the EAV structure that was initially part of the cdm_source proposal – but thought the additional use case might be helpful.