Metadata extension to CDM

Vojtech_Huser · October 4, 2016, 5:15pm

I would like to announce a proposal to extend metadata in the CDM.
The proposal is described at the CDM workgroup website

http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm:metadata

The proposal is motivated by our efforts to assess the data quality of various tables. A free-text description of several domains and of the type of dataset (general population vs. something special) would be a good addition to the CDM.

Overview of all other proposals is also available here (the proposal above is listed as ‘metadata’)
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm

ericaVoss · October 4, 2016, 5:33pm

@Ajit_Londhe and I would like to participate in this discussion. I think the second description listed on the Wiki is out of date compared with what we have implemented on our side. @Ajit_Londhe even developed an ACHILLES report exposing the domain notes from the data we are testing out.

However, since implementing this beta idea on our side, @Ajit_Londhe and I have had other ideas about generalizing the table further to store other metadata about the CDM (e.g., CDM run times).

@Ajit_Londhe and I will meet up next week to do a better job of documenting our ideas. We are open to other thoughts and input.

Christian_Reich · October 4, 2016, 6:05pm

@ericaVoss:

Please add the link to this Forum thread to the Wiki page proposal, so people can find it.

Also, can you invite Dino? He has a lot of ideas and needs for metadata.

dgambone · October 4, 2016, 6:27pm

@ericaVoss yes, I’d be interested in participating as well.

ericaVoss · October 4, 2016, 7:00pm

@Christian_Reich - done, @Vojtech_Huser beat me to it.

t_abdul_basser · October 4, 2016, 7:26pm

Interested as well.

gregk · October 5, 2016, 12:52am

Please sign me up for this as well. Thanks.

ericaVoss · October 12, 2016, 1:58am

@Vojtech_Huser, @dgambone, @t_abdul_basser, & @gregk,

@Ajit_Londhe and I drafted our ideas in the Wiki. We wanted to pass them by Vojtech first since he has been thinking about this the longest. Then I figure we could send out the notes to all of you and discuss via email or set up a meeting.

Long way of saying . . . we are still thinking about this but getting some thoughts together first - I think we’ll have a more productive discussion with something tangible to provide feedback on.

Vojtech_Huser · October 13, 2016, 6:22pm

I like the new update. For others puzzled by which concept_id values can be used in the METADATA table: do an advanced search in Atlas (under vocabulary) like this (pick OMOP Domain)

ericaVoss · October 18, 2016, 3:04pm

Finally got my act together. Here is the MetaData Doodle for when we could meet to discuss:
http://doodle.com/poll/hs9p2rkmqixdhp96

@Vojtech_Huser, @Ajit_Londhe, @dgambone, @t_abdul_basser, & @gregk please let me know which date/times work for you.

ericaVoss · October 20, 2016, 1:13am

Looks like 10/26 @ 3:00PM.

@gregk, can you PM me your email so I can send the meeting invite?

ericaVoss · October 26, 2016, 2:30am

Does anyone have @gregk’s email?

Hope to see everyone at 3PM EST tomorrow! If you don’t have the meeting invite on your calendar let me know!

Daniella_Meeker · October 26, 2016, 7:52am

Timely. Just back from a metadata conference where one “next-step” subject was alignment of post-11179 metadata standards. Also, we did some metadata modeling for an ONC pilot - not super proud of it, but a start.

I’m on fumes as bandwidth goes, but perhaps we can get someone from Columbia and Eric+Josh to participate in metadata discussion here. I opened the topic last week w/ PMI.

ericaVoss · October 26, 2016, 10:46am

Daniella, email me who you want me to invite and I will.

Vojtech_Huser · October 26, 2016, 3:35pm

@ericaVoss, can you please add to the proposal two more examples of values that could go into the column METADATA_TYPE_CONCEPT_ID?

Also, if a concept exists (e.g., visit type), do we expect one row per METADATA_CONCEPT_ID, or can people still submit multiple name-value pairs (where one of the names equals the concept name)?

If no METADATA_CONCEPT_ID exists for a metadata entity (e.g., the death table was using 2014-June state death certificate data), do we populate METADATA_CONCEPT_ID with a concept of 0?
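To make these questions concrete, here is a minimal sketch of a name-value METADATA table, written in Python with SQLite so it can be run as-is. Only METADATA_CONCEPT_ID and METADATA_TYPE_CONCEPT_ID come from the proposal as discussed in this thread; the `name` and `value_as_string` columns, the placeholder concept ids (1001, 2001), and the two visit-type notes are assumptions made up for illustration. It shows one possible answer: several name-value rows may share a concept, and an entity without a concept is stored under concept 0.

```python
import sqlite3

# Hypothetical layout: only metadata_concept_id and metadata_type_concept_id
# are named in the thread; the other columns are assumptions.
DDL = """
CREATE TABLE metadata (
    metadata_concept_id      INTEGER NOT NULL,  -- 0 when no concept exists
    metadata_type_concept_id INTEGER NOT NULL,
    name                     TEXT    NOT NULL,
    value_as_string          TEXT
)
"""

VISIT_TYPE = 1001  # placeholder id, not a real vocabulary concept
FREE_TEXT = 2001   # placeholder id for a "free-text note" metadata type

ROWS = [
    # Two made-up notes attached to the same concept.
    (VISIT_TYPE, FREE_TEXT, "Visit type", "Illustrative note 1"),
    (VISIT_TYPE, FREE_TEXT, "Visit type caveat", "Illustrative note 2"),
    # No concept exists for this entity, so it is stored under concept 0.
    (0, FREE_TEXT, "Death data source", "2014-June state death certificate data"),
]


def build():
    """Create an in-memory database holding the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", ROWS)
    return conn


def rows_per_concept(conn):
    """Return (metadata_concept_id, row count) pairs, ordered by concept id."""
    cur = conn.execute(
        "SELECT metadata_concept_id, COUNT(*) FROM metadata "
        "GROUP BY metadata_concept_id ORDER BY metadata_concept_id"
    )
    return cur.fetchall()


conn = build()
print(rows_per_concept(conn))  # [(0, 1), (1001, 2)]
```

If the group decides on exactly one row per METADATA_CONCEPT_ID, a uniqueness constraint on that column would enforce it, but that would also limit concept 0 to a single row.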

ericaVoss · October 27, 2016, 12:51am

Per today’s meeting I added all the domains to the examples. Vojtech will prepare to propose at the next CDM team meeting.

Vojtech_Huser · October 31, 2016, 4:32pm

I was asked by email (@schillil) about the ability to attach metadata to columns. The current proposal provides a shell, and “Athena terminology” provides concepts. So to comment on a column, one can use a concept for it, e.g., visit type.

See the highlighted concept below. There are 54 concepts to pick from at the moment, but we anticipate additional concepts being created at the request of “metadata documenters”. Again, we want to give a generic tool for metadata that is “extensible” as we need.

Vojtech_Huser · March 7, 2017, 6:43pm

Daniella mentioned today other metadata literature, such as DDI or ISO-11179. Also, HL7 standards for metadata.

Rimma proposed making the metadata table like the observation table (value_as_string, value_as_concept_id).

Should the scope be computable metadata or metadata for humans to read?
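A rough sketch of the observation-like shape, again in Python with SQLite. Only `value_as_string` and `value_as_concept_id` come from the suggestion above; the table name, the `metadata_concept_id` column, and all concept ids are hypothetical placeholders. One way to read the scope question is that rows with a `value_as_concept_id` are computable, while rows with only `value_as_string` are notes for human readers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metadata (
        metadata_concept_id INTEGER NOT NULL,
        value_as_string     TEXT,     -- free text for human readers
        value_as_concept_id INTEGER   -- coded value a program can act on
    )
    """
)

DEATH_SOURCE = 1001        # placeholder: "death data source"
DATASET_TYPE = 1002        # placeholder: "dataset type"
GENERAL_POPULATION = 3001  # placeholder: "general population"

conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (DEATH_SOURCE, "Death table uses 2014-June state death certificate data", None),
        (DATASET_TYPE, None, GENERAL_POPULATION),
    ],
)


def computable_rows(conn):
    """Rows whose value is a concept id, usable in automated checks."""
    return conn.execute(
        "SELECT metadata_concept_id, value_as_concept_id FROM metadata "
        "WHERE value_as_concept_id IS NOT NULL"
    ).fetchall()


def human_notes(conn):
    """Free-text notes intended to be read by an analyst."""
    return [
        row[0]
        for row in conn.execute(
            "SELECT value_as_string FROM metadata WHERE value_as_string IS NOT NULL"
        )
    ]


print(computable_rows(conn))  # [(1002, 3001)]
print(human_notes(conn))
```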

Vojtech_Huser · April 4, 2017, 7:40pm

I would like to continue the metadata discussion at the upcoming CDM WG call.

I created a modified table proposal that possibly addresses some of the points raised during the last discussion.

The key is not to confuse metadata with data characterization as done by Achilles (the achilles_results table). An ETL or data warehouse insider knows a lot about a warehouse, and the point of metadata is to capture some of this “insider” knowledge so that a user (or analyst) can quickly get “semi-intimate” with the data just by reading some smart, organized notes made by the insider.

Perhaps we can propose a shell and let the community decide how best to use it to hold useful metadata content, and in phase 2 make the metadata tighter and better. The perfect should not be the enemy of the good in phase 1.

Perhaps every WG member can provide examples of metadata that they would like to capture (and post here).

Mine would be:

dataset is updated once a year (or monthly or …)
dataset reflects only data from clinical trial (not routine care)
Achilles is executed after each data refresh. achilles_results are always available
dataset has drug order data as well as pharmacy dispensation data (can study ‘patient did not fill his prescription’ questions)
weight data comes from Health Risk Assessment done by health plan (not from EHR)
PHR data is present (in OBSERVATION table) but not mapped to any standard concepts and there are no plans to do this mapping
procedure data are in local codes only (any phenotype using procedures has to tweak the standard code in the phenotype to the local code)
dataset has EHR data only (and no claims data; site has no affiliated health plan)
dataset has claims data with “sparse lab data” (e.g., all that come from an accessible source, such as LabCorp). Such data does not reflect all lab data. Inpatient lab results are not present. (not available)

UPDATE: after CDM WG April meeting - the proposal was updated with phase 1 and phase 2 scope and use cases were updated.

Vojtech_Huser · April 28, 2017, 9:15pm

To continue the discussion, I will try to tag folks to contribute example metadata they see as important.

@Christian_Reich @ericaVoss @Sigfried_Gold @rimma
This was my original input seeking post:

Perhaps every WG member can provide examples of metadata that they would like to capture

Examples from the OHDSI GitHub are:

https://github.com/OHDSI/ETL-CDMBuilder/blob/cb41fba91fc8374c1566c82f1bb78b4f6e1b4d32/source/Builders/org.ohdsi.cdm.builders.premier_v5/PremierV5/CdmSource.sql#L30


		   'Premier is a US hospital database that houses data on the inpatient and outpatient visits of 119 million people from 619 hospitals since 1999. The data represent 1 in 5 inpatient hospital stays in the US. It is a visit-centric, billing database where each visit is linked with a unique billing record. The process by which Premier is transformed to the patient-centric OMOP CDM is described in detail by Makadia and Ryan (2014) here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4371500/. Premier is not an insurance claims database.',
			'http://hicoe.jnj.com/DataSources/Premier',
			'http://www.ohdsi.org/web/wiki/doku.php?id=documentation:example_etls',
			'{0}',
			'{3}',
			'V5.0',
			'{1}'
		   );




INSERT INTO {sc}.CDM_DOMAIN_META
      (DOMAIN_ID, DESCRIPTION)
VALUES
      ('Person','Premier covers individuals from 0-90 years of age. Persons of 0 years of age are disproportionately represented because of in-hospital births. Persons of 90 years of age are overrepresented because the data capture system truncates age at 90 years. The covered population is 56% female. Race data include Black/African American, White, and other race and ethnicity data include an indicator for Hispanic/Latino.');




INSERT INTO {sc}.CDM_DOMAIN_META
      (DOMAIN_ID, DESCRIPTION)
VALUES
      ('Visit','Admission and discharge date information for a patient visit is recorded as month and year only, with the day set as the first of the month. Billing information for the visit includes the number of days since admission that a billable item or service was provided, so the maximum value of the service day is the length of stay. The order of multiple visits within a month is preserved by visit sequence information. For a subsequent visit in a month, the admission day is set as one day after the discharge date (i.e. maximum service day value) of the previous visit. Where admission date and discharge date are different months, the maximum service day value is subtracted from the discharge date to obtain the length of stay. The specific day of the visit start date and visit end date is not necessarily accurate, nor is the interval between visits. This logic assures the sequential order of visits and length of stay is accurate.');

A search for CDM_DOMAIN_META on GitHub shows many other examples.

It seems the table has 2 columns (“name-value pairs”). Can someone post the DDL for the CDM_DOMAIN_META table (or say where it can be found), @ericaVoss?

The examples above again illustrate how a user can get “semi-intimate” with a dataset by reading smart metadata notes written by the dataset custodian.
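Until the actual DDL is posted, here is a guess that is merely consistent with the INSERT statements quoted above, wrapped in Python with SQLite so it can be tried out. The column names DOMAIN_ID and DESCRIPTION come from those statements; the column types, the lack of keys, and the helper functions `load` and `describe` are assumptions. The descriptions are abbreviated from the Premier examples.

```python
import sqlite3

# Assumed DDL: the column names are taken from the quoted INSERT statements;
# the types are guesses, not the real table definition.
DDL = """
CREATE TABLE cdm_domain_meta (
    domain_id   VARCHAR(20) NOT NULL,
    description TEXT
)
"""


def load(notes):
    """Create the table in memory and insert one row per domain note."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    conn.executemany(
        "INSERT INTO cdm_domain_meta (domain_id, description) VALUES (?, ?)",
        notes.items(),
    )
    return conn


def describe(conn, domain_id):
    """Return the custodian's note for a domain, or None if there is none."""
    row = conn.execute(
        "SELECT description FROM cdm_domain_meta WHERE domain_id = ?",
        (domain_id,),
    ).fetchone()
    return row[0] if row else None


notes = {
    "Person": "Premier covers individuals from 0-90 years of age.",
    "Visit": "Admission and discharge dates are recorded as month and year only.",
}
conn = load(notes)
print(describe(conn, "Visit"))
```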