
That's so meta

There have been a number of discussions about providing metadata for our data objects (concept sets, cohort definitions, etc.). In my recent post, "Towards specifications," I referenced a potential specification approach using OpenAPI 3.0 for the OHDSI service layer, which included a limited set of metadata for objects.

I’ve been doing additional research on what metadata standards we might hold ourselves to, and so far the most relevant guideline I have found is "A Proposed Standard for the Scholarly Citation of Quantitative Data."

In the paper, the authors suggest a minimum citation standard with six data elements:

We propose that citations to numerical data include, at a minimum, six required components. The first three components are traditional, directly paralleling print documents. They include the author(s) of the data set, the date the data set was published or otherwise made public, and the data set title. These are meant to be formatted in the style of the article or book in which the citation appears.

The author, date, and title are useful for quickly understanding the nature of the data being cited, and when searching for the data. However, these attributes alone do not unambiguously identify a particular data set, nor can they be used for reliable location, retrieval, or verification of the study. Thus, we add three components using modern technology, each of which is designed to persist even when the technology changes: a unique global identifier, a universal numeric fingerprint, and a bridge service. They are also designed to take advantage of the digital form of quantitative data.

For those who are interested, I suggest reviewing the complete paper. However, I would like to recommend that we adopt the minimum citation standard it proposes. It would meet the FAIR requirement suggested by @Juan_Banda, and it also provides guidance on including both GUID and hash components, as per our recent discussions that came out of @jon_duke’s thread on GitHub.

This would give us a metadata object requirement in the following form:

{
  "title": "name of object",
  "author": "name of author",
  "createDate": "date it was created",
  "uuid" : "unique identifier",
  "hash" : "universal numeric fingerprint",
  "uri" : "bridge service or universal resource locator"
}
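
As a rough illustration of how such an object could be populated, here is a minimal Python sketch. It is only a sketch under stated assumptions: `build_metadata`, `fingerprint`, and the `base_uri` default are hypothetical names, a random UUID stands in for the unique global identifier, and a SHA-256 digest of canonical JSON stands in for the universal numeric fingerprint (the paper describes its own fingerprint algorithm, which this does not implement).

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def fingerprint(data):
    """Return a SHA-256 hex digest of a canonical JSON form of `data`.

    Assumption: SHA-256 is a stand-in for the paper's universal numeric
    fingerprint, not an implementation of it.
    """
    # Sorting keys and fixing separators makes the digest independent of
    # key order and whitespace in the original serialization.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_metadata(title, author, data, base_uri="http://www.ohdsi.org"):
    """Build the six-element metadata object for `data`.

    `base_uri` is a hypothetical location for the bridge service.
    """
    object_id = str(uuid.uuid4())
    return {
        "title": title,
        "author": author,
        "createDate": datetime.now(timezone.utc).isoformat(),
        "uuid": object_id,
        "hash": fingerprint(data),
        "uri": f"{base_uri}/{object_id}",
    }
```

Calling `build_metadata("My concept set", "frank", {"items": []})` would return a dictionary with the six keys shown above. Hashing a canonical serialization, rather than the raw text, is one way to keep the fingerprint stable when the same object is written with different key order or whitespace.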

I think this is a good starting point, but I would suggest a few additions: allowing for multiple authors with some specific author details, a property for the version of the data object, and a property for the OHDSI specification version. Since the OHDSI specification could change over time, the latter would help identify which version of the specification an object complies with.

{
  "ohdsi": "1.0.0 - the ohdsi services specification version of this object",
  "title": "name of object",
  "authors": [
    {
      "name":"frank",
      "email":"f@foo.com",
      "publicKey":"gpgkey"
    }
  ],
  "citations": [
    "make use of the citation.js library"
  ],
  "createDate": "date it was created",
  "uuid" : "unique identifier",
  "version": 1,
  "hash" : "universal numeric fingerprint",
  "urn" : "http://www.ohdsi.org/uuid/version"
}
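
To make the shape of this extended object a little more concrete, here is a small Python sketch of a structural check. The function name `validate_metadata`, the choice of which fields are required, and the error messages are all my own assumptions for illustration; none of this is an agreed OHDSI rule. In this sketch `citations` is treated as optional.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "ohdsi": str,
    "title": str,
    "authors": list,
    "createDate": str,
    "uuid": str,
    "version": int,
    "hash": str,
    "urn": str,
}


def validate_metadata(meta):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected):
            problems.append(f"wrong type for field: {field}")

    # Each author entry is expected to be an object with at least a name.
    authors = meta.get("authors")
    if isinstance(authors, list):
        if not authors:
            problems.append("authors must not be empty")
        for index, author in enumerate(authors):
            if not isinstance(author, dict) or not author.get("name"):
                problems.append(f"authors[{index}] needs a name")
    return problems
```

A check like this only covers structure. It does not confirm that the `uuid` is well formed or that the `hash` matches the data.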

I’ve been having some offline conversations with @anthonysena and @gregk on the importance of metadata, so I look forward to their feedback as well as that of others in the community.

My proposal is to use this metadata as the standard for all data objects and require that all data objects that we create include a metadata property in this form in addition to the data itself.

{
  "metadata": "as described above",
  "data" : "unique specification for each data object"
}
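
One possible benefit of keeping the metadata next to the data is that a consumer could check the hash on receipt. The Python sketch below illustrates the idea. The names `wrap`, `verify`, and `content_hash` are hypothetical, and SHA-256 over canonical JSON is an assumed stand-in for whatever fingerprint we eventually agree on.

```python
import hashlib
import json


def content_hash(data):
    """SHA-256 hex digest of a canonical JSON form of `data` (assumed scheme)."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def wrap(metadata, data):
    """Return the envelope, recording the hash of `data` in the metadata."""
    return {
        "metadata": {**metadata, "hash": content_hash(data)},
        "data": data,
    }


def verify(obj):
    """Return True if the recorded hash matches the current data."""
    return obj["metadata"].get("hash") == content_hash(obj["data"])
```

With this layout, `verify(wrap({"title": "demo"}, {"items": [1, 2, 3]}))` is true, and it becomes false if the `data` portion is altered after wrapping.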

If we can come to an agreement on the metadata specification, we can then turn our attention to the data object specifications.

More information on the citation.js library and their defined schema for citations can be found here.
