Hi @mjg. The All of Us Research Program relies on regular updates to the vocabulary for two disparate data feeds: survey data and EHR data. Ultimately these two data feeds wind up in the same data product, so they must be associated with a single vocabulary release (i.e. version).
The project develops several survey instruments, which are modeled using the OMOP vocabulary in a custom terminology called the PPI. An updated release is issued ~4 times a year, although recently there might’ve been more frequent updates due to new COVID surveys.
AOU depends on several terminologies in order to consume EHR data submitted by dozens of healthcare organizations around the country. When we first started, we endeavored to refresh the vocabulary maybe twice a year to keep things up to date, but we’ve had to refresh more frequently at times (e.g. whenever desirable COVID-related concepts got added).
One downside we’ve identified is that we need to do a full refresh of the vocabulary every time. My understanding is that this is because there may be intricate relationships that could be difficult to tease out if we wanted to, say, just update the PPI and nothing else. Much to their credit, Odysseus did recently give us some SQL scripts to try such an “in-place” update, but we have not had a chance to try this out yet.
-
Has any one done historical tracking of the vocabularies? Versioning each download? Audit history?
We keep all historical versions of the vocabularies that have been in use. We also have internal concepts that never make it to Athena; we automatically assign a version (i.e. in vocabulary.vocabulary_version) by creating a hash out of the input data.
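To make the hash-as-version idea concrete, here is a minimal sketch in Python. It is not our actual implementation: the function name, the choice of SHA-256, the 12-character truncation, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def vocabulary_version(rows):
    """Derive a deterministic version string from vocabulary input rows.

    `rows` is an iterable of tuples of strings (hypothetical shape).
    SHA-256 and the 12-character truncation are illustrative choices.
    """
    digest = hashlib.sha256()
    # Sort first so the version does not depend on row order in the input.
    for row in sorted(rows):
        digest.update("\t".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


# Made-up internal concepts, for illustration only.
rows = [
    ("2000000001", "Example internal concept A"),
    ("2000000002", "Example internal concept B"),
]
version = vocabulary_version(rows)
```

The useful property is that identical input always yields the same version, and any change to the input data yields a different one, so the version never has to be bumped by hand.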
-
Do you download all vocabularies every single time or have you created separate downloads?
We download all vocabularies every single time. We used to rely on a single “bundle” in Athena. It automatically gets archived on the website, so we would “restore” it. But we recently discovered that some of the OMOP Extension concepts had disappeared from the download and would only reappear after we created an identical bundle in Athena (i.e. with the same vocabularies checked off).
-
Do you keep track of any addition/changes between vocabulary updates?
As you can imagine, for the PPI we are actively tracking changes because we are authoring the terminology itself. There are changes we wish we kept better track of. For example, there is an implicit hierarchy dictated by concept_relationship records associated with PPI; we are not sure how well this is being maintained. The same goes for relationships between questions and answers. There are also instances where codes the PPI team submitted for addition were not compliant with OMOP vocabulary / Odysseus practices and so they had to be modified.
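One way to keep better track of concept_relationship changes would be to diff the records between two releases. The sketch below is only an illustration, not something we run today. It assumes tab-delimited input with concept_id_1, concept_id_2 and relationship_id columns (as I understand the Athena CSV downloads to be laid out); the function names and sample records are hypothetical.

```python
import csv
import io

# Columns assumed to identify a relationship record.
KEY_FIELDS = ("concept_id_1", "concept_id_2", "relationship_id")


def load_relationships(text):
    """Parse tab-delimited concept_relationship text into a set of keys."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {tuple(row[field] for field in KEY_FIELDS) for row in reader}


def diff_relationships(old_text, new_text):
    """Return the relationship keys added and removed between two releases."""
    old = load_relationships(old_text)
    new = load_relationships(new_text)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Made-up records, for illustration only.
OLD = "concept_id_1\tconcept_id_2\trelationship_id\n1\t2\tIs a\n3\t4\tIs a\n"
NEW = "concept_id_1\tconcept_id_2\trelationship_id\n1\t2\tIs a\n5\t6\tIs a\n"
changes = diff_relationships(OLD, NEW)
```

Restricting the input to PPI records before diffing would show whether the implicit hierarchy or the question-to-answer relationships shifted between releases.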
-
Are you updating right away? Do you download on a cadence?
We do not update right away. There have been occasions where the data validation routines we apply to EHR data submitted to us would be affected by a vocabulary update. It’s also kinder of us to ensure that sites submitting data have access to the vocabulary release we will be using to evaluate their submissions. No magic bullet here (that I know of), just careful communication.
Ok I’m going to stop myself here. Happy to do a web call or something if that might be useful.