
Seeking clarity on process for modifying non-"core" vocabularies moving forward

@hripcsa @Patrick_Ryan @Christian_Reich @aostropolets @mik @gregk @agolozar @jmethot @Andrew @Adam_Black @Paul_Nagy @Jake @kzollove

I’ve done some asking around, and those tagged were either suggested for this conversation or asked to be included.

I’m looking to gain some insight and start a conversation about the future of contributing to the OHDSI vocabularies after the ‘vocabulary shakeup’ (that is, the Odysseus vocab team narrowing its scope) has been completed. I’m not sure whether these conversations are already happening, and if they are, I’d love to be looped in.

For context, we (Minderoo Foundation) are potentially interested in taking up the role of facilitating vocabulary development in the Oncology space, but we are looking for more context as to what that might entail before signing up.

One aspect in particular I’m curious about is the feasibility of taking this opportunity to transition the OHDSI vocabulary management system into a more transparent, open source approach, and whether others consider that to be as valuable as I do.

Right now, as I understand it, the bottleneck is that a small team is handling too much, and clearly that shouldn’t be the case. My concern is that keeping the same system while relying more on community contributions, with the same “submit a ticket with the information and we’ll take care of it” process, still burdens that same group with processing and may lead to further bottlenecks down the road, while also preserving the lack of transparency mentioned above.

What I mean by that is: the current repo contains the code to build the vocabularies but stores neither the source data nor the output. If one were trying to make a simple fix, say, changing a mapping that is clearly incorrect, you can’t simply open the data containing the CONCEPT_RELATIONSHIP records, make an edit, and create a pull request.

For a community that is almost entirely open source, it is quite an exception to have so much “under the hood”, but its existence makes sense given the constraints. There are many mechanisms the wonderful vocabulary team has put in place to curate this data, validate it, and populate the tables around it (e.g. CONCEPT_ANCESTOR) during the build process, and I’m sure many additional pieces I don’t comprehend.

To get to the point: would it be entirely impossible to create a repository of the output of the current repository (the data you would get if you pulled from ATHENA) as a starting point, and allow pull requests to change CONCEPT and CONCEPT_RELATIONSHIP (and others if needed) that would then be reviewed by the powers that be (mods)? There could potentially be GitHub Actions to perform data quality checks on the PRs, as well as other actions for tasks like rebuilding the CONCEPT_ANCESTOR table.
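To make the idea of automated PR checks concrete, here is a minimal sketch of the kind of referential-integrity check a CI job could run against an edited file. The column names follow the OMOP CDM, but the file contents, the function name, and the check itself are toy illustrations, not the vocabulary team's actual validation logic:

```python
import csv
import io

# Toy stand-ins for the real CONCEPT and CONCEPT_RELATIONSHIP tables;
# in an actual Action these would be read from the repository's files.
CONCEPT_CSV = """concept_id,concept_name
1,Neoplasm of lung
2,Malignant neoplasm of lung
"""

CONCEPT_RELATIONSHIP_CSV = """concept_id_1,concept_id_2,relationship_id
2,1,Is a
2,99,Maps to
"""

def check_relationships(concept_csv, relationship_csv):
    """Return error messages for relationship rows that reference
    concept_ids missing from the CONCEPT table."""
    known_ids = {row["concept_id"]
                 for row in csv.DictReader(io.StringIO(concept_csv))}
    errors = []
    for i, row in enumerate(csv.DictReader(io.StringIO(relationship_csv)), 1):
        for col in ("concept_id_1", "concept_id_2"):
            if row[col] not in known_ids:
                errors.append(f"row {i}: {col}={row[col]} not in CONCEPT")
    return errors

for e in check_relationships(CONCEPT_CSV, CONCEPT_RELATIONSHIP_CSV):
    print(e)
# prints: row 2: concept_id_2=99 not in CONCEPT
```

A CI job would exit non-zero when errors are found, blocking the PR until the submitter fixes the offending rows.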

BUT: the main issue, as I understand it, is that you can’t store large data in a GitHub repo, so the gigabytes of data that comprise the computed standard vocabularies described above are not feasible to store there, given the 5GB limit.

Or… is it? In discussing this with a colleague, he brought up one potential solution: leveraging “datalad”, and consequently git-annex, which could potentially fit the bill: https://www.datalad.org/

Relevant piece of the docs: “Delineation from related solutions” in the DataLad documentation

I haven’t dived deep enough into the documentation to validate that it indeed has all of the needed functionality, but at a glance it looks quite promising. That is, should others agree this is a valid and worthwhile endeavor for the community to pursue.

@rtmill once there is guidance, you will be among the first people to learn about it :slight_smile:

As you know, we have been conducting a landscape assessment of the community needs (thanks for contributing to it!) and so far have a pretty amazing cohort: more than 180 people responding to the survey, >50 data sources covered, and numerous interviews. Now we need to thoroughly analyze the data to determine our strategy moving forward. So stay tuned!

Going back to the notion of pull requests and open source. One of the top priorities indeed is increased transparency of the process and content.

Is storing the content on GitHub and handling its changes through pull requests feasible?
Right now we have a build system where we download the source vocabularies (100+), store them, process them to harmonize them into the OMOP Vocabularies format, run integrity and semantic checks, assign IDs, build the hierarchy (as it is our own hierarchy, we must dynamically rebuild it with each release to ensure integrity), and only then release the tables you download from Athena.
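One reason the hierarchy must be rebuilt with each release is that the ancestor table is, roughly speaking, a transitive closure of the hierarchical links, so changing a single relationship can alter many ancestor rows. A toy sketch of that idea, with made-up concept IDs and a simplified function (not the actual build code):

```python
from collections import defaultdict

# Toy hierarchical 'Is a' edges (descendant -> ancestor), made-up IDs.
is_a = [(3, 2), (2, 1), (4, 2)]

def build_concept_ancestor(edges):
    """Compute the transitive closure of the hierarchy: every
    (descendant, ancestor) pair, analogous to CONCEPT_ANCESTOR."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    closure = set()
    for node in list(parents):
        stack = list(parents[node])
        while stack:
            anc = stack.pop()
            if (node, anc) not in closure:
                closure.add((node, anc))
                stack.extend(parents.get(anc, ()))
    return closure

print(sorted(build_concept_ancestor(is_a)))
# prints: [(2, 1), (3, 1), (3, 2), (4, 1), (4, 2)]
```

Note how the single edge (2, 1) contributes three closure rows; this is why a community edit to one relationship can't simply be merged as-is but has to trigger a rebuild of the derived tables.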

Having large files on GitHub is one problem. The fact that the vocabularies are constantly updated from their respective sources (the concepts and relationships that we further process) is another. Building our own relationships and hierarchy is yet another. It would be very interesting to dive into the details of how you envision your process working, given the complexity of the system. If you have examples of other open-source ontologies that work this way, they would also come in handy.

Thanks Robert and then Anna. Agreed. I think the point is not specifically GitHub or not, but achieving an open and usable platform.

@hripcsa I agree.

@aostropolets Thank you for the thorough response. I’m looking forward to the discussions when that time comes.

Regarding the complexity of the operations that go on behind the scenes of the current implementation, are there any documents available that could be shared to provide insight and consequently help gauge potential solutions?

Yes, I remember we owe you (and the community) one :slight_smile: . I think the overview of the development process, albeit not in text format, can be seen here. We of course need to translate it into proper documentation.

From these basics common to all vocabularies, we go into vocabulary-specific procedures that are sometimes unique to a domain/vocabulary. For example, incorporation of non-US drugs is performed using a systematic approach described here to ensure that we have standard representations for the messy attributes of disparate international drugs (proper ingredient names, nice dose forms, all in English, etc.), that those attributes are linked to the drug codes appropriately (to mimic the RxNorm structure) and, finally, that we build a cohesive RxNorm-RxNorm Extension hierarchy.

In fact, the drug domain can be used as an example in our future discussions, as we have some documentation on the dev process and it is fairly well established.
