@hripcsa @Patrick_Ryan @Christian_Reich @aostropolets @mik @gregk @agolozar @jmethot @Andrew @Adam_Black @Paul_Nagy @Jake @kzollove
I’ve done some asking around, and those who were tagged were either suggested for this conversation or asked to be included.
I’m looking to gain some insight and start a conversation about the future of contributing to the OHDSI vocabularies after the ‘vocabulary shakeup’ (that is, the Odysseus vocab team narrowing its scope) has been completed. I’m not sure whether these conversations are already happening, and if they are, I’d love to be looped in.
For context, we (Minderoo Foundation) are potentially interested in taking up the role of facilitating vocabulary development in the Oncology space, but we are looking for more context on what that might entail before signing up.
One aspect in particular I’m curious about is the feasibility of taking this opportunity to transition the OHDSI vocabulary management system into a more transparent, open source approach, and whether others consider that to be as valuable as I do.
Right now, as I understand it, the bottleneck is that a small team is handling too much, and clearly that shouldn’t be the case. My concern is that keeping the same system while relying more on community contributions, with the same process of “submit a ticket with the information and we’ll take care of it”, still burdens that same group with the processing and may lead to further bottlenecks down the road. It also leaves the lack of transparency mentioned above unaddressed.
What I mean by that is that the current repo contains the code to build the vocabularies but stores neither the source data nor the output. If you wanted to make a simple fix, say changing a mapping that is clearly incorrect, you can’t simply open the data containing the CONCEPT_RELATIONSHIP records, make an edit, and create a pull request.
For a community that is almost entirely open source, having so much “under the hood” is quite an exception, but its existence makes sense given the constraints. The wonderful vocabulary team has put many mechanisms in place to curate this data, validate it, and populate the tables around it (e.g. CONCEPT_ANCESTOR) during the build process, and I’m sure there are many other pieces I don’t comprehend.
To get to the point: would it be entirely impossible to create a repository holding the output of the current repository (the data you would get if you pulled from ATHENA) as a starting point, and allow pull requests that change CONCEPT and CONCEPT_RELATIONSHIP (and other tables if needed), which would then be reviewed by the powers that be (mods)? There could potentially be GitHub Actions that perform data quality checks on the PRs, as well as other actions or similar for things like rebuilding the CONCEPT_ANCESTOR table.
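To make that a bit more concrete, here is a rough sketch of the kind of check and rebuild step such an action might run. Everything in it is an assumption for illustration: the sample rows are made up, the columns are a reduced subset of the real tab-delimited ATHENA files, the function names are hypothetical, and the ancestor logic only follows "Is a" links and only computes the minimum levels of separation. It is not how the vocabulary team's actual build works.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data in a reduced, tab-delimited layout (illustrative only).
CONCEPT_TSV = (
    "concept_id\tconcept_name\n"
    "1\tNeoplasm\n"
    "2\tCarcinoma\n"
    "3\tAdenocarcinoma\n"
)
REL_TSV = (
    "concept_id_1\tconcept_id_2\trelationship_id\n"
    "2\t1\tIs a\n"
    "3\t2\tIs a\n"
    "3\t99\tMaps to\n"  # 99 does not exist in CONCEPT, so this row should be flagged
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_dangling_relationships(concepts, relationships):
    """Return relationship rows that reference a concept_id missing from CONCEPT."""
    known = {row["concept_id"] for row in concepts}
    return [
        r for r in relationships
        if r["concept_id_1"] not in known or r["concept_id_2"] not in known
    ]


def build_min_ancestor_levels(relationships):
    """Map (ancestor_id, descendant_id) to the minimum number of 'Is a' hops.

    Simplifying assumption: only 'Is a' rows define ancestry here.
    """
    parents = defaultdict(set)
    for r in relationships:
        if r["relationship_id"] == "Is a":
            parents[r["concept_id_1"]].add(r["concept_id_2"])

    levels = {}
    for child in parents:
        # Breadth-first walk upwards, one level of parents at a time.
        frontier, seen, level = {child}, {child}, 0
        while frontier:
            level += 1
            frontier = {p for c in frontier for p in parents.get(c, ())} - seen
            for ancestor in frontier:
                levels[(ancestor, child)] = level
            seen |= frontier
    return levels


if __name__ == "__main__":
    concepts = read_tsv(CONCEPT_TSV)
    relationships = read_tsv(REL_TSV)
    for row in find_dangling_relationships(concepts, relationships):
        print("Dangling relationship:", row)
    for pair, level in sorted(build_min_ancestor_levels(relationships).items()):
        print(pair, level)
```

In a PR workflow, a check like the first function could fail the build when it returns any rows, while the second hints at how a derived table could be regenerated after a merge rather than edited by hand.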
BUT - the main issue, as I understand it, is that you can’t store large data in a GitHub repo: the gigabytes of data that make up the computed standard vocabularies, equivalent to what was described above, are not feasible to store there, given the 5GB limit.
Or… is it? While discussing this with a colleague, he brought up one potential solution: leveraging “datalad”, and consequently git-annex, which could potentially fit the bill: https://www.datalad.org/
Relevant piece of the docs: “Delineation from related solutions” (DataLad 0.18.2 documentation)
I haven’t dug deep enough into the documentation to confirm that it has all of the needed functionality, but at a glance it looks quite promising. That is, should others agree this is a valid and worthwhile endeavor for the community to pursue.