
API to automate downloading vocab files?

Has there ever been any discussion about automating the request for vocab files? It is easy to fill in the web page, but we’re building a repeatable process around our vocabulary maintenance, and having the ability to grab the data when needed would be very useful.

@donohara:

YES, PLEASE. Including one that looks into the user tables and prevents sending those that require a license. Even better still, a more sophisticated user table that allows some users to register their licenses.

Currently, we have a so-called “back-door” way. If you have all the licenses I’ll give you an ssh-account and you can use the dumper (the script that creates the txt files) to pull out anything you want. But that is not something we can just open up.

Well, it is not the full solution, but I have a partial suggestion: can we have a pre-made zip file - refreshed when there is a new release - that only contains the non-licensed terminologies?

That would allow some automation.

@Vojtech_Huser

Good idea. Will do.

@Christian_Reich

I am new to this, so I am not sure what the proper mode of communication is; hence I am asking for information on an old post related to something I am looking for.

As of now we submit a form every time to get the vocab data, which includes some of the licensed vocab data as well. I am interested in automating this process and was looking for available solutions on this forum. The ssh account and data dumper script sounds like an ideal solution, and I was wondering what the process is to have this set up. Could you please help me with some information on the above?


@anoop.yamsani:

The SSH/dump.pl solution was a workaround for folks who couldn’t get the licensed vocabularies from the Athena website before we built in the mechanism to manage those permissions. It really isn’t safe, because it circumvents that mechanism. We could do two things:

  • Build an open-source scraper to go against Athena
  • Build an API

Any appetite?

Hey Christian! Thanks for the information! I would love to build an API, but at the moment we are swamped with work as we recently started using new tools and reworking our complete database. If and when we get done with the other work (which takes priority for now), I will start looking into building an API. It will take a while to get there.

Sorry to revive an old thread, but was there ever any progress made on this topic?

Tagging @gregk on this - ATHENA is the new tool for obtaining vocabulary files and I believe they are working towards a push model for the vocabulary as well. Not sure if there is a direct API call but that sounds like a good idea.

I guess a few things have happened since this original post:

  • ATHENA v2 was released. It is a completely revamped design and architecture.
  • ATHENA now requires a user login and keeps track of all downloads, as well as license sign-offs.

At this point, we have not discussed opening an API that would allow users to download vocabularies. It might not be a bad idea, but we need to think about it in the context of the second point above and the actual process of selecting and creating a bundle. Also, as @Frank mentioned above, we were entertaining a slightly different idea of pushing, or notifying users as a start, when new vocabularies become available based on their history.

Would you like to share how you envision such a download API working?

Would like to revive this thread to see if there has been any progress.

I would like to use GitHub/Travis to build synthetic datasets using Synthea and then convert them to OMOP, but I would need an automated way to get the vocab.

I would expect that we could set up the portal so that users can get an API token when they check the license approval box. Then they can use that token to make a download. These downloads are tracked in a log file.
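Just to make that concrete, here is a rough sketch of what the client side could look like - the endpoint, parameter names and token header are purely hypothetical, nothing like this exists in Athena today:

```python
# Hypothetical client for a token-based vocabulary download.
# Endpoint, query parameters and the bearer-token header are assumptions,
# not an existing Athena API.
import requests

DOWNLOAD_URL = "https://athena.ohdsi.org/api/v1/vocabularies/download"  # hypothetical
API_TOKEN = "token-issued-after-license-sign-off"                       # hypothetical

response = requests.get(
    DOWNLOAD_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"vocabularies": "SNOMED,LOINC,RxNorm", "cdmVersion": "5.4"},
    stream=True,
    timeout=300,
)
response.raise_for_status()

# Save the zip bundle; the server would log the download against the token.
with open("vocabulary_download.zip", "wb") as f:
    for chunk in response.iter_content(chunk_size=1 << 20):
        f.write(chunk)
```

The idea is that the token would carry the license sign-offs the user made in the portal, so the server could refuse any licensed vocabulary the account is not entitled to, while still logging every download.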


Hello. Can I check whether there is now a way of automating the download of vocabs, as suggested in this old thread? Thanks.

Hi @andysouth - that topic is still actively discussed; it is, however, a good chunk of work, and the community has had its hands full solving more pressing matters until now. It would be quite convenient to just trigger a download and add automated processing, but, for example, the CPT4 processing requires running a tool locally that complements the downloaded concepts, so that would have to be factored in as well… There are also license-restricted vocabularies to consider, especially if a license restriction was added after you put that vocabulary on your list of desired vocabs. Or the rare case of a vocab being renamed (it did happen in the past, always for a very good reason!).

So, I guess first we need a team of analysts to think this process through very thoroughly and then distill it into requirements, before we can find another team of developers to add that to the Athena set of capabilities. Is there a role that you could fill?

Hi @mik - many thanks for your reply, which helps me understand that the issue is not as straightforward as my naive expectations suggested. I had imagined a potential R function allowing the required vocabs to be specified. I’m relatively new to OMOP. I would be happy to contribute in future if I can fit it around my other commitments. Our current workflow to update our vocab metadata is to share a screenshot of ticked boxes and add any newly required ones.
For now it is good to know that I’m not missing an easy solution.
Best wishes, Andy

I think it would be nice if the capability existed to use an HTTP client with Basic Authentication to download these from a secure endpoint. Of course, the requestor would also have to add additional parameters to describe what they want (similar to the web interface), as @gregk mentioned in Feb of last year. I think enforcing authentication on these requests would also allow downloads to be limited/throttled to help prevent abuse.
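To illustrate, something along these lines - the URL and query parameters are invented for the sake of the example, and the throttling behaviour is just one way the server side could push back:

```python
# Sketch only: HTTP client with Basic Authentication against a hypothetical
# secure endpoint, with naive handling of server-side throttling (HTTP 429).
import time
import requests
from requests.auth import HTTPBasicAuth

ENDPOINT = "https://athena.ohdsi.org/api/v1/vocabularies/bundle"  # hypothetical

def download_bundle(user, password, vocabs, out_path="bundle.zip"):
    while True:
        resp = requests.get(
            ENDPOINT,
            auth=HTTPBasicAuth(user, password),
            params={"vocabularies": ",".join(vocabs)},  # describe the bundle, like the web form
            stream=True,
            timeout=300,
        )
        if resp.status_code == 429:  # throttled: honour Retry-After and try again
            time.sleep(int(resp.headers.get("Retry-After", "60")))
            continue
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        return out_path
```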

Regarding the CPT4 processing… I’m a bit new to OMOP; however, it looks like this process adds the CPT4 concepts to CONCEPT.csv after the file has been downloaded. Perhaps this could still be performed locally, along with other post-processing tasks.

The biggest use case I have for this type of functionality is data governance across multiple environments (whether local or in a shared workspace). The more we can perform programmatically, the less error-prone our processes become. I suppose we could maintain our own repository, but someone would have to do the work manually to keep our local copy synchronized with the ATHENA repository.

I would be pleased to help with this effort wherever possible (write code, frame the payload/serialization format, help create third-party R/Python/.NET libraries and usage examples to help others adopt this). I just might need someone to show me around and point me to where the contributions can be made 🙂

Sadly, this topic has been under discussion for 7 years now. It would be a very useful capability.

Guys, to reiterate - it is mostly not a technical issue; technically, it is all doable. It is a capacity issue. There is no unlimited funding for this infrastructure, and the download API capability (yes, very useful) - if we are not careful, or if it is misused - could very easily overload the server resources or consume a lot of expensive bandwidth. And there is already a precedent for that from a couple of years back.

Hopefully, we will figure out how to tackle these mostly non-technical issues very soon.

Understood. Lately I deal with similar challenges preventing abuse and runaway costs on a daily basis (for better or worse). It’s a lot more than just limiting the number of downloads per account.

Commenting on the other direction that was suggested, I wanted to add that the push mechanism would be equally helpful and an adequate solution. If we could set up a push to an S3 bucket or Azure storage account plus a webhook notification, we could just maintain our own hosted repositories. This would offload a lot of the cost to the data consumer. Initially I’m thinking this could be a batch process with no external triggers.
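On the consumer side I picture something as simple as the following - the bucket name and key layout are made up for illustration, and in the batch-only version a scheduled job would run it instead of a webhook:

```python
# Consumer-side sketch: pull the newest vocabulary release from a
# hypothetical S3 bucket that the batch push process writes into.
# Bucket name and key layout are assumptions, not anything that exists today.
import boto3

BUCKET = "example-omop-vocab-releases"   # hypothetical bucket owned by the data consumer
PREFIX = "athena/"                       # hypothetical layout, e.g. athena/<release-date>.zip

s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
if objects:
    latest = max(objects, key=lambda o: o["LastModified"])
    s3.download_file(BUCKET, latest["Key"], "latest_vocab_release.zip")
```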

Anyway, it sounds like there is already a discussion outside of this thread surrounding this functionality and a lot of history I’m not familiar with yet. I could probably still make some meaningful contributions if the opportunity is there. Otherwise, I’ll just share my comments on what my team would find helpful and hope it helps.

Thanks Greg!


Hey Zach - this is actually a better solution to the problem - both scalability- and control-wise - than allowing direct access to the API. We should definitely talk!


Hi all - it’s great to hear activity on this topic! Has it actually been 7 years?!
I’d love to chat more and add some elbow grease, if helpful.

Don
