Just wondering what folks are using to import the large OHDSI terminology tables into Databricks from the tab-delimited text files downloaded from Athena.
We’re currently using the API, but it’s taking a long time for the larger files, hours for concept_relationship (~2 GB) and concept_ancestor (~1.3 GB).
What sort of times are you seeing for these files, and what are you using to load them into Databricks?
I’m no expert on this, as our ETL is handled by a different team in my company, but I’d imagine a multipart S3 upload shouldn’t be too bad for files of that size, no?
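For what it’s worth, here’s a rough sketch of what I have in mind using boto3 (the bucket and key names are just placeholders); its managed transfer switches to multipart automatically once a file crosses the configured threshold:

```python
# Sketch of a multipart upload with boto3; bucket and key are hypothetical.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=8,                     # upload parts in parallel
)

# concept_relationship (~2 GB) goes up as parallel parts instead of one long stream
s3.upload_file(
    Filename="CONCEPT_RELATIONSHIP.csv",
    Bucket="my-vocab-bucket",              # placeholder bucket
    Key="athena/CONCEPT_RELATIONSHIP.csv",
    Config=config,
)
```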
On the Athena side, it seems like there is no way around manually downloading and uploading files whenever there is an update to the vocabulary tables. Is that correct?
Correct, unfortunately there’s no Athena API for creating a vocabulary bundle without manual interaction. Many people (me too) would like to use that for automation, and the question often pops up on the forum, but apparently this would require major changes in the current implementation.
Long story short, the best solution I came up with, inspired by the Databricks example, was storing the vocabulary files in an S3 bucket, downloading/extracting them using %sh magic commands in a cell, creating a DataFrame for each CSV in the extracted directory, and writing each one out with the DataFrame writer.
I can share my example with you if you’re interested. It’s much less complex than the Databricks example, plus it works in the community edition.
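Roughly, the notebook looks like this. It’s a minimal sketch: the bucket name, local paths, and target database (`vocab`) are placeholders, not the actual values I use.

```python
# Step 1 is a %sh cell that downloads and extracts the Athena bundle, e.g.:
#   %sh
#   aws s3 cp s3://my-vocab-bucket/athena/vocabulary.zip /tmp/vocab/
#   unzip -o /tmp/vocab/vocabulary.zip -d /tmp/vocab/extracted

import os

local_dir = "/tmp/vocab/extracted"

spark.sql("CREATE DATABASE IF NOT EXISTS vocab")

# Step 2: one DataFrame per file, written out with the DataFrame writer.
# Athena's files are tab-delimited even though they end in .csv.
for name in os.listdir(local_dir):
    if not name.lower().endswith(".csv"):
        continue
    table_name = name[:-4].lower()            # CONCEPT.csv -> concept
    df = (spark.read
          .option("header", True)
          .option("sep", "\t")
          .csv(f"file:{local_dir}/{name}"))   # driver-local path is fine on a single-node cluster
    df.write.mode("overwrite").saveAsTable(f"vocab.{table_name}")
```

On a multi-node cluster you’d copy the extracted files to DBFS first instead of reading the driver’s local filesystem, but for the community edition the above is enough.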
Hi Philip, good point. The example runs faster if you have the compute in your own workspace. If you’re working in the community edition of Databricks, be prepared to wait about 15 minutes.
A quick note about managing these tables: I haven’t been able to import the CSVs into Usagi when they’re exported from the notebook using the GUI. Usagi appears to want a tab-delimited file (TSV), but Databricks writes comma-separated CSV by default. If you build your own pipeline to bring these into Usagi, make sure you export with the correct delimiter via the DataFrame writer.
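For example, something like this should produce a tab-delimited export (the table name and output path are placeholders); the key is the `sep` option on the DataFrame writer:

```python
# Export a vocabulary table as tab-delimited so Usagi can read it.
# coalesce(1) yields a single output file; the path is a placeholder.
(spark.table("vocab.concept")
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", True)
 .option("sep", "\t")   # Spark defaults to "," unless you set this
 .csv("dbfs:/FileStore/exports/concept_tsv"))
```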