Hi, just an observation from my side: I have tried the R scripts from the ETL-Synthea repository, and while they work great in general, for BigQuery they take excruciatingly long.
I ended up uploading the individual vocabulary files to a storage bucket and using the Cloud Shell command `bq load` to load them into BigQuery from there (e.g. `bq load --field_delimiter=tab --source_format=CSV --skip_leading_rows=1 [your dataset].concept_class gs://[your bucket path]/CONCEPT_CLASS.csv`).
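For anyone who wants to copy and paste, here is a minimal sketch of that flow; the bucket, dataset, and local paths below are placeholders, not my real names:

```
# Copy the unpacked Athena vocabulary files into a GCS bucket
gsutil cp ./vocabulary/*.csv gs://your-bucket/vocab/

# Load one tab-delimited file into the matching (already created) OMOP table
bq load \
  --field_delimiter=tab \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your_dataset.concept_class \
  gs://your-bucket/vocab/CONCEPT_CLASS.csv
```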
Two things:
1. For the concept_synonym table you need to state explicitly that no quoting is used, via the option --quote="".
2. The date conversion does not work out of the box. For the concept, concept_relationship and drug_strength tables I built temporary staging tables in which valid_start_date and valid_end_date are STRING fields. After importing into these staging tables I copied the content over to the actual target tables and converted the date fields along the way (e.g. `parse_date("%Y%m%d", valid_start_date) AS valid_start_date`); see the sketch after this list.
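To make the second point more concrete, here is a rough sketch of that staging approach for the concept table. The dataset, bucket, and staging-table names are placeholders, and it assumes the target concept table already exists with the standard CDM v5 column order:

```
# Load CONCEPT.csv into a staging table whose date columns are plain STRINGs
bq load \
  --field_delimiter=tab \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your_dataset.concept_stage \
  gs://your-bucket/vocab/CONCEPT.csv \
  concept_id:INTEGER,concept_name:STRING,domain_id:STRING,vocabulary_id:STRING,concept_class_id:STRING,standard_concept:STRING,concept_code:STRING,valid_start_date:STRING,valid_end_date:STRING,invalid_reason:STRING

# Copy into the real table, parsing the YYYYMMDD strings into DATE values
bq query --use_legacy_sql=false '
INSERT INTO your_dataset.concept
SELECT
  concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
  standard_concept, concept_code,
  PARSE_DATE("%Y%m%d", valid_start_date) AS valid_start_date,
  PARSE_DATE("%Y%m%d", valid_end_date) AS valid_end_date,
  invalid_reason
FROM your_dataset.concept_stage'
```

The same pattern applies to concept_relationship and drug_strength, each with its own column list.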
Hope this helps, but I am sure there are others with even more sophisticated approaches.
~mik
@pratik Our team at the University of Colorado uses a Python script to download the vocab CSV files, Composer jobs (also Python scripts) to upload the vocab CSV files to a GCS bucket, and Composer to move them from GCS to GBQ. This is a weekly recurring automated job.
I have now told you everything I know in one sentence! For GCP-specific technical details that might actually be useful to you, please reach out to Bujia Zhang (Bujia.Zhang@cuanschutz.edu) in our group.
@pratik - I realized that there is already quite a bit of vocabulary-load logic in the MIMIC repository on the OHDSI GitHub (of all people, I should have known). Maybe that provides some additional inspiration. Probably similar, to some degree, to what CU does, @mgkahn?