Hi, just an observation from my side: I have tried the R scripts from the ETL-Synthea repository, and while they work great in general, for BigQuery they take excruciatingly long.
I ended up uploading the individual vocabulary files to a storage bucket and using the Cloud Shell command `bq load` to load them into BigQuery from there (e.g. `bq load --field_delimiter=tab --source_format=CSV --skip_leading_rows=1 [your dataset].concept_class gs://[your bucket path]/CONCEPT_CLASS.csv`).
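For anyone who wants to copy and paste, here is a minimal sketch of that flow; the bucket, dataset, and local paths below are placeholders, not my real names:

```
# Copy the unpacked Athena vocabulary files into a GCS bucket
gsutil cp ./vocabulary/*.csv gs://your-bucket/vocab/

# Load one tab-delimited file into the matching (already created) OMOP table
bq load \
  --field_delimiter=tab \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your_dataset.concept_class \
  gs://your-bucket/vocab/CONCEPT_CLASS.csv
```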
Two things:
1. For the concept_synonym table you need to state explicitly that no quoting is used, via the option --quote="".
2. The date conversion does not work out of the box. For the concept, concept_relationship and drug_strength tables I built temporary staging tables in which valid_start_date and valid_end_date are STRING fields. After importing into these staging tables I copied the content over to the actual target tables and converted the date fields along the way (e.g. `parse_date("%Y%m%d", valid_start_date) AS valid_start_date`); see the sketch after this list.
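To make the second point more concrete, here is a rough sketch of that staging approach for the concept table. The dataset, bucket, and staging-table names are placeholders, and it assumes the target concept table already exists with the standard CDM v5 column order:

```
# Load CONCEPT.csv into a staging table whose date columns are plain STRINGs
bq load \
  --field_delimiter=tab \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your_dataset.concept_stage \
  gs://your-bucket/vocab/CONCEPT.csv \
  concept_id:INTEGER,concept_name:STRING,domain_id:STRING,vocabulary_id:STRING,concept_class_id:STRING,standard_concept:STRING,concept_code:STRING,valid_start_date:STRING,valid_end_date:STRING,invalid_reason:STRING

# Copy into the real table, parsing the YYYYMMDD strings into DATE values
bq query --use_legacy_sql=false '
INSERT INTO your_dataset.concept
SELECT
  concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
  standard_concept, concept_code,
  PARSE_DATE("%Y%m%d", valid_start_date) AS valid_start_date,
  PARSE_DATE("%Y%m%d", valid_end_date) AS valid_end_date,
  invalid_reason
FROM your_dataset.concept_stage'
```

The same pattern applies to concept_relationship and drug_strength, each with its own column list.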
Hope this helps, but I am sure there are others with even more sophisticated approaches.
~mik
@pratik Our team at the University of Colorado uses a Python script to download the vocab CSV files, Composer jobs (also Python scripts) to upload the vocab CSV files to a GCS bucket, and Composer to move them from GCS to GBQ. This is a weekly recurring automated job.
I have now told you everything I know in one sentence! For GCP-specific technical details that might actually be useful to you, please reach out to Bujia Zhang (Bujia.Zhang@cuanschutz.edu) in our group.
@pratik - I realized that there is already quite a bit of vocabulary-load logic in the MIMIC repository on the OHDSI GitHub (of all people, I should have known). Maybe that provides some additional inspiration. Probably similar, to some degree, to what CU does, @mgkahn?