I wanted to share some important information about OHDSI tool stack support for Google Cloud Platform (GCP).
The ATLAS 2.7.7 maintenance build was recently released. Among other things, it contains a number of important fixes that enable better support for Google BigQuery. Organizations utilizing Google BigQuery are strongly encouraged to upgrade ATLAS to this version. https://github.com/OHDSI/Atlas/tree/v2.7.7
Over the past couple of months, the Odysseus team worked collaboratively with multiple OHDSI collaborators to ensure that the core OHDSI R libraries and methods are updated to work with Google BigQuery. There were a number of fixes across multiple important shared libraries, including DatabaseConnector, SqlRender, and others.
While providing support during the execution of COVID-19 studies with our Stanford partners, we identified a number of Google BigQuery limits that required complex workarounds or patches during code execution. We provided this feedback to the Google team and are now happy to share that Google listened and removed, or significantly raised, multiple limits, including the number of concurrent inserts and table metadata updates. As a result of this close collaboration, Stanford is able to successfully participate in COVID-19 related studies.
Over the past few months, the OHDSI community has been having a continuous discussion about the need for more thorough testing of the OHDSI tools (ATLAS), shared components, and study packages. One of the key requirements for such testing is access to the supported database platform. Google has offered to provide the OHDSI community with funding to be used for code testing against Google BigQuery, and Odysseus and Google have stood up a shared environment that can be used for this purpose.
Also, we are very excited to share that Odysseus has created a new OHDSI ATLAS 2.7.7 image, which will shortly be available in the Google Marketplace for free download and installation (pending review).
Tangential issue re BigQuery support, and hoping this community can provide input on best practice. What are other teams doing?
BQ does not have a concept of a primary key or auto-increment for relational model support. All of the primary keys in the OMOP DDL are defined as INT64s. Without an auto-increment function, we are left trying workarounds for large batch/streaming/parallel inserts. We could use a 64-bit hash function over some fields (FARM_FINGERPRINT), but we would then have to manage hash collisions on the fly. The easier and more robust option is to change the DDL, make these primary keys all strings, and use the GENERATE_UUID function. But at what expense - will this break any analytics down the road?
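To make the two options concrete, here is a minimal sketch against the public CMS synthetic OMOP dataset; the choice of fields to hash is only an illustration, not a recommendation:

-- Option 1: hash identifying fields into an INT64 key (collisions must still be detected and handled)
SELECT
  FARM_FINGERPRINT(CONCAT(CAST(person_id AS STRING), '|',
                          CAST(procedure_concept_id AS STRING), '|',
                          CAST(procedure_date AS STRING))) AS procedure_occurrence_id
FROM
  `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`;

-- Option 2: change the DDL to STRING keys and generate UUIDs
SELECT
  GENERATE_UUID() AS procedure_occurrence_id
FROM
  `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`;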
Hoping to not be first in wondering about this.
Thanks!
For those who have had this problem before - what has been your approach? Did you convert to strings and use UUID? In what way did it impact analytical tools such as ATLAS?
This took me a while to separate from our main code, but here it is; there is no repo for it either. I used a BigQuery public dataset for demonstration. If you have any questions, feel free to ask them here. I hope it helps. I believe we have tested this method for producing primary keys on up to 40B rows. The code was originally written by one of our engineers, and I adapted it to post here.
-- Step 1: create a demo "CDM table" from a BigQuery public dataset,
-- carrying a temporary STRING key generated with GENERATE_UUID().
CREATE OR REPLACE TABLE
  `som-rit-starr-training.jdposada_explore.cdm_table` AS (
  SELECT
    procedure_concept_id,
    GENERATE_UUID() AS original_key
  FROM
    `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` );

-- Step 2: replace the UUID key with a unique INT64 key built from a random
-- partition id plus the row number within that partition.
WITH
  cdm_table AS (
  SELECT
    *
  FROM
    `som-rit-starr-training.jdposada_explore.cdm_table` ),
  -- assign each row a random partition id between 0 and 999,999,999
  with_random AS (
  SELECT
    CAST(FLOOR(RAND()*1000000000) AS STRING) AS r,
    original_key
  FROM
    cdm_table ),
  -- number the rows within each random partition
  random_w_rownum AS (
  SELECT
    r,
    original_key,
    ROW_NUMBER() OVER (PARTITION BY r) AS seq
  FROM
    with_random ),
  -- number of digits needed to represent the largest row number
  random_w_rownum_maxrc AS (
  SELECT
    LENGTH(CAST(MAX(seq) AS STRING)) AS max_psize_digit
  FROM
    random_w_rownum ),
  random_w_rownum_ps AS (
  SELECT
    *
  FROM
    random_w_rownum a,
    random_w_rownum_maxrc b ),
  -- final key = zero-padded partition id concatenated with zero-padded row number
  partitioned_by AS (
  SELECT
    original_key,
    r,
    CAST(CONCAT(FORMAT("%09d", CAST(r AS INT64)),
                FORMAT(CONCAT("%0", CAST(max_psize_digit AS STRING), "d"), seq)) AS INT64) AS final_key
  FROM
    random_w_rownum_ps )
SELECT
  final_key AS procedure_occurrence_id,
  cdm_table.* EXCEPT(original_key)
FROM
  cdm_table
INNER JOIN
  partitioned_by
ON
  cdm_table.original_key = partitioned_by.original_key
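A quick sanity check is to compare the total row count with the number of distinct generated keys; the two should match if the keys are unique. Assuming the output of the query above has been written to a (hypothetical) table called cdm_table_final, the check would look like:

SELECT
  COUNT(*) AS n_rows,
  COUNT(DISTINCT procedure_occurrence_id) AS n_distinct_keys
FROM
  `som-rit-starr-training.jdposada_explore.cdm_table_final`;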
Thank you @jposada.
Looks like this is a two-step process; please confirm. You originally load your OMOP tables with a UUID STRING as the key, likely generated on the fly. Once loaded, you go over the entire table and replace that key field with a generated random and unique INT64. Is that correct?
Do you maintain 2 datasets then? One with UUIDs for keys as part of data ingestion, and one with INT64s for keys for analytics?
Hi @gregk and @jposada, I’m trying to connect to BigQuery using connection details. I’m able to get it to work if I have permissions to create a service account. Has anyone been able to get DatabaseConnector to work with BigQuery when using a Google Account?
I have another question: have you run into issues with casing with OHDSI packages and BigQuery? From my understanding, BigQuery is case-sensitive, but I’ve noticed that when using CohortDiagnostics, it calls DatabaseConnector::renderTranslateQuerySql, which renders and then translates the SQL. I have a schema in BigQuery that is in upper case, but when translate is run, the schema name in the query is converted to lower case. Have you experienced this? When I manually call translate and then render, I get the expected SQL with the proper casing.
Tagging @schuemie and @Gowtham_Rao since they know DatabaseConnector and CohortDiagnostics. Thanks.
Yes, there is an issue with casing for table names. It usually has not bitten me because we made the internal decision to lowercase everything we produce in BQ for exactly this reason. This decision was made not only for OHDSI but for all our other Research Data Warehouse assets.
That makes sense. I unfortunately do not have that option. I’m trying to get OHDSI tools working in the All of Us Research Workbench (and we’re very close). This is the last hurdle (I think). SqlRender handles cases from what I can tell; it’s just translate that does not.
Same for Colorado as for Stanford – we moved to all lower-case table names for the same reason. For some logic that only Googlers understand, GBQ is case sensitive for table names but not for column names. Go figure…
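To illustrate (the dataset and table names here are just placeholders):

-- Assuming a dataset my_dataset containing a table named person (all lower case):
SELECT person_id FROM `my_dataset.person`;   -- works
SELECT person_id FROM `my_dataset.PERSON`;   -- fails: table names are case sensitive
SELECT PERSON_ID FROM `my_dataset.person`;   -- works: column names are not case sensitive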