I wanted to share some important information about OHDSI tool stack support for Google Cloud Platform (GCP).
The ATLAS 2.7.7 maintenance build was recently released. Among other things, it contains a number of important fixes that enable better support for Google BigQuery. Organizations utilizing Google BigQuery are strongly encouraged to upgrade ATLAS to this version. https://github.com/OHDSI/Atlas/tree/v2.7.7
Over the past couple of months, the Odysseus team worked collaboratively with multiple OHDSI collaborators to ensure that the core OHDSI R libraries and methods are updated to work with Google BigQuery. There were a number of fixes across multiple important shared libraries, including DatabaseConnector, SqlRender, and others.
While providing support during the execution of COVID-19 studies with our Stanford partners, we identified a number of Google BigQuery limits that required complex workarounds or patches during code execution. We provided this feedback to the Google team and are now happy to share that Google listened and removed, or significantly raised, multiple limits, including the number of concurrent inserts and table metadata updates. As a result of this close collaboration, Stanford is able to successfully participate in COVID-19 related studies.
Over the past few months, the OHDSI community has been having a continuous discussion about the need for more thorough testing of the OHDSI tools (ATLAS), shared components, and study packages. One of the key requirements for such testing is access to the supported database platform. Google has offered to provide the OHDSI community with funding to be used for code testing against Google BigQuery, and Odysseus and Google have stood up a shared environment that can be used for this purpose.
Also, we are very excited to share that Odysseus has created a new OHDSI ATLAS 2.7.7 image, which will shortly be available in the Google Marketplace for free download and installation (pending review).
Tangential issue re BigQuery support, and hoping this community can provide input on best practice. What are other teams doing?
BQ does not have a concept of a primary key or auto-increment for relational model support. All of the primary keys in the OMOP DDL are defined as INT64s. Without an auto-increment function, we are left trying workarounds for large batch/streaming/parallel inserts. We could use a 64-bit hash function over some fields (FARM_FINGERPRINT), but we would then have to manage hash collisions on the fly. The easier and more robust option is to change the DDL, make these primary keys all strings, and use the GENERATE_UUID function. But at what expense - will this break any analytics down the road?
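To make the two options concrete, here is a minimal sketch against the public CMS synthetic OMOP dataset; the choice of fields to hash is only an illustration, not a recommendation:

-- Option 1: hash identifying fields into an INT64 key (collisions must still be detected and handled)
SELECT
  FARM_FINGERPRINT(CONCAT(CAST(person_id AS STRING), '|',
                          CAST(procedure_concept_id AS STRING), '|',
                          CAST(procedure_date AS STRING))) AS procedure_occurrence_id
FROM
  `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`;

-- Option 2: change the DDL to STRING keys and generate UUIDs
SELECT
  GENERATE_UUID() AS procedure_occurrence_id
FROM
  `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`;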
Hoping to not be first in wondering about this.
Thanks!
For those who have had this problem before - what has been your approach? Did you convert to strings and use UUID? In what way did it impact analytical tools such as ATLAS?
This took me a while to separate from our main code, but here it is; there is no repo for it either. I used a BigQuery public dataset for demonstration. If you have any questions, feel free to ask them here. I hope it helps. I believe we have tested this method for producing primary keys on up to 40B rows. The code was originally written by one of our engineers, and I adapted it to post here.
-- Step 1: create a demo "CDM table" from a BigQuery public dataset,
-- carrying a temporary STRING key generated with GENERATE_UUID().
CREATE OR REPLACE TABLE
  `som-rit-starr-training.jdposada_explore.cdm_table` AS (
  SELECT
    procedure_concept_id,
    GENERATE_UUID() AS original_key
  FROM
    `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` );

-- Step 2: replace the UUID key with a unique INT64 key built from a random
-- partition id plus the row number within that partition.
WITH
  cdm_table AS (
  SELECT
    *
  FROM
    `som-rit-starr-training.jdposada_explore.cdm_table` ),
  -- assign each row a random partition id between 0 and 999,999,999
  with_random AS (
  SELECT
    CAST(FLOOR(RAND()*1000000000) AS STRING) AS r,
    original_key
  FROM
    cdm_table ),
  -- number the rows within each random partition
  random_w_rownum AS (
  SELECT
    r,
    original_key,
    ROW_NUMBER() OVER (PARTITION BY r) AS seq
  FROM
    with_random ),
  -- number of digits needed to represent the largest row number
  random_w_rownum_maxrc AS (
  SELECT
    LENGTH(CAST(MAX(seq) AS STRING)) AS max_psize_digit
  FROM
    random_w_rownum ),
  random_w_rownum_ps AS (
  SELECT
    *
  FROM
    random_w_rownum a,
    random_w_rownum_maxrc b ),
  -- final key = zero-padded partition id concatenated with zero-padded row number
  partitioned_by AS (
  SELECT
    original_key,
    r,
    CAST(CONCAT(FORMAT("%09d", CAST(r AS INT64)),
                FORMAT(CONCAT("%0", CAST(max_psize_digit AS STRING), "d"), seq)) AS INT64) AS final_key
  FROM
    random_w_rownum_ps )
SELECT
  final_key AS procedure_occurrence_id,
  cdm_table.* EXCEPT(original_key)
FROM
  cdm_table
INNER JOIN
  partitioned_by
ON
  cdm_table.original_key = partitioned_by.original_key
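A quick sanity check is to compare the total row count with the number of distinct generated keys; the two should match if the keys are unique. Assuming the output of the query above has been written to a (hypothetical) table called cdm_table_final, the check would look like:

SELECT
  COUNT(*) AS n_rows,
  COUNT(DISTINCT procedure_occurrence_id) AS n_distinct_keys
FROM
  `som-rit-starr-training.jdposada_explore.cdm_table_final`;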
Thank you @jposada.
Looks like this is a two-step process; please confirm. You originally load your OMOP tables with a UUID STRING as the key, likely generated on the fly. Once loaded, you go over the entire table and replace that key field with a generated random and unique INT64. Is that correct?
Do you maintain 2 datasets then? One with UUIDs for keys as part of data ingestion, and one with INT64s for keys for analytics?
Hi @gregk and @jposada, I’m trying to connect to BigQuery using connection details. I’m able to get it to work if I have permissions to create a service account. Has anyone been able to get DatabaseConnector to work with BigQuery when using a Google Account?
I have another question: have you run into issues with casing with OHDSI packages and BigQuery? From my understanding, BigQuery is case-sensitive, but I’ve noticed that when using CohortDiagnostics, it calls DatabaseConnector::renderTranslateQuerySql, which renders and then translates the SQL. I have a schema in BigQuery that is in upper case, but when translate is run, the schema name in the query is converted to lower case. Have you experienced this? When I manually call translate and then render, I get the expected SQL with the proper casing.
Tagging @schuemie and @Gowtham_Rao since they know DatabaseConnector and CohortDiagnostics. Thanks.
Yes, there is an issue with casing for table names. It usually has not bitten me because we made the internal decision to lowercase everything we produce in BQ for exactly this reason. This decision was made not only for OHDSI but for all our other Research Data Warehouse assets.
That makes sense. I unfortunately do not have that option. I’m trying to get OHDSI tools working in the All of Us Research Workbench (and we’re very close). This is the last hurdle (I think). SqlRender handles cases from what I can tell; it’s just translate that does not.
Same for Colorado as for Stanford – we moved to all lower-case table names for the same reason. For some logic that only Googlers understand, GBQ is case sensitive for table names but not for column names. Go figure…
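To illustrate (the dataset and table names here are just placeholders):

-- Assuming a dataset my_dataset containing a table named person (all lower case):
SELECT person_id FROM `my_dataset.person`;   -- works
SELECT person_id FROM `my_dataset.PERSON`;   -- fails: table names are case sensitive
SELECT PERSON_ID FROM `my_dataset.person`;   -- works: column names are not case sensitive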