Google BigQuery support update

Dear OHDSI Community,

I wanted to share some important information related to the OHDSI tool stack support for Google Cloud Platform (GCP).

  • ATLAS 2.7.7 maintenance build was recently released. Among other, it contains a number of important fixes related to enabling better support for Google BigQuery. For those organizations utilizing Google BigQuery, it is highly recommended to upgrade ATLAS to that version.
    https://github.com/OHDSI/Atlas/tree/v2.7.7

  • Over the past couple of months, the Odysseus team worked collaboratively with multiple OHDSI collaborators to ensure that core OHDSI R library and methods are updated to work with Google BigQuery. There were a number of fixes across multiple important shared libraries, including DatabaseConnector, SQLRenderer and other.

  • While providing support during the execution of COVID19 studies with our Stanford partners, we have identified a number of Google BigQuery limits that required finding a complex workaround or a patch during code execution. We provided this feedback to the Google team and now happy to share that Google listened and removed, or significantly raised, multiple limits, including a number of concurrent inserts or table metadata updates. As a result of this close collaboration, the Stanford is able successfully participate in COVID19 related studies.

  • Over the past few months, OHDSI community have been having a continuous discussions regarding a need to do a more thorough testing of the OHDSI tools (ATLAS), shared components and study packages. One of the key components in testing is an ability to have access to the supported database platform. Google has offered to provide OHDSI community with some funding to be used during code testing with Google BigQuery testing. Odysseus and Google have stood up a shared environment that can be used for such purposes.

  • Also, very excited to share that Odysseus have created a new OHDSI ATLAS 2.7.7 image which will be available in Google Marketplace for a free download and installation shortly (pending review).

Greg

7 Likes

The atlas image on gcp marketplace is now available:

https://console.cloud.google.com/marketplace/browse?q=ohdsi%20atlas

1 Like

Tangential issue re BigQuery support, and hoping this community can provide input on best practice. What are other teams doing?

BQ does not have a concept of a primary key or auto-increment for relational model support. All of the primary keys in the OMOP DDL are defined as INT64s. Without that auto-increment function, we are left to trying workarounds for large batch/streaming/parallel inserts. We could try to use a 64 bit hash function over some fields (FARM_FINGERPRINT) but would then have to manage hash collisions on the fly. The easier and more robust thing is to change the DDL and make these primary keys all strings, and use the GENERATE_UUID function. But at what expense - will this break any analytics down the road?

Hoping to not be first in wondering about this.
Thanks!

Hi @mqunell8,

At Stanford, we implemented a custom BQ SQL to deal with that. Send me a direct message with your email and I can send you the code.

Best,

Jose

1 Like

Hi @jposada,
We have a similar problem. I will DM you also.

Thank you!

We have a similar problem. Will be reaching out soon.

For those who have had this problem before - what has been your approach? Did you convert to strings and use UUID? In what way did it impact analytical tools such as ATLAS?

Hi All,

I am putting together the code in a shareable form in Github. I Will post the link here in a couple of days

2 Likes

This took me a while to separate from our main code but here it is. no repo either. I used a BigQuery public dataset for demonstration. If you have any questions feel free to ask them here. I hope it helps you. I think we have tested the method to produce primary keys up to 40B rows. This code was originally made by one of our engineers and I adapted it to post it here.

@Javier, @mqunell8, @anoop.yamsani, @Konstantin_Yaroshove

CREATE OR REPLACE TABLE
  `som-rit-starr-training.jdposada_explore.cdm_table` AS (
  SELECT
    procedure_concept_id,
    GENERATE_UUID() AS original_key
  FROM
    `bigquery-public-data.cms_synthetic_patient_data_omop`.procedure_occurrence );
WITH
  cdm_table AS (
  SELECT
    *
  FROM
    `som-rit-starr-training.jdposada_explore.cdm_table` ),
  with_random AS (
  SELECT
    CAST(FLOOR(RAND()*1000000000) AS STRING) AS r,
    original_key,
  FROM
    cdm_table ),
  random_w_rownum AS (
  SELECT
    r,
    original_key,
    ROW_NUMBER() OVER (PARTITION BY r ) AS seq
  FROM
    with_random ),
  randow_w_rownum_maxrc AS (
  SELECT
    LENGTH(CAST(MAX(seq) AS STRING)) AS max_psize_digit
  FROM
    random_w_rownum ),
  random_w_rownum_ps AS (
  SELECT
    *
  FROM
    random_w_rownum a,
    randow_w_rownum_maxrc b ),
  partitioned_by AS (
  SELECT
    original_key,
    r,
    CAST( CONCAT(FORMAT("%09d", CAST(r AS INT64)), FORMAT(CONCAT("%0",CAST(max_psize_digit AS STRING), "d"),seq)) AS INT64) AS final_key
  FROM
    random_w_rownum_ps )
SELECT
  final_key AS procedure_occurrence_id,
  cdm_table.* EXCEPT(original_key)
FROM
  cdm_table
INNER JOIN
  partitioned_by
ON
  cdm_table.original_key = partitioned_by.original_key
3 Likes

Thank you @jposada.
Looks like this is a 2-step process, please confirm. You originally load your OMOP tables with a UUID STRING as the key, likely generated on the fly. Once loaded, you review the entire table and replace that key field with a generated random & unique INT64. Is that correct?

Do you maintain 2 datasets then? One with UUIDs for keys as part of data ingestion, and one with INT64s for keys for analytics?

We only keep the one with the int64

Thanks a lot Jose!

Hi @gregk and @jposada, I’m trying to connect to BigQuery using connection details. I’m able to get it to work if I have permissions to create a service account. Has anyone been able to get DatabaseConnector to work with BigQuery when using Using a Google Account?

1 Like

yes, I use the application default credentials.

You can use this small package I created to ease the process of using those credentials.

Thanks! I’ll check it out.

Thanks, @jposada. It worked!

I have another question :slight_smile: : Have you run into issues w/ casing with OHDSI packages and BigQuery? From my understanding, BigQuery is case-sensitive, but I’ve noticed that when using CohortDiagnostics, it calls DatabaseConnector:: renderTranslateQuerySql where it renders and then translates sql. I have a schema in BigQuery that is in upper case, but when translate is run, the schema in the query is converted to lower case. Have you experienced this? When I manually call translate and then render, I get the expected SQL with the proper casing.

Tagging @schuemie and @Gowtham_Rao since they know DatabaseConnector and CohortDiagnostics. Thanks.

Yes, there is an issue with casing for table names. It usually has not bitten me because we took the internal decision of lowercasing everything we produce in BQ because of this. This decision was made not only for OHDSI but all our other Research Data Warehouse assets.

That makes sense. I unfortunately do not have that option. I’m trying to get OHDSI tools working in the All of Us Research Workbench (and we’re very close). This is the last hurdle (I think). SqlRender handles cases from what I can tell; it’s just translate that does not.

Same for Colorado as for Stanford – we moved to all lower-case table names for the same reason. For some logic that only Googler’s understand, GBQ is case sensitive for table names but not for column names. Go figure…