
Google BigQuery support update


(Gregory Klebanov) #1

Dear OHDSI Community,

I wanted to share some important information related to the OHDSI tool stack support for Google Cloud Platform (GCP).

  • The ATLAS 2.7.7 maintenance build was recently released. Among other things, it contains a number of important fixes that enable better support for Google BigQuery. Organizations utilizing Google BigQuery are highly recommended to upgrade ATLAS to this version.
    https://github.com/OHDSI/Atlas/tree/v2.7.7

  • Over the past couple of months, the Odysseus team worked collaboratively with multiple OHDSI collaborators to ensure that the core OHDSI R libraries and methods are updated to work with Google BigQuery. There were a number of fixes across multiple important shared libraries, including DatabaseConnector, SqlRender, and others.

  • While providing support during the execution of COVID-19 studies with our Stanford partners, we identified a number of Google BigQuery limits that required complex workarounds or patches during code execution. We provided this feedback to the Google team and are now happy to share that Google listened and removed, or significantly raised, multiple limits, including limits on the number of concurrent inserts and table metadata updates. As a result of this close collaboration, Stanford is able to successfully participate in COVID-19 related studies.

  • Over the past few months, the OHDSI community has been having a continuous discussion about the need for more thorough testing of the OHDSI tools (ATLAS), shared components, and study packages. One of the key requirements for such testing is access to the supported database platforms. Google has offered to provide the OHDSI community with some funding to be used for testing code against Google BigQuery, and Odysseus and Google have stood up a shared environment that can be used for this purpose.

  • Also, we are very excited to share that Odysseus has created a new OHDSI ATLAS 2.7.7 image, which will be available in the Google Cloud Marketplace for free download and installation shortly (pending review).

Greg


(Matt Hall) #2

The ATLAS image on the GCP Marketplace is now available:

https://console.cloud.google.com/marketplace/browse?q=ohdsi%20atlas


(mqunell8) #3

A tangential issue regarding BigQuery support, and hoping this community can provide input on best practice. What are other teams doing?

BQ does not have a concept of a primary key or auto-increment for relational model support. All of the primary keys in the OMOP DDL are defined as INT64s. Without auto-increment, we are left trying workarounds for large batch/streaming/parallel inserts. We could use a 64-bit hash function over some fields (FARM_FINGERPRINT), but would then have to manage hash collisions on the fly. The easier and more robust option is to change the DDL, make these primary keys all strings, and use the GENERATE_UUID function. But at what expense: will this break any analytics down the road?
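For concreteness, here is a minimal sketch of the two options; the project, dataset, and column names below are hypothetical and for illustration only:

SELECT
  -- Option 1: keep INT64 keys by hashing identifying fields with
  -- FARM_FINGERPRINT; collisions are rare but must still be handled.
  FARM_FINGERPRINT(CONCAT(person_source_value, '|', CAST(year_of_birth AS STRING))) AS person_id_hash,
  -- Option 2: switch the DDL to STRING keys and use GENERATE_UUID;
  -- collision-free, but deviates from the INT64 OMOP DDL.
  GENERATE_UUID() AS person_id_uuid
FROM
  `my-project.staging.person_raw`;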

Hoping I'm not the first to wonder about this.
Thanks!


(Jose Posada) #4

Hi @mqunell8,

At Stanford, we implemented a custom BQ SQL script to deal with that. Send me a direct message with your email and I can send you the code.

Best,

Jose


(Javier Gracia-Tabuenca) #5

Hi @jposada,
We have a similar problem. I will DM you also.


(mqunell8) #6

Thank you!


(Anoop Yamsani) #7

We have a similar problem. Will be reaching out soon.


(mqunell8) #8

For those who have had this problem before - what has been your approach? Did you convert to strings and use UUID? In what way did it impact analytical tools such as ATLAS?


(Jose Posada) #9

Hi All,

I am putting together the code in a shareable form on GitHub. I will post the link here in a couple of days.


(Jose Posada) #10

This took me a while to separate from our main code, but here it is (no repo in the end, just the code below). I used a BigQuery public dataset for demonstration. If you have any questions, feel free to ask them here; I hope it helps you. I think we have tested the method producing primary keys on tables of up to 40B rows. This code was originally written by one of our engineers, and I adapted it to post here.

@Javier, @mqunell8, @anoop.yamsani, @Konstantin_Yaroshove

CREATE OR REPLACE TABLE
  `som-rit-starr-training.jdposada_explore.cdm_table` AS (
  -- Step 1: materialize the source rows with a temporary UUID key so
  -- each row can be identified while the INT64 key is constructed.
  SELECT
    procedure_concept_id,
    GENERATE_UUID() AS original_key
  FROM
    `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` );
-- Step 2: replace the UUID with a unique INT64 key built from a random
-- partition prefix plus the row number within that partition.
WITH
  cdm_table AS (
  SELECT
    *
  FROM
    `som-rit-starr-training.jdposada_explore.cdm_table` ),
  -- Assign each row to a random partition in [0, 1e9).
  with_random AS (
  SELECT
    CAST(FLOOR(RAND()*1000000000) AS STRING) AS r,
    original_key
  FROM
    cdm_table ),
  -- Number the rows within each random partition.
  random_w_rownum AS (
  SELECT
    r,
    original_key,
    ROW_NUMBER() OVER (PARTITION BY r) AS seq
  FROM
    with_random ),
  -- Count the digits needed to represent the largest row number.
  random_w_rownum_maxrc AS (
  SELECT
    LENGTH(CAST(MAX(seq) AS STRING)) AS max_psize_digit
  FROM
    random_w_rownum ),
  random_w_rownum_ps AS (
  SELECT
    *
  FROM
    random_w_rownum a,
    random_w_rownum_maxrc b ),
  -- Concatenate the zero-padded partition prefix and row number; the
  -- fixed widths make the resulting INT64 unique across all rows.
  partitioned_by AS (
  SELECT
    original_key,
    r,
    CAST(CONCAT(FORMAT("%09d", CAST(r AS INT64)), FORMAT(CONCAT("%0", CAST(max_psize_digit AS STRING), "d"), seq)) AS INT64) AS final_key
  FROM
    random_w_rownum_ps )
SELECT
  final_key AS procedure_occurrence_id,
  cdm_table.* EXCEPT(original_key)
FROM
  cdm_table
INNER JOIN
  partitioned_by
ON
  cdm_table.original_key = partitioned_by.original_key
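
The fixed widths are what guarantee uniqueness: every key is a 9-digit zero-padded partition prefix followed by a max_psize_digit-wide row number, so no two (r, seq) pairs can produce the same INT64. As an illustration (values made up, not taken from the dataset above):

SELECT
  -- '000004211' concatenated with '07' gives '00000421107',
  -- which casts to the INT64 key 421107
  CAST(CONCAT(FORMAT("%09d", 4211), FORMAT("%02d", 7)) AS INT64) AS final_key;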

(mqunell8) #11

Thank you @jposada.
Looks like this is a 2-step process; please confirm. You originally load your OMOP tables with a UUID STRING as the key, likely generated on the fly. Once loaded, you reprocess the entire table and replace that key field with a generated random and unique INT64. Is that correct?

Do you maintain 2 datasets then? One with UUIDs for keys as part of data ingestion, and one with INT64s for keys for analytics?


(Jose Posada) #12

We only keep the one with the INT64 keys.


(Anoop Yamsani) #13

Thanks a lot Jose!

