
Broadsea installation using BigQuery

Hi,

I am wondering if someone has been able to get a Broadsea installation up and running using BigQuery? We are struggling to get this to work. At the moment we have Broadsea running on a Google Compute Engine VM. Atlas is working fine, and we added the BigQuery connection details to the source and source_daimon tables:

-- OHDSI CDM source
INSERT INTO webapi.source( source_id, source_name, source_key, source_connection, source_dialect, is_cache_enabled)
VALUES (2, 'My CDM', 'my_cdm',
  'jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<project_id>;OAuthType=0;OAuthServiceAcctEmail=<service_account>;OAuthPvtKeyPath=<service_account_key.json>;', 'bigquery', 't');

-- CDM daimon
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (4, 2, 0, 'my_cdm', 0);

-- VOCABULARY daimon
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (5, 2, 1, 'my_cdm', 10);

-- RESULTS daimon
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (6, 2, 2, 'results', 0);

-- TEMP daimon
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (7, 2, 5, 'temp', 0);

According to the Google documentation (ODBC and JDBC drivers for BigQuery | Google Cloud) we used the Simba BigQuery JDBC connector: https://storage.googleapis.com/simba-bq-release/jdbc/Simba%20Google%20BigQuery%20JDBC%20Connector%20Install%20and%20Configuration%20Guide-1.3.2.1003.pdf. The connection string for authentication with a service account is in the webapi.source table, and the JDBC driver .jar files are saved in the same folder as the docker-compose.yml, as described in the Broadsea README.
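For reference, that folder then contains something like this (exact jar names depend on the Simba driver bundle, so treat this as a sketch):

docker-compose.yml
GoogleBigQueryJDBC42.jar
google-api-client-1.28.0.jar
(plus the other dependency jars shipped with the Simba bundle)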

We were able to successfully run the docker-compose file, and our BigQuery source appears under Data Sources, but no data is shown. When we then check the docker logs of WebAPI, we see the following error message:

org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: No suitable driver found for jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;...

Since we were not able to fix the JDBC connection error, we decided to build the WebAPI image ourselves from its Dockerfile, using the webapi-bigquery profile:

docker build -t broadsea-webapi-bigquery --build-arg MAVEN_PROFILE=webapi-bigquery .

However, when building WebAPI we also soon ran into trouble. It seems there are still a lot of missing pieces in the webapi-bigquery profile. First of all, we had some missing artifact issues, which seem to be fixed by switching from JDK 11 to JDK 8, as suggested here: Build of WebAPI (Docker-Image) fails due to missing artifact - #2 by Chris_Knoll. However, we still ran into trouble with the following piece of the Dockerfile (https://github.com/OHDSI/WebAPI/blob/master/Dockerfile).

RUN mkdir .git \
    && mvn package \
     -P${MAVEN_PROFILE}

This results in the following error: Failed to execute goal org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file (google-api-client) on project WebAPI: The specified file '/code/src/main/extras/bigquery/google-api-client-1.28.0.jar' not exists. I figured it might be something regarding an incorrect folder reference, but I haven’t been able to solve it.

Hopefully someone out here knows what this latest error means, or has already figured out a solution for using Broadsea in combination with BigQuery. Please let me know if you can help.

Kind regards,
Tim


My understanding of connecting to Google BigQuery is that the drivers need to be loaded in order to connect (which is why you get the first error, ‘No suitable driver found’). We have WebAPI profiles for this: if you append the webapi-bigquery profile to your Maven build, it should copy those artifacts into the WAR. Note, I’m not sure what steps you need to take for this in Broadsea. There are instructions in this location, however, which may help.
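A rough sketch of such a build (assuming a standard WebAPI checkout; the profile names are the same ones used later in this thread):

git clone https://github.com/OHDSI/WebAPI.git
cd WebAPI
mvn clean package -P webapi-postgresql,webapi-bigquery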

The other option is to have the Google JDBC drivers copied to a location that makes them available to the host JVM globally. This would involve setting the CLASSPATH, or, if Broadsea hosts a Tomcat service, there is a folder for shared Java libraries. This is an example of how one vendor suggests installing JARs into Tomcat, and it should work the same by copying the BigQuery JDBC drivers to $TOMCAT_DIR/lib.
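Roughly like this (the jar name depends on the Simba driver version, so treat it as a sketch):

# copy the Simba BigQuery JDBC driver into Tomcat’s shared library folder
cp GoogleBigQueryJDBC42.jar $TOMCAT_DIR/lib/
# or expose the driver folder to the JVM via the global class path
export CLASSPATH="$CLASSPATH:/path/to/bigquery-jdbc/*"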

Thanks Chris! I think your first suggestion is pointing me in the right direction. Based on the output of mvn package I assumed it first downloaded and then installed all dependencies, including the drivers. However, now that I have copied the drivers into the image and adjusted the version numbers in the pom file, I have successfully built the WebAPI image for the first time. Now I’ll try to include this image in my Broadsea docker-compose.

This is what I used:

WORKDIR /code

ARG MAVEN_PROFILE=webapi-postgresql,webapi-bigquery

# Compile code and repackage it
COPY pom.xml /code/
COPY src /code/src
# copy the BigQuery JDBC driver jars to where the webapi-bigquery profile expects them
COPY /bigquery/ /code/src/main/extras/bigquery/
RUN mvn clean package -e -X \
    -Dgit.branch=${GIT_BRANCH} \
    -Dgit.commit.id.abbrev=${GIT_COMMIT_ID_ABBREV} \
    -P${MAVEN_PROFILE} \
    && mkdir war \
    && mv target/WebAPI.war war \
    && cd war \
    && jar -xf WebAPI.war \
    && rm WebAPI.war

Regarding your second option, this doesn’t seem viable, since the Broadsea version of the WebAPI doesn’t seem to have Tomcat included.

Great news: I now have Atlas up and running using the BigQuery drivers mentioned in the previous messages. However, although some data is shown now, I’m still not able to view all reports. The WebAPI logs mention that the results.achilles_analysis table does not exist, which is correct.

Where should this table be created? I did the following things to create the tables:

  1. Ran these DDL scripts: CommonDataModel/inst/ddl/5.4/bigquery at main · OHDSI/CommonDataModel · GitHub

  2. Ran the scripts retrieved from http://host:port/WebAPI/ddl/results?dialect=bigquery&schema=<results_schema>&vocabSchema=<vocab_schema>&tempSchema=<temp_schema>&initConceptHierarchy=true

  3. Ran the scripts generated with Achilles:

library(Achilles)

connectionDetails <- createConnectionDetails(
  dbms = "bigquery", 
  connectionString = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<project>;OAuthType=0;OAuthServiceAcctEmail=<service_account>;OAuthPvtKeyPath=<key_path>;",
  user = "",
  password = "",
  pathToDriver = "<driver_path>"
)

achilles(connectionDetails, 
         cdmDatabaseSchema = "<cdm_schema>", 
         resultsDatabaseSchema = "<results_schema>",
         scratchDatabaseSchema = "<temp_schema>",
         vocabDatabaseSchema = "<vocab_schema>",
         sourceName = "BigQuery",
         outputFolder = "output",
         sqlOnly = TRUE  # only generate the SQL files in the output folder; do not execute them
)

I assumed this last step was where the results.achilles_analysis table should be created, but this isn’t the case. Is there any parameter I am missing, do I need to use another function, or…? Any help is appreciated.

I would have expected that the achilles_analysis table would have been created during the Achilles analysis. @AnthonyMolinaro, would you have any info on this?

I’m digging through search results for achilles_analysis in the OHDSI repo and found this reference, so I would think that this table would have been created by Achilles.

Also, I found a reference to creating this table in this PR, but since that release I’m not sure how the code has changed to create the analysis SQL and the results tables. It looks like the analysis SQL was split into individual files, and the create table statements were split out as well. I just can’t see where they are being invoked.

Update: I think I found it.

The default params to achilles set createTable = TRUE, so I’m not sure why that table was not created.

I think I know: I ran the achilles() function with the parameter sqlOnly = TRUE. Since there are still some bugs in the generated BigQuery SQL, I had to run the queries manually and fix the bugs as I encountered them.

However, while the other statements are appended to the achillesSql object, the query that generates achilles_analysis is not. For this table, a create table statement is only executed when sqlOnly is set to FALSE; the data is then inserted directly from a csv file, which does not happen in sqlOnly mode.

Ok, maybe one option is to run achilles on a sample CDM dataset (or an empty CDM schema) just to get the achilles_analysis table, and then you can copy it over to your BigQuery instance?

Thanks! For now I solved it by running the R code manually. I then uploaded the resulting csv file to Google Cloud Storage and inserted it into the achilles_analysis table in BigQuery. It seems Atlas is now working correctly, and I can see all BigQuery data.
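For reference, the load step can look roughly like this (bucket, dataset, and file names are placeholders):

# load the achilles_analysis csv from Cloud Storage into BigQuery
bq load --source_format=CSV --skip_leading_rows=1 --autodetect \
  results.achilles_analysis \
  gs://<bucket>/achilles_analysis.csv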

Follow up on this post. At the moment there are two issues which I can’t figure out.

  1. When I try to count the occurrence of concept sets within my data source, I get the following error message: Unrecognized name: person_count at [1:59]. This seems to come from the following query:
select concept_id, record_count, descendant_record_count, person_count, descendant_person_count
from results.achilles_result_concept_count
where concept_id in (?,?,?,?);

However, the achilles_result_concept_count table doesn’t have a person_count column. This table was created using the achilles() function in the Achilles package, but the source code also doesn’t suggest this table should contain the column mentioned: https://github.com/OHDSI/Achilles/blob/main/inst/sql/sql_server/analyses/create_result_concept_table.sql. Am I missing something?

  2. I’m not able to query the BigQuery database from Hades using the DatabaseConnector package. This is what I’m trying:
library(DatabaseConnector)
library(SqlRender)

connectionDetails <- createConnectionDetails(
  dbms = "bigquery", 
  connectionString = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<project>;OAuthType=0;OAuthServiceAcctEmail=<service_account>;OAuthPvtKeyPath=<key_path>;",
  user = "",
  password = "",
  pathToDriver = "<driver_path>"
)
connection <- connect(connectionDetails)

care_site <- querySql(
  connection,
  sql = "SELECT * FROM care_site",
  errorReportFile = "errorReport.txt"
)

I seem to be able to establish a connection, which also shows up under the RStudio database connections. However, when sending a query I get the following error report:

DBMS:
bigquery

Error:
java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: EXEC_JOB_EXECUTION_ERR

SQL:
SELECT * FROM care_site

R version:
R version 4.2.1 (2022-06-23)

Platform:
x86_64-pc-linux-gnu

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- SqlRender (1.13.1)
- bigrquery (1.4.1)
- Achilles (1.7)
- DatabaseConnector (6.1.0)

When I try to query the database with the bigrquery and DBI packages, I do get results, so the connection doesn’t seem to be the issue. Am I doing something wrong, or are there any known issues with the DatabaseConnector and SqlRender packages?

Ok, this could be an oops on my part: this commit made achilles_result_concept_count a required table, and I didn’t know it was an Achilles table in the first place! I thought it was a WebAPI-specific optimization table that collects record counts from the Achilles results, which are very slow to fetch ‘on the fly’. This is the DDL that builds the table for WebAPI. You can overwrite your achilles_result_concept_count table with this, and I think we should consider either removing that table creation from Achilles, or updating the Achilles process to use this new code.
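For orientation, the column set that WebAPI’s record-count query above expects looks roughly like this (inferred from the SELECT earlier in this thread, not copied from the linked DDL, so use the linked DDL as the authoritative definition):

-- sketch only: columns inferred from the query in the previous post
CREATE TABLE results.achilles_result_concept_count (
  concept_id INT64,
  record_count INT64,
  descendant_record_count INT64,
  person_count INT64,
  descendant_person_count INT64
);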

Thanks! This seems to have worked. FYI, the demo dataset in the Broadsea image also contains this same bug.

Regarding the Hades connection issue I also found a solution. I was using the newest version (1.3.0.1001) of the JDBC driver, but it seems to be incompatible with DatabaseConnector/SqlRender. When using the 1.2.25.1029 version of the driver, I am able to execute BigQuery queries.

Follow up on this one. I’m trying to set up a new WebAPI 2.13.0 instance using exactly the same settings as before. However, my container keeps restarting every few seconds and I cannot really figure out what is going wrong. The only thing I can find in the docker logs is:

org.flywaydb.core.internal.dbsupport.FlywaySqlException: Unable to obtain Jdbc connection from DataSource

SQL State : 28P01
Error Code : 0
Message : FATAL: password authentication failed for user "postgres"

However, I did not change the user or password. Does this tell you anything? I probably forgot something, but I don’t have a clue at the moment.

When loading Atlas this causes a 502 error: Failed to load resource: the server responded with a status of 502 (Bad Gateway)

I can say that Flyway ensures that the database is in correct working order, and if the logs report that it was unable to authenticate to the postgres database, the process will shut down.

Which version of Broadsea are you using?

I’m still using Broadsea 2.0.0, since I got this to work before. I tried 3.0.0 also, but without any success yet. Does the error message tell you anything?

The problem definitely seems to be in WebAPI. When I just take the latest version, add the default settings to WebAPIConfig and build/run it using the webapi-postgresql profile, it crashes almost instantly. Could it be that something changed in one of its dependencies? It used to work fine.

So this postgres account, can you confirm it works otherwise? E.g. in a client like pgAdmin or SQL Workbench/J (or even R)?
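Something along these lines would do it from R (a minimal sketch; the server/database values are assumptions for a default Broadsea setup, so adjust them to your compose file, and the folder given in pathToDriver needs to contain the PostgreSQL JDBC jar):

library(DatabaseConnector)

connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/OHDSI",   # assumed host/database of the WebAPI postgres instance
  user = "postgres",
  password = "<password>",
  port = 5432,
  pathToDriver = "<driver_path>"
)
connection <- connect(connectionDetails)
querySql(connection, "SELECT 1;")  # returns a single row if authentication works
disconnect(connection)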

I’m not sure whether I’m able to check this. It is the default postgres user on the default WebAPI postgres database. I assume WebAPI needs to be running in order for me to check this? Also, I’m not sure whether the problem is with the user or with the JDBC driver.
