OLYMPUS does not run on RedShift

FYI. Tried and failed to run Olympus with the local web api pointing to a RedShift database using Postgres SQL. The web api returned an error saying that a ‘create sequence’ command failed. I traced some of the packages used in the web api and, if I did not get lost, the web api uses Flyway along with Spring Batch. Spring Batch wants to create some tables and sequences in the target database. However, RedShift does not support sequences.
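For illustration, this appears to be the kind of statement that fails (batch_job_seq is one of the sequences Spring Batch’s schema scripts create; the exact DDL WebAPI runs may differ):

-- A sketch of the failing DDL; RedShift has no sequence support, so this errors out.
CREATE SEQUENCE batch_job_seq;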

Did a Google search on RedShift and Spring Batch. Saw one hit where someone talked about implementing Spring Batch on RedShift, but there was no follow-up.

Hi @DTorok,
I’m guessing that, from a WebAPI perspective, redshift is not going to be supportable. We have similar circumstances in our own environment with PDW: things like identity columns and foreign keys are unavailable, default values for inserts can only be constants (no inserting into a log table with getDate()), and so on. While we have SqlRender to allow us to write cross-database platform sql that will execute on redshift, there are other functions in WebAPI that are not supported on these ‘parallel database platforms’.

I’d suggest you look into the latest version of WebAPI from github: it supports multiple CDM databases as ‘sources’, and you’d configure the WebAPI database to be hosted on a standard PostgreSQL host (not redshift). Then all the CRUD operations in webapi (such as batch jobs, cohort definition editing, etc.) are stored on the ‘normal’ database platform, and you execute analysis queries (cohort generation, achilles analysis, heracles analysis, calypso analysis) against the CDMs (which can be on redshift). We can offer some help on setting that up if having this sort of separation is feasible for you.

-Chris

Chris,

Thank you for the suggestion. Separating the Spring Batch tables from the CDM operations seems like a solution. Your offer to help configure WebAPI is appreciated. Where are these database connections defined?

Happy New Year – Don

Don,

If you get the WebAPI from git, that version has support for different cdm databases, and the database configuration for each separate CDM is stored in a database table that WebAPI uses (this will be on the standard PostgreSQL instance).

I have a work-in-progress setup guide for getting WebAPI installed from a git repo. This is fairly technical, so if you’re not familiar with git, maven, or manually inserting data into database tables, it might be hard to follow. But here’s the link to the document:
https://drive.google.com/file/d/0B-pFyr5HBa0LcDhyTVJES1lSdG8/view?usp=sharing

If you want to give it a try, I can walk you through it. The main challenge is that there are some manual steps that have to be completed to get the different cdm ‘sources’ registered in webapi, plus some additional tables that must be created inside the RedShift environment to support the analytical tools.

-Chris

I have installed the web api pointing to a postgres server to get the setup working without having to worry about redshift. Now I would like to change the configuration to point to RedShift for the CDM and vocabulary tables. Any suggestions on which configuration values need to be changed would be appreciated.

In your postgreSQL database (let’s call this the configuration db), there are two tables called SOURCE and SOURCE_DAIMON. Inserting records into these tables will point WebAPI to the other, external databases, but you will need to create tables manually in your redshift environment to store the results of the various WebAPI functions:

cohort (*)
feas_study_index_stats
feas_study_result
feas_study_inclusion
heracles_results
heracles_results_dist

(*) Note: this is a different table than the cohort table in the CDM.
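As a rough sketch, the cohort table looks something like the following (the column types here are my assumption; copy the actual DDL from the tables WebAPI created in your postgres database):

-- A sketch of the cohort results table; verify against the real DDL in the configuration db.
CREATE TABLE cohort (
  cohort_definition_id INTEGER NOT NULL,
  subject_id BIGINT NOT NULL,
  cohort_start_date DATE,
  cohort_end_date DATE
);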

Please see this link https://github.com/OHDSI/WebAPI/wiki/Source-Configuration about how to insert the data into the SOURCE and SOURCE_DAIMON tables. You can also see example INSERT statements in the WebAPI installation guide:
https://github.com/OHDSI/WebAPI/wiki/WebAPI-Installation-Guide

-Chris

Chris,
Thanks, getting closer to a working configuration. I see that I can define different sources via the webapi.source table and then associate daimons to a source via the source_id. I also understand there are 4 daimon types: 1) CDM; 2) Vocabulary; 3) Results; 4) Evidence.

Still some questions:

  1. The reason I am using multiple databases is that RedShift does not support the sequences needed by flyway. Which of the daimons corresponds to flyway and as a result must not be on RedShift?

  2. Your reply says to create a cohort table that is not the CDM cohort table. Obviously it cannot go into the CDM schema. In my initial setup (pointing all sources to the local postgres) I see that 35 tables were created. Is there a way to identify which tables are used to support which daimon?

Hi, Don,
Re: Question 1: Yes, we have the same issue in our own internal large data appliance, where it doesn’t support identity columns, etc. That’s why we extended webapi to execute the analysis queries (such as cohort creation) against any datasource, as long as the analysis query doesn’t leverage anything like identity columns. So you should be able to configure the webAPI (via the flyway configuration parameters in maven’s settings.xml file) to point to postgresql, while the individual CDMs defined in the source table are hosted on redshift, with the JDBC url in the SOURCE table pointing to your redshift server. In other words, flyway only touches WebAPI’s own configuration database, not any of the daimons, so that database is the one that cannot live on RedShift.

The second part, the SOURCE_DAIMON table, contains the ‘table qualifier’ (or database schema) that is prepended to table names when performing the queries (we don’t use anything like a ‘default schema’ in our queries; all tables are explicitly qualified). So if your cdm schema is called ‘cdm’, you’d insert rows into the SOURCE_DAIMON table (source ID 1 is assumed to be the connection settings to your redshift) like so:

INSERT INTO ohdsi.source (source_id, source_name, source_key, source_connection, source_dialect) VALUES (1, 'My Cdm', 'MY_CDM', '{your redshift jdbc connection string}', 'redshift');

INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (1,1,0, 'cdm', 0);
INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (2,3,0, 'results', 0);

(This assumes that you have a schema on your postgresql database called ‘ohdsi’.) This will configure the daimons to look for the CDM tables in the cdm schema (the table_qualifier was 'cdm') and the results tables in the results schema, with both of these schemas defined on your redshift database.

When you start up the tools, you will have the choice of executing a cohort definition query against one of your configured SOURCEs. Using the information in the DAIMON table, it will know which schemas to look in for the CDM and RESULT tables.
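As an illustration, a generated cohort query against this source would be fully qualified along these lines (a hypothetical sketch, not WebAPI’s actual generated SQL):

-- Hypothetical qualified query: reads from the CDM daimon's schema, writes to the Results daimon's schema.
INSERT INTO results.cohort (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
SELECT 1, p.person_id, op.observation_period_start_date, op.observation_period_end_date
FROM cdm.person p
JOIN cdm.observation_period op ON op.person_id = p.person_id;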

Re: Question 2: we need to create documentation about which tables need to be created in the external SOURCE database to support each daimon. The CDM daimon is simply the CDM V5 tables; the RESULTS daimon is something that still needs to be documented. But I can tell you here: you can take the DDL for the following tables from the postgresql WEBAPI database and apply it over to your redshift database (these tables should not have any redshift-restricted column defs):

cohort (*) (for CIRCE)
feas_study_index_stats (for CALYPSO)
feas_study_result (for CALYPSO)
feas_study_inclusion (for CALYPSO)
heracles_results (for HERACLES)
heracles_results_dist (for HERACLES)
(Note: all of the above tables belong in the RESULTS schema)

I say you need to create a cohort table in the ‘results’ schema because the WebAPI tools write the data to the results schema, and the cohort definition results (the people found by the cohort definition) are written to {results schema}.cohort. The CDM’s cohort table is not used by WebAPI.
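So, for instance, a front-end read of generated cohorts goes against the results schema and never the CDM’s cohort table (again a hypothetical sketch):

-- Hypothetical read of cohort results; the CDM schema's cohort table is never touched.
SELECT subject_id, cohort_start_date, cohort_end_date
FROM results.cohort
WHERE cohort_definition_id = 1;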

-Chris

Don,
I think I should also describe our own internal setup so you can see how it mirrors yours:

The WebAPI (configured by Flyway) is hosted on a Microsoft SQLServer 2012 database.
We have a large data appliance called PDW that hosts 7 cdm databases.

When configuring the WebAPI (using the installation guide, the section on configuring the settings.xml for maven), we have a connection string pointing to MSSQL, with an application user and an admin user (the flyway account) configured.

To run analysis on our external CDMs, we have to do the following:
Each CDM gets a SOURCE record with the jdbc URL set to the PDW server and the dialect set to ‘pdw’ (in your case, you’d have jdbc URLs pointing to redshift, and the dialect set to ‘redshift’).
For each CDM source record, we add the following daimons (sketched in SQL after the list):

  1. CDM daimon with a table qualifier of ‘dbo’ (we put the CDM tables into the default schema on PDW, which resembles MS SQL).
  2. Results daimon with a table qualifier of ‘results’. For this daimon we create the tables I’ve listed above in a schema called ‘results’ on the PDW server.
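In SQL terms, that setup looks roughly like this (a hypothetical sketch; the ids, names, and connection string are illustrative):

-- Hypothetical PDW source record (values illustrative).
INSERT INTO ohdsi.source (source_id, source_name, source_key, source_connection, source_dialect) VALUES (2, 'PDW CDM 1', 'PDW_CDM_1', '{pdw jdbc connection string}', 'pdw');

-- CDM daimon (type 0) on PDW's default schema, Results daimon (type 2) on 'results'.
INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (3, 2, 0, 'dbo', 0);
INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (4, 2, 2, 'results', 0);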

That’s it! Once this is configured, the queries are generated to point to the correct schemas depending on the activity: querying CDM tables uses the CDM daimon’s table qualifier, and writing to the results tables uses the Results daimon’s table qualifier. On the front-end tools, we read the results from a SOURCE by looking up its results table qualifier and querying the tables from the proper results schema.

Hopefully this makes sense. Let me know if you have any troubles.

-Chris

I am confused by the source/source_daimon relationship. My assumption is that source_daimon.source_id points to the source record used to access the database. But in the earlier example you said to add a source record and then two source_daimon records. The first source_daimon record references source_id 1, which is your CDM data on the PDW server; that I understand. But the second source_daimon record, added for results, references source_id 3 (which I do not know anything about), and the daimon_type is again 0 (cdm).

This is inconsistent with my understanding that 1) source_daimon.source_id references the source record that points to the database, and 2) the daimon_type values are 0: CDM, 1: VOCABULARY, 2: RESULTS, and 3: EVIDENCE.

I would think that for the results, the source_daimon.source_id could be the same as the CDM since they are both on the same system (PDW in your case), but that the daimon_type should be 2 for results.

Yes, I’m sorry, I mis-copied the example:

INSERT INTO ohdsi.source (source_id, source_name, source_key, source_connection, source_dialect) VALUES (1, 'My Cdm', 'MY_CDM', '{your redshift jdbc connection string}', 'redshift');

INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (1,1,0, 'cdm', 0);
INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (2,1,2, 'results', 0);

Thanks for pointing that out, I’ll go back to my documentation to make sure the insert examples are defined correctly.
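One more note: if your vocabulary tables live in the same schema as the CDM, the Vocabulary daimon (type 1) is added the same way (a sketch; the source_daimon_id is illustrative):

-- Hypothetical Vocabulary daimon (type 1) reusing the cdm schema as its table qualifier.
INSERT INTO ohdsi.source_daimon (source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (3, 1, 1, 'cdm', 0);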

I believe we have the WebAPI working for the vocabulary. Thanks for the help. Checking other calls now.

Are all calls in WebAPI CDM-version aware? From what I can find, the service to get a concept (WebAPI/[source_key]/vocabulary/concept/0) assumes that the concept table has the column standard_concept, which is not valid for cdm v4.
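For example, the concept lookup presumably generates something shaped like the query below (my guess, not the actual WebAPI SQL); standard_concept does not exist in the v4 concept table, which has concept_level/concept_class instead:

-- Works against a CDM v5 concept table; fails on v4, where standard_concept is not a column.
SELECT concept_id, concept_name, standard_concept
FROM cdm.concept
WHERE concept_id = 0;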