OHDSI tables architecture - deploying on multiple dbms platforms

Gowtham_Rao · October 22, 2017, 7:25pm

@Frank benevolent dictator - Follow-up discussion after the wonderful tutorial you and others led at the 2017 Symposium.

How are the OHDSI tables created:

CDM tables created using DDL scripts
OHDSI schema and Results schema created by Flyaway.

Use case: when an instance has to be deployed over multiple DBMS for various reasons such as when primary platform

does not fully support the Hibernate ORM
does not support sequence function to obtain the cohort_definition_id or concept_set_id etc.
Here one platform maybe considered the primary platform (redshift) and the other a secondary platform. During computation ‘Joins’ may not happen between data from primary platform to data in secondary platform. Patient level data has to be on primary platform.

What needs to be solved: Confusion occurs when deploying over multiple platform (e.g. Amazon Redshit and PostGres). Almost all computation tasks run on the primary platform (Redshift like MPP systems)

What is needed: Documentation for each table as to what table maybe

only deployed in primary platform
maybe deployed in either primary or secondary platform.
finally update this page

Also needed: some brief description of the non-CDM tables created by FlyAway.

I started this process - maybe as a community we could complete them? Once we have created a final version - we can post it in the documentation section of CDM Github page?
Can we update the spreadsheet here - Editable version

Source schema:

contains source data
in native schema of source data

CDM schema:

All standardized data and tables in CDM schema
Has to be on the primary platform
Contains patient level data
Read only, only changed during ETL.
All the CDM tables are here and includes the vocabulary-tables.
One correction: ‘Cohort’ table is currently considered a CDM schema, but this is legacy from CDMv4 retained in CDM v5 to ensure backward compatibility StandardizedDerivedElements. There is a proposal to CDM workgroup to take it out of future version CDM v6. (@Frank and @clairblacketer where? cant find it - maybe it is lost? Is it this one? or this one . Maybe combine them into one issue and repost. no table name should be re-used across schemas)

Results schema

Output from CDM schema are inserted into Results-schema
Has to be in the primary platform when output from CDM schema needs to be inserted into Results schema
Derived, may contain patient level data (e.g. cohort) or summary statistics
include tables for achilles results, cohort generation, heracles results, estimation results, etc.
for every CDM schema there is one dedicated results-schema.
For example, a read-only schema with CDM data may be in schema dataA. To allow (for example for Achilles) to write to result tables (e.g., achilles_results table), a corresponding schema (with write privileges) would be dataA_results. OHDSI software tools do not try to manage various datasets, they rely on separation of results by simply separate database result schemas for each dataset. So, if you have a CDM in schema ‘dataA’ and it is read-only, you can create a new schema ‘dataA_results’ that is writable, and when you execute Achilles, specify dataA as the cdm_database_schema and dataA_results as the results_database_schema.

OHDSI schema

administrative schemas
The OHDSI Platform Metadata that describes the concepts and processes that were used to derive the results schema from the CDM schema are stored in the OHDSI schema.

Source Data: Person level data that is acquired through different parts of the healthcare system. These sources are said to exist in the “native schema”.
Standardized Data: Data that has been converted into the OMOP CDM format. These data are said to exist in the “CDM schema”.
Derived Data: Data that is the result of some form of processing on data that came from the OMOP CDM format. These data are said to exist in the “results schema”. Most of these data are aggregate summary statistics. Exceptions are Cohort table that contains subject_id.
OHDSI Platform Metadata: Metadata that is created or maintained to describe a concept, process or system function within the broader OHDSI architecture. These data are said to exist in the “OHDSI schema”.
Reference

@Frank Can we create a new OHDSI ‘architecture’ repository in GitHub? The wiki pages in ohdsi.org for architecture are confusing to navigate. e.g. i cant get to the pages referenced from here

Frank · October 24, 2017, 5:57pm

@Gowtham_Rao - thanks for putting this all together!

How are the OHDSI tables created:

CDM tables created using DDL scripts

OHDSI schema and Results schema created by Flyaway.

Yes, as referenced in the page you reference, you have to manually export and then create the “results” schema tables along side each CDM that you add to your installation. This is not ideal and I hope that the “shared results” service we’ve
been talking about on the architecture call becomes one way in which we improve this situation.

I am generally unhappy with the way we leverage Flyway in the architecture and we’ve started talking about alternate approaches.

The proposal to remove the Cohort table from the CDM schema is here.

As a community we’ve gone back and forth on where the wiki should live. We have the wiki outside of github because not everything we want to document has had a GitHub repository. I think that as we create our “specification” repository, whatever that ends up being, that perhaps that would be a good place for the architecture wiki to live.

Chris_Knoll · October 24, 2017, 10:32pm

@Gowtham_Rao:
One correction I’d make is that the ohdsi results schema is not managed by flyway. Just the tables found in the configuration database. Automatic migration of the configuration DB has been a great benefit since you rarely see questions asked about the spring batch job tables or the other configuration tables, but we routinely receive questions about how to set up the ohdsi results schema in each cdm. It would be great if flyway managed the whole thing, since people wouldn’t need to manage this by hand.

Gowtham_Rao · October 24, 2017, 11:30pm

Thanks @chris_knoll and @frank

Maybe if you could look at the Google spreadsheet I referenced in the first post – and put your comment or make changes directly into that spreadsheet?

It has all the tables from all schemas that I know of.

Chris_Knoll · October 24, 2017, 11:46pm

Sure thing, I’ve made some updates related to the cohort generation info, and cohort definition table. Also there’s 2 ‘cohort’ tables (one in cdm, one in results) I’ll see if they are accounted for separately.

Thanks for putting together the spreadsheet, it’s very helpful.

Chris_Knoll · October 25, 2017, 12:29am

@Gowtham_Rao:
I updated with the spreadsheet with the information I could find. I also re-sorted the spreadsheet just to help me find things and organize, hope that’s not a problem! There’s a few tables remaining that I couldn’t classify related to evidence and laerates. @anthonysena, could you fill in the gaps there?

Also, you might find that the webapi creates tables in the ohdsi repository schema that are categorized as results schema. Those tables created in the ohdsi repository schema are just provided as exemplars for assisting deployment to the results schema in the target CDM. I’ve marked those tables as ‘results schema’ in the spreadsheet. If we find it confusing, we can always drop those tables from the repository schema, but then we’ll have to update the instructions to provide the DDL to execute on the target CDM’s results schema.

gregk · November 12, 2017, 1:17pm

Hi Gowtham. fantastic thread - thank you. We will definitely be adding to this

When deploying to multiple databases, the tables in Results will not be created automatically but need to be created manually instead. I am talking to multiple folks who created those DDL scripts for various dialects already - including @Rijnbeek and @Frank - about potentially reusing them. A long term option would be, of course, to automate this as well.

Gowtham_Rao · November 15, 2017, 4:03pm

@gregk @Frank @Chris_Knoll

how does the webapi work in a multi dbms deployment. If the ohdsi schema webapi is on postgres and cdm/results in on primary dbms (redshift, impala, netezza etc).

GET http://www.example.com/cohort/id
Path Parameters

id
number
http://webapidoc.ohdsi.org/job/WebAPI/WebAPI_Miredot_Documentation/index.html#-304091199

It looks like GET calls is pointing to tables managed by Flyway. In multiple DBMS deployment, where the results tables need to be manually deployed outside of Flyway, how do we point the webapi to the results tables in redshift, impala etc?

Chris_Knoll · November 15, 2017, 4:05pm

When WebAPI starts up, it manages all of the tables that are configured in the database connection found in the XML configuration (the local configuration XML file that overrides the defaults in the public POM.xml…apologies for getting too technical for a second) via Flyway. After that, the results schema in the CDM databases that are specified in the ohdsi.source table need to be manually deployed.

Chris_Knoll · November 15, 2017, 4:08pm

@Gowtham_Rao,
The steps for adding a new CDM source for use by WebAPI are:

manually create the results schema and tables via manual DDL execution
Insert record into webapi.SOURCE with a new source ID, source key, JDBC URL and dialect
Insert record into webapi.SORUCE_DAIMON referencing the new source_id, but specify the table qualifiers for the results schema and cdm schema (1 record per table qualifier)
If webAPI is running go to /WebAPI/sources/refresh to refresh the cached sources on the server.

Then you should be able to see the new sources in ATLAS and select them for generating cohorts or executing other analysis types.

-Chris

MaximMoinat · March 14, 2018, 4:44pm

For a second cdm data source in Postgres, I have been going through the steps outlined above. It worked perfectly, except for one point. I also needed to execute Achilles to be able to generate cohorts. Otherwise the query fails. Is it indeed possible that cohort generation depends on Achilles tables or was this a coincidence?

Chris_Knoll · March 14, 2018, 6:52pm

No, cohort generation doesn’t have any dependency on Achilles tables. Do you know what error you encountered?