
Partitioned project support in OHDSI / WebAPI

We’re interested in supporting many projects in a single OHDSI tools / web API environment, satisfying these basic requirements:

  • Cohorts / user data created for one project are not (at least by default) visible in other projects.

  • Users can create their own data for the project matching the OMOP schema, and merge it with the main dataset in OMOP, without affecting other projects. This will allow users to create their own observations about people and use them in a subsequent cohort building or analysis.

  • We do not want to make a copy of the main dataset for every project, or spin up separate databases or servers for each project.

One approach to doing this would be adding row-level security to both WebAPI and OMOP tables, based on some new fields indicating the project owners of those rows. This could work, but isn’t super-portable, and there are some concerns about how it would perform.

Another approach, which we think may work better, is:

  • Dynamically create a schema for each project, containing all the user created WebAPI tables (e.g. cohort_definition), that only that project can access. (These would need to be deleted when a project is deleted.)

  • In the same schema (or perhaps a second schema), create project-specific tables with definitions matching the OMOP tables with some prefix (e.g. project_observation, project_measurement), and create views with names matching the names of the OMOP tables, doing a UNION between the OMOP tables and the project tables.

  • Change WebAPI so that the schema used to access cohorts, etc. and the schema used to access OMOP can be dynamically replaced based on the project being used.

(There may be some limits on number of schemas per database, which may necessitate creating more databases and sharding projects across them; otherwise this should scale reasonably well.)
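To make the union-view idea concrete, here is a minimal sketch using SQLite from Python. This is an illustration only: `ATTACH` stands in for a per-project schema, and the table shapes, schema name `proj1`, and concept ids are simplified placeholders, not the real OMOP/WebAPI definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Shared, read-only OMOP table (columns heavily simplified).
conn.execute("CREATE TABLE measurement (person_id INT, measurement_concept_id INT, value_as_number REAL)")
conn.execute("INSERT INTO measurement VALUES (1, 3036277, 175.0)")  # e.g. a height record

# Project-specific "schema" holding user-created rows with the same shape.
conn.execute("ATTACH DATABASE ':memory:' AS proj1")
conn.execute("CREATE TABLE proj1.project_measurement (person_id INT, measurement_concept_id INT, value_as_number REAL)")
conn.execute("INSERT INTO proj1.project_measurement VALUES (1, 3038553, 24.2)")  # e.g. a derived BMI row

# A view UNIONing shared and project rows; tools would query this view
# as if it were the plain OMOP table.  (TEMP so it may reference both schemas.)
conn.execute("""
    CREATE TEMP VIEW measurement_union AS
    SELECT * FROM main.measurement
    UNION ALL
    SELECT * FROM proj1.project_measurement
""")

rows = conn.execute("SELECT COUNT(*) FROM measurement_union").fetchone()
print(rows[0])  # 2
```

In a real deployment the view would live inside the project schema under the plain OMOP table name, so WebAPI could be pointed at the project schema without knowing the union exists.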

Has anyone attempted to do this sort of thing before? Is there a better way of supporting projects with separate data in a single instance?

Thanks!

Dan Rodney
Verily Life Sciences

(One complication with the latter approach that occurred to me: when the WebAPI tables change, database migrations would probably need to be applied to tables in all the project schemas, rather than just the single OHDSI schema.)

@Dan_Rodney:

Sounds very cool. And very much needed, especially for environments with many disparate data sources.

Couple questions, trying to decipher a little bit what you are trying to do here:

Not sure what you mean. Who are “Users”? The ones converting the data from their sources? Or the ones analyzing them? What do you mean by “merging” with the main datasets in OMOP? Usually, the patient data are read-only for analysts. They are kind of God-given. There is nothing to create and merge.

What do you mean by “project”? We usually think of “studies”, and each study goes against the same instance of the data within each data holder. Usually, folks have barely enough resources to have one copy that will respond at decent speed.

Not sure I understand that either. What do you want to UNION? The data tables won’t change in a study. The analysts might create result tables, which are “project”-specific. They are dependent on the actual data instance, but each data node could create their own identical environment for doing that.

Let us know how you are thinking about this.

Thanks, Christian – responses inline.

Users here are researchers. The idea is to augment the original data set with derived observations or measurements of interest that were not in the original data set. There are a couple possible use cases to support here:

  • Researchers browse the data available on patients, diving into details on participants of interest, and add their own annotations that weren’t explicitly present in the original data
  • Researchers use scripts to generate new measurements or observations from the original data

Eventually some of this could feed back into the main curated data set, though by default this added data would just live in the project workspaces.

I think we’re using “project” and “study” interchangeably here.

In your terminology, we want each study to have its own private workspace for user created data (including cohort definitions and user-created observations or measurements), to be served up by shared servers.

Researchers can create their own result tables in whatever form they like for custom analyses; but presumably that means tools that use the OMOP schema with the main data set can’t also make use of them. The thought here is that by taking a UNION of project-specific tables and the main dataset tables, tools that operate off of OMOP can be used with user-created data.

To take an example: let’s say the source data has height, weight, sex, and age for patients but not BMI. A researcher could write a script that inserts BMI calculations into the project’s “project_measurement” table for each patient, such that the resulting “measurement” view contains those measurements along with everything else. Now tools that use the OMOP schema (e.g. for building cohorts or analyzing data) can use this augmented data as well.
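A rough sketch of that BMI flow, again in Python/SQLite with heavily simplified placeholder columns and concept names (not the real OMOP measurement schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (person_id INT, concept TEXT, value REAL)")
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                 [(1, "height_cm", 175.0), (1, "weight_kg", 70.0)])

# The project-private table into which the researcher's script writes.
conn.execute("CREATE TABLE project_measurement (person_id INT, concept TEXT, value REAL)")

# Researcher's script: BMI = weight / height^2 (height in metres).
for pid, h, w in conn.execute("""
        SELECT hm.person_id, hm.value, wm.value
        FROM measurement hm JOIN measurement wm ON hm.person_id = wm.person_id
        WHERE hm.concept = 'height_cm' AND wm.concept = 'weight_kg'""").fetchall():
    conn.execute("INSERT INTO project_measurement VALUES (?, 'bmi', ?)",
                 (pid, round(w / (h / 100) ** 2, 1)))

# The union view exposes original plus derived rows to OMOP-consuming tools.
bmi = conn.execute("""
    SELECT value FROM (
        SELECT * FROM measurement UNION ALL SELECT * FROM project_measurement
    ) AS m WHERE m.concept = 'bmi'""").fetchone()[0]
print(bmi)  # 22.9
```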

@Dan_Rodney:

I see where you are coming from. Interesting perspective. Would like to see examples of that. Here is how it jibes with the assumptions folks generally make for OMOP-based data:

  • The patient data are fixed, read-only. It is whatever the data source tells us about a patient. All doctor’s visits, hospitalizations, diagnoses, diagnostic or therapeutic procedures, drug prescriptions, etc.
  • The OMOP conversion just changes the format and the representation (the coding). It does not change the content.
  • Researchers are not writing to those OMOP tables. It’s read-only to them.
  • Except there are exceptions from these rules:
  1. When people run studies, they will create results: summary statistics, scores, rates, relative risks, hazard ratios, etc. We usually don’t refer to them as “data”, but as “results” (that’s just a convention here). They do not go into the CDM, but are stored somewhere else.
  2. What you seem to be referring to is the development of derived data, and they should go into the CDM. Well, sometimes. There are several flavors:
    • The contents of the DRUG_ERA, CONDITION_ERA and DOSE_ERA tables are derived, but written into the same read-only schema as the rest of the data.
    • If you want to calculate derived covariates, like your BMI, you use the COHORT_ATTRIBUTE table. That, and the COHORT table, are read/write tables and are used during analysis.
    • The COHORT_ATTRIBUTE table in particular, which is designed to hold covariates for multivariate analyses, is rarely used by folks. Instead, they create them during the analysis and throw them away.
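For illustration, a minimal sketch of how a derived covariate like BMI could land in COHORT_ATTRIBUTE. The columns follow the CDM v5 layout, but the cohort and attribute definition ids used here are made-up placeholders, and SQLite is standing in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cohort_attribute (
        cohort_definition_id INT, subject_id INT,
        cohort_start_date TEXT, cohort_end_date TEXT,
        attribute_definition_id INT, value_as_number REAL)
""")
# Attribute 42 = "BMI at cohort entry" (hypothetical attribute definition id).
conn.execute("INSERT INTO cohort_attribute VALUES (100, 1, '2017-01-01', '2017-12-31', 42, 22.9)")

# Analysis code then pulls covariates per subject for the cohort.
row = conn.execute("""
    SELECT subject_id, value_as_number FROM cohort_attribute
    WHERE cohort_definition_id = 100 AND attribute_definition_id = 42
""").fetchone()
print(row)  # (1, 22.9)
```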

We currently don’t prescribe the database schemas where read-only and read/write tables go. But we want to add that into the CDM (on my to-do list).

BTW: for managing all this in a distributed fashion and running queries and studies remotely (by somebody who doesn’t even have direct read access), folks are working on a solution. You may want to talk to @gregk and his Arachne development.

Let us know if you have specific ideas or needs, and we’ll bake it all in.

Thanks – this is very helpful!

Having the ability to create cohort definitions and custom result tables in a project-specific schema / namespace (with permissions restricting who can access it) is a strong requirement for us; being able to have user-created data show up in tools that consume OMOP data is lower priority.

So we will likely want to focus on the former first – having a way of creating and using cohort tables, etc. in project-specific schemas within the same database / server environment. I believe this could be accomplished with some changes to WebAPI. Does this sound right to you? What would be appropriate next steps?

@Dan_Rodney

You mentioned that you don’t want to spin up new servers for each project but have you considered a Docker container based approach? It would scale out on a Docker Swarm/Kubernetes/Mesos Marathon cluster to support multiple projects.

When a new project/study is initiated a new OHDSI tools/WebAPI Docker container would be launched which is configured to access the database schema created for the project.

My company maintains the OHDSI webtools Broadsea Docker container images. You can read about them here:

Private message me if you would like more info on a container based approach.

@Dan_Rodney

Dan - sounds very interesting. Will try to summarize here what you are trying to do, just to make sure I understand it correctly:

  1. You want to support multiple study projects by giving them their own "private workspace" where the researchers define and manage study-specific cohort definitions and other study-related objects
  2. Share one copy of the OMOP CDM data schema between projects
  3. Enable researchers in projects to annotate data in the shared OMOP CDM data schema, yet have an option to keep annotations private or public (shared between researchers)

I am not sure about your comment regarding users creating their own data. The data in CDM instances typically come from EHR, claims, and lab sources, where they are generated by healthcare professionals. Who are your study users, and how would they be able to generate patient-specific data, unless you mean annotating existing patient-specific data with some insights?

What is the source of your OMOP CDM data? How many separate schemas are you planning to have?

At this moment, ATLAS - being a great tool allowing you to design very sophisticated and flexible cohorts and studies - is not a true multi-tenant environment. This is something we have been discussing, though. Would be really good to clarify your requirements and combine our efforts?

Odysseus (my company) has created a study management platform - Arachne - that sits on top of ATLAS and allows researchers to have study-specific workspaces, helping them conduct end-to-end studies. It does sound like it might answer at least some of your needs. Please ping me - I would be very interested to talk with you about your requirements and take you through Arachne’s capabilities and our product roadmap.

Is there any problem with setting up multiple WebAPIs per ‘workspace’? Different WebAPIs can point to the same CDMs for the patient-level data, but must point to their own results schema for purposes of storing analytical results. Would that work for the case described above? Seems like it does work for #1 and #2 above, but I don’t quite understand how the OHDSI tools would support the notion of ‘annotation of data’ (we don’t write data back into the CDM for annotation, and I’m not even sure where annotation of data fits into the CDM model)…
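For what it’s worth, a toy sketch of that configuration in terms of WebAPI’s SOURCE / SOURCE_DAIMON registration tables, mocked here in SQLite with invented schema names and ids. The daimon_type codes assumed are 0 = CDM, 1 = Vocabulary, 2 = Results; two workspaces share one CDM/vocabulary schema but each gets a private results schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified versions of WebAPI's registration tables.
conn.execute("CREATE TABLE source (source_id INT, source_key TEXT, source_name TEXT)")
conn.execute("""CREATE TABLE source_daimon (
    source_id INT, daimon_type INT, table_qualifier TEXT, priority INT)""")

# Register two "workspaces" as separate sources over the same patient data.
for sid, key in [(1, "WORKSPACE_A"), (2, "WORKSPACE_B")]:
    conn.execute("INSERT INTO source VALUES (?, ?, ?)", (sid, key, key.title()))
    conn.execute("INSERT INTO source_daimon VALUES (?, 0, 'cdm', 1)", (sid,))    # CDM (shared)
    conn.execute("INSERT INTO source_daimon VALUES (?, 1, 'vocab', 1)", (sid,))  # Vocabulary (shared)
    conn.execute("INSERT INTO source_daimon VALUES (?, 2, 'results_' || ?, 1)",  # Results (private)
                 (sid, key.lower()))

qualifiers = [q for (q,) in conn.execute(
    "SELECT table_qualifier FROM source_daimon WHERE daimon_type = 2 ORDER BY source_id")]
print(qualifiers)  # ['results_workspace_a', 'results_workspace_b']
```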

@Chris_Knoll and @lee_evans: it is possible that having multiple instances of WebAPI (dynamically spun up and configured to point at dynamically created schemas) could accomplish what we want – possibly using docker containers as you suggest, Lee – without any changes to WebAPI itself. We will definitely consider whether that’s an approach we’d like to pursue. Part of the goal here is to make sure that people on the web can set up accounts and projects for free; we would need to ensure that this sort of approach wouldn’t be too resource-intensive.

@gregk: primarily what we have in mind about user-created patient data is annotating existing data with insights, or calculating fields that weren’t present in the original CDM (e.g. our BMI example.) This might include generating new annotations from EHR records, etc.

We’re definitely interested in providing a multi-tenant experience for OHDSI tools, without requiring a ton of resources or any manual setup per project. Whether that’s done by making OHDSI instances capable of connecting to different schemas depending on the user, or providing dynamically spun up instances of OHDSI tools pointed at different schemas as Lee suggests, is an open question.

Just wanted to put in a +1 for this thread and thanks to @Dan_Rodney for highlighting key requirements and potential solutions for multi-project environments. We have a similar need at Georgia Tech and would be eager to collaborate on open-source approaches to this.

OHDSI has definitely made progress with daemons and authentication (thanks to @Chris_Knoll, @Frank, et al). But in terms of multiple WebAPIs, we need to find a less hacky (and dare I say elegant) solution to working with OHDSI in complex research environments.

Jon

Thanks @hripcsa for pointing out this thread to us. This has been an important requirement for us in VA as well. Our current environment is set up using Windows Authentication. Each project has a separate MS SQL Server database with views filtering off the complete VA OMOP repository. With the new WebAPI features, users are able to select which project database to point to, but as I’ve interpreted from discussions with @mmatheny and @Frank, some things like error handling don’t gracefully propagate back to the UI.

VA would be happy to contribute to the discussion of what is needed, possibly contribute code, and serve as a use case for testing updates. As VA is doing pilots with different technologies, now is a great time to explore better ways to design partitioned project spaces.

Scott DuVall
Department of Veterans Affairs
