
Cohort generation computing time

I have looked at several examples of cohorts and tried to regenerate some of them with Atlas.

SynPUF 1% (I assume it means 1% of the total SynPUF) is tiny. Although the data source for the demos is minuscule and the inclusion criteria are very simple, building this cohort took over a minute.

If I simply extrapolate that time to the full SynPUF dataset (which is still tiny in comparison to real-life EMR datasets), it would take over two hours. If I were to extrapolate this to terabyte-sized datasets, we would be looking at 1+ days.
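For what it is worth, the back-of-the-envelope arithmetic behind that extrapolation looks like this (the one-minute-plus baseline is the demo run above, the scale factors are rough guesses, and linear scaling itself is an assumption):

    # Naive linear extrapolation of cohort generation time with data size.
    # Assumptions: ~1.3 minutes observed on the SynPUF 1% sample, time scaling
    # linearly with data volume, and rough guesses for the scale factors below.
    baseline_minutes = 1.3                            # "over a minute" on SynPUF 1%
    full_synpuf_minutes = baseline_minutes * 100      # 1% sample -> full SynPUF
    terabyte_minutes = baseline_minutes * 2000        # guess for a TB-sized EMR dataset

    print(f"full SynPUF:  ~{full_synpuf_minutes / 60:.1f} hours")   # ~2.2 hours
    print(f"TB-scale EMR: ~{terabyte_minutes / 60 / 24:.1f} days")  # ~1.8 days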

Any question of “interactive” cohort building would be a complete fantasy.

Does anyone know what kind of hardware is being used to run the demo?

Assuming some people have installed the Atlas software in their own environments, can you share your observations on how long it takes to build cohorts and what computing resources you need for that?

The ATLAS that’s available on the OHDSI website is on a small box running PostgreSQL with toy data. It is purely to illustrate the tool, not to demonstrate its performance, and I would not project anything from your experience there.

In my organization, we run ATLAS against an MS APS database environment, running many concurrent sessions with extremely complex cohort definitions against multiple datasets over 1 TB (containing >350m lives in aggregate), and we achieve quite desirable performance. Simple queries are mostly sub-minute, complex ones rarely take more than a few minutes, and none take more than an hour. I’ve seen similar performance at other sites running on different infrastructure (e.g. SQL Server, Redshift, Postgres), particularly on smaller datasets (when the data contains ~1m instead of ~100m patients, I would expect most queries to execute in <1 min).

All that said, I’m not sure what your expectations in terms of ‘interactive
cohort building’ are. If you are interested in a real-time experience,
where analyses are being performed as you type your code, then the ATLAS
infrastructure isn’t for you (though the other ATLAS tool built by
Stanford by @nigam and friends is pretty close to that). For me, I
definitely want near-real-time performance in browsing the vocabulary to
create conceptsets and define the cohort logic, but once I’ve set up my
cohort definition (a process that requires critical thought and generally
takes several minutes), I’m ok with hitting ‘generate’ and waiting a minute
to get back the summary report. Others might have more ambitious
expectations, in which case I encourage them to join the community and help
build out the stack to meet those needs.

Thanks for responding. Is MS APS running on Azure? How many nodes, cores, etc… or is it a single in-house system?

Also, what is the tool you mention from Stanford called? Is there a link to learn more about it?

The Janssen APS environment is not running in Azure, but is on-prem.
10 servers: each with 256 GB physical RAM, 2 Xeon CPUs, 32 cores.

We’ve also seen comparable performance with Redshift clusters, specifically dc1.8xlarge node types, roughly 4 to 8 nodes.

The Stanford tool serves a specific purpose, where we need sub-second response times for an “Informatics Consult” service.

You can see a video at http://inf-consult.stanford.edu. The tool has similar response times with data up to ~120 million lives x 7 years.

Regards
Nigam

Thank you, @nigam. When you say your tool serves a specific purpose, does it mean that it is not an appropriate tool for cohort building?

Are you using SQL underneath your query statements?

To have the real-time response shown in the first video (I did not get sound on the second), what underlying cluster specs are you using?

Thanks.

Hi @optimizer

I am part of a team at QuintilesIMS that builds the E360 ‘Cohort Builder’, which is capable of very quick, complex cohort definitions and queries. We currently have 700+ million patient lives loaded into the tool, a number that keeps growing as we add further datasets from across the world. Hardware-wise, we use MS SQL Server as our backend, with a combination of sharded database servers and some very cool patented search algorithms!
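To give a sense of what sharding typically looks like in general (the patented search algorithms are obviously not public, so this is a generic sketch and not E360’s actual design): data is usually partitioned by patient identifier, a cohort query is fanned out to all shards, and each database server only scans its own slice.

    # Generic illustration of shard routing by patient id -- not E360's design.
    # The shard count and connection strings are made up for the example.
    NUM_SHARDS = 8
    SHARD_DSNS = [f"mssql://shard-{i}.example/omop" for i in range(NUM_SHARDS)]

    def shard_for(person_id: int) -> str:
        """Return the database server holding this patient's records."""
        return SHARD_DSNS[person_id % NUM_SHARDS]

    # A cohort query runs on every shard in parallel and the partial results are
    # unioned, so each server touches only ~1/NUM_SHARDS of the patient lives.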

A lot of our datasets are in the OMOP schema (which E360 talks to natively), and we have recently done work with the ATLAS team to allow patient cohorts generated from our tools to be passed to ATLAS for analysis.

If you would like more detailed info, just reach out.

It is definitely for cohort building; specifically, it’s for real-time cohort building. Please see the adjacent 40-minute video for details. Most use cases do not need that level of response time, as Patrick (@patric_ryan) mentioned. For our use case, we do … and hence we built the new search engine.

There is no underlying SQL (or any other off-the-shelf thing). It’s something we built from scratch. Our system needs a Linux OS, Java, and a web server. The instance with ~1.7 million lives is running on a small VM; no cluster needed. For scaling to the ~120 million lives range, we use a couple of compute instances on GCP … and it costs $151 per day to get that performance with the larger dataset.
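Just to give the general flavor of why a purpose-built, in-memory approach can be this fast (a toy sketch only, and emphatically not how our engine works): if you keep an index from each clinical code to the set of patient ids that have it, a simple cohort question reduces to set algebra.

    # Toy sketch only -- not the Stanford engine, just the general idea of answering
    # cohort questions from an in-memory index instead of issuing SQL.
    from collections import defaultdict

    code_to_patients = defaultdict(set)     # clinical code -> set of person_ids

    def index_record(person_id, code):
        code_to_patients[code].add(person_id)

    def cohort_count(include, exclude):
        """Count patients having every 'include' code and none of the 'exclude' codes."""
        hits = set.intersection(*(code_to_patients[c] for c in include))  # include must be non-empty
        for code in exclude:
            hits -= code_to_patients[code]
        return len(hits)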

Interesting video, @nigam. Would you say that the stumbling blocks to widespread adoption are 1) clinicians don’t really want to do programming, and 2) a leap of faith is needed to go from conclusions about a cohort population to assuming that an individual who is part of the cohort is likely to respond similarly to the other members of the cohort? (My feeling is there is a legal question hiding in there somewhere.)

@nigam p.s. Did your team feel that only a language would be expressive enough to do what you wanted? You did not feel that a GUI would be up to the job?

The goal is to have a skilled data scientist in the loop before one draws a conclusion. Regarding the GUI, it’s beyond my skill set to design something that provides the expressivity we need to build ad hoc cohorts, and to do that fast. Given unlimited funding it might be possible … but that is beyond the means of an academic!

@JWickson What kind of performance do you get (with 7e8 lives), and with how many nodes, how much RAM, etc.? Do you cache anything in RAM?

Let’s say I want to assess the efficacy of a medical intervention versus a procedure. Your system gets a list of related condition codes to include and some disqualifying conditions to exclude, a list of drugs to include and a list to exclude, and a code for the procedure. In addition, let’s say it is going to be a pediatric cohort, with a defined observation period.
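Concretely, I mean a parameter set roughly like the following (every name and code here is made up, just to pin down what “entering the parameters” involves):

    # Hypothetical parameter set for the example above; all codes are made up.
    cohort_parameters = {
        "conditions_include": ["C-101", "C-102"],    # qualifying conditions
        "conditions_exclude": ["C-900"],             # disqualifying conditions
        "drugs_include":      ["D-10"],              # the medical intervention
        "drugs_exclude":      ["D-77"],
        "procedures_include": ["P-55"],              # the comparator procedure
        "age_range_years":    (0, 17),               # pediatric cohort
        "observation_period": ("2010-01-01", "2015-12-31"),
    }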

Can you estimate the time required for someone to enter the parameters, and then the time to actually build the cohort, i.e. a list of person_ids plus the relevant data attached to that list?

Thanks.

Again, @nigam, thank you for taking the time to explain your work. As I watched the video more carefully, additional questions came to mind. From the mode of entry of additional filtering criteria it appears to me that the intermediate cohort results are being cached, to be subjected to further filtering, if desired. Correct me if I am wrong, but after your tool filters for the condition code, for example, that cohort is persisted in memory, so that the next operation operates on that partly distilled cohort, and so on, and so on. Right?
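In pseudo-Python, this is what I imagine happening (purely my guess at the mechanism, not a claim about your implementation):

    # My guess at the mechanism, not a description of the actual tool: each new
    # criterion intersects a cached set of person_ids left by the previous one,
    # so later filters only touch the already-narrowed cohort.
    all_person_ids = {1, 2, 3, 4, 5}                               # toy data
    matches = {"condition C1": {1, 2, 3}, "pediatric": {2, 3, 5}}  # criterion -> matching ids

    cohort = set(all_person_ids)                     # cached intermediate result
    for criterion in ["condition C1", "pediatric"]:  # applied one at a time, as in the UI
        cohort &= matches[criterion]                 # operates only on the cached cohort

    print(cohort)                                    # {2, 3}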

I am curious, for the 1.7 million lives setup (similarly for the 120m if known), what is the total size of the data that includes all columns, not just the columns for the filtering criteria?

After you run your query, do you end up with just the set of person_id(s) or is the result a cohort which contains an entire patient record suitable to be examined on a single person basis?

Thanks

Thanks, @Ajit_Londhe, for the details. Can you add any info on how responsive your system is and how much data you are handling? Also, is that 32 physical cores or hyper-threads, i.e. 16 real cores with 32 hardware-supported threads?

Thank you.

It’s a longer conversation; the questions are making certain assumptions about how we do things (and they might not be accurate).

Best done on pen and paper at a future OHDSI meeting :-).

@optimizer Sorry for the delay in responding; I have had holidays and other life commitments!

Performance is near real-time. We don’t really do anything in RAM, but we do employ SSDs and InfiniBand networking to get some performance bumps across the database servers. Most of our performance needs are met in the software and SQL layers.

And yes, you are correct in your assumption that the slowest part of the system is people and keyboards!

The ‘E360 Cohort Builder’ is built around the premise of codelists (procedure, diagnosis, drug, etc.), which can be built using our codelist management software. These (as well as tests and clinical results) can be combined in a boolean logic expression (AND, OR, AND NOT) using a visual editor, and then time filters (relative and absolute) can be applied across these expressions (e.g. Procedure A occurs after Diagnosis B within 90 days). We also have other filters to determine things like gender, practice, clean periods, etc. One last thing the system can do, which is unique, is use cohort definitions (the boolean expression plus time and other filters) in further cohort definitions, thus allowing users to build up a library of expressions/cohorts for re-use.
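In the abstract, a definition built this way is just a nested expression over codelists with temporal constraints and filters attached; very roughly (this is only a sketch of the concept, not our internal representation):

    # Conceptual sketch only -- not E360's internal representation.
    # A definition combines codelists with boolean logic, temporal constraints and
    # demographic filters, and can itself be referenced by later definitions.
    diagnosis_b = {"codelist": "Diagnosis B"}
    procedure_a = {"codelist": "Procedure A"}

    example_definition = {
        "AND": [
            diagnosis_b,
            {"event": procedure_a, "after": diagnosis_b, "within_days": 90},
        ],
        "AND_NOT": [{"codelist": "Exclusion drugs"}],
        "filters": {"gender": "F", "clean_period_days": 365},
    }

    # Saved definitions can be reused as building blocks in further definitions,
    # which is how the library of expressions/cohorts mentioned above builds up.
    follow_up_definition = {"AND": [{"cohort": "example_definition"},
                                    {"codelist": "Lab result X"}]}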

The time needed to build a cohort definition varies from study to study, but the process is iterative in nature, so people tend to get quicker the more they use the system. It is certainly less than hours, and often a lot less than that (minutes) for an experienced user.
