
Let's talk about Cohort Materialization

All,
Now that we have cohort saving/loading, my next step in CIRCE is to introduce jobs that will take a saved cohort definition and materialize the cohort asynchronously. I’d like to open a discussion about how this is facilitated and how it impacts other dependent applications such as HERACLES. My thought is that a cohort generation job (i.e., a Spring Batch job) will receive parameters specifying the cohort definition ID to generate, the target table to persist to, and possibly an optional step to copy the generated cohort to the CDM schema.
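To make that concrete, here’s a rough sketch (the class and parameter names are hypothetical, not final) of how such a job might be launched with those parameters:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class CohortGenerationService {
  private final JobLauncher jobLauncher;
  private final Job generateCohortJob; // assembled elsewhere in the WebAPI batch config

  public CohortGenerationService(JobLauncher jobLauncher, Job generateCohortJob) {
    this.jobLauncher = jobLauncher;
    this.generateCohortJob = generateCohortJob;
  }

  public void generate(int cohortDefinitionId, String targetTable, boolean copyToCdm) throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("cohort_definition_id", String.valueOf(cohortDefinitionId))
        .addString("target_table", targetTable)               // where to materialize the cohort
        .addString("copy_to_cdm", String.valueOf(copyToCdm))  // optional copy to the CDM schema
        .addLong("time", System.currentTimeMillis())          // keeps each run a distinct job instance
        .build();
    jobLauncher.run(generateCohortJob, params);
  }
}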

I’d also like to talk about any new support tables we’d need. Perhaps a ‘cohort_generation_statistic’ table that records things like how long generation took, the status of the cohort (success/fail), any process messages, etc.
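As a straw man (the column names are just a guess to frame the discussion), that table could be modeled as a JPA entity along these lines:

import java.util.Date;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "cohort_generation_statistic")
public class CohortGenerationStatistic {
  @Id
  @Column(name = "cohort_definition_id")
  private int cohortDefinitionId;

  @Column(name = "start_time")          // when generation began
  private Date startTime;

  @Column(name = "execution_duration")  // how long generation took, in ms
  private Long executionDuration;

  @Column(name = "status")              // e.g. RUNNING, COMPLETE, FAILED
  private String status;

  @Column(name = "message")             // any process messages (errors, row counts, etc.)
  private String message;

  // getters/setters omitted
}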

-Chris

Chris, do you have an idea as to who will create OHDSI.COHORT?

I was going to go ahead and add it to my flyway config, but didn’t want to break anything.

I think you can do it. I think that as new applications come up with specialized uses of cohorts, they may create their own cohort tables. I’ve left the cohort expression SQL generation open enough to let you specify where to materialize the cohort, so we can just configure the job that materializes the cohort to write to your cohort table for Heracles processing.
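For example (a hypothetical call, just to illustrate the configurable target):

// CohortExpressionQueryBuilder and this two-argument signature are stand-ins;
// the point is that the target table is a parameter of the SQL generation.
String sql = CohortExpressionQueryBuilder.buildQuery(definition.getExpression(), "heracles_cohort");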

@cahilton, I’d like to get involved with creating the job that produces the cohort, but I don’t know if this should be developed on an independent branch, put right into master, or introduced into your branch. This job isn’t going to be used by only one app, but Heracles (and Circe) are definitely the first to need it. Are you envisioning Heracles kicking off a cohort job, or do you see that happening in Circe? Either place would be fine since they both talk to the same WebAPI.

This is one of the last tasks I need to finish for a complete Circe app, so I’m very anxious to know how you would prefer to proceed.

-Chris

I always envisioned that as part of Circe, but I’m open to it being part of Heracles too. (I think that may have been a @jon_duke original wish).

I would stick with a separate branch, and I can merge in when I need it.

Ok, consider it done. I’ll put it into a new branch. I’m going to use the CohortAnalysisTask and Tasklet as a model.

@alfranke, is there anything I should consider when copying that approach? Such as any lessons learned that I should take into account that might be not done properly in the example?

-Chris

Nothing coming to mind…

Chris,

I see kicking off the job to produce the cohort as a natural final step in CIRCE. Whether I have just created a new cohort definition or am reviewing an existing one, that feels like the right launch point.

Creating a cohort in Heracles can be enabled but would be more of an exception to the typical workflow.

@alfranke, I’m modeling the GenerateCohortTask and GenerateCohortTasklet after the CohortAnalysis{Task|Tasklet} and had a question on the pattern:

Is it necessary to pass the SQL to run into the tasklet? I was hoping the tasklet could work as follows:

  1. Invoke the path /cohortdefinition/{id}/generate.
  2. Trigger the tasklet, passing ‘cohort_definition_id’ as a job parameter. I’m thinking the class ‘GenerateCohortTask’ will have just one property: cohortDefinitionId.
  3. Inside GenerateCohortTasklet.execute, I create an instance of TransactionCallback(), but instead of executing SQL passed into the tasklet, it loads the cohort definition from the repo, deserializes the expression, builds the query, and runs it. I also want to take an additional step after the cohort is generated: update a cohort generation status (a new table I’d like to introduce) and THEN consider the task finished. (A rough sketch of this flow is below.)

Is there any limitation with Spring Batch jobs where you can’t query out to a repo during execution of the task? Maybe the answer is that I query for the cohort definition first and make that a parameter of the task…

The crux of my question is: can I do more work inside the tasklet, or should it simply be ‘execute these SQL statements’?
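For concreteness, here’s a rough sketch of what I’m picturing. CohortDefinitionRepository, CohortDefinition, and CohortExpressionQueryBuilder are stand-in names, and it assumes querying the repo inside the tasklet is allowed:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.TransactionCallback;
import org.springframework.transaction.support.TransactionTemplate;

public class GenerateCohortTasklet implements Tasklet {
  private final JdbcTemplate jdbcTemplate;
  private final TransactionTemplate transactionTemplate;
  private final CohortDefinitionRepository repository; // hypothetical Spring Data repo

  public GenerateCohortTasklet(JdbcTemplate jdbcTemplate,
                               TransactionTemplate transactionTemplate,
                               CohortDefinitionRepository repository) {
    this.jdbcTemplate = jdbcTemplate;
    this.transactionTemplate = transactionTemplate;
    this.repository = repository;
  }

  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
    // the only job parameter: which cohort definition to generate
    final int id = Integer.parseInt(chunkContext.getStepContext()
        .getJobParameters().get("cohort_definition_id").toString());

    transactionTemplate.execute(new TransactionCallback<Void>() {
      @Override
      public Void doInTransaction(TransactionStatus status) {
        // 1. load the saved definition and deserialize its expression
        CohortDefinition def = repository.findOne(id);
        // 2. build the materialization SQL from the expression (hypothetical builder)
        String sql = CohortExpressionQueryBuilder.buildQuery(def.getExpression());
        // 3. run it against the target cohort table
        jdbcTemplate.execute(sql);
        return null;
      }
    });

    // 4. record the outcome in the proposed status table, THEN report finished
    jdbcTemplate.update(
        "UPDATE cohort_generation_status SET status = 'COMPLETE' WHERE cohort_definition_id = ?",
        id);
    return RepeatStatus.FINISHED;
  }
}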

-Chris

Re: Job tracking - FWIW, in my other project we created a table like this:

CREATE TABLE `batch_job_info` (
  `job_instance_id` bigint(20) NOT NULL,
  `number_processed` bigint(20) DEFAULT NULL,
  `total` bigint(20) DEFAULT NULL,
  `output_location` varchar(300) DEFAULT NULL,
  `creating_user_id` int(11) DEFAULT NULL,
  KEY `job_instance_fk` (`job_instance_id`),
  KEY `creating_user_id_fk` (`creating_user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

The job_instance_id joins back to the Spring Batch tables to get the status, but this table let us store other information about the job. (This is for NLP processing, so it tracks how many report rows there are in total and how many have been processed so far - similar to your example.) The table is updated as the job runs, so we can watch it progress through the reports.

We can also join it to the user table to see who created the job. Obviously, our needs are a little different, but it might serve as a model. Alex may have some other ideas that he’s implemented for this kind of thing.

This query (which we made into a view) joins the table with the Spring Batch tables to see the status, who created the job, how many reports the process has completed, etc.

SELECT bji.job_instance_id      AS job_id, 
       bji.job_name             AS job_name, 
       bje.create_time          AS date_created, 
       bje.start_time           AS date_started, 
       bje.end_time             AS date_completed, 
       bje.status               AS status, 
       bje.exit_code            AS exit_code, 
       bje.exit_message         AS exit_message, 
       bj_info.number_processed AS number_of_docs_processed, 
       bj_info.total            AS total_docs, 
       bj_info.output_location  AS output_location, 
       bj_info.creating_user_id AS creating_user_id, 
       u.username               AS creating_username 
FROM   batch_job_instance bji 
       LEFT JOIN batch_job_execution bje 
              ON bji.job_instance_id = bje.job_instance_id 
       LEFT JOIN batch_job_info bj_info 
              ON bji.job_instance_id = bj_info.job_instance_id 
       LEFT JOIN users u 
              ON bj_info.creating_user_id = u.pk_id 

Nice! Yeah, I imagine we’ll have context-specific views that join application-specific tables into the Spring Batch job tables to get additional details about the activity, just like you did with your batch_job_info table.

Thanks for showing us this!
