CohortDiagnostics: What is the purpose of JSON / can one avoid providing it?

OskarGauffin · May 13, 2022, 9:46am

Hello! I’m interested in setting up CohortDiagnostics for a cohort that I’ve created using “hand-written” SQL, i.e. without ATLAS.

But it seems I’m not allowed to: CohortDiagnostics seems to want a JSON file. I don’t understand why that’s needed, and would like to know if there’s some workaround, for instance a “convertSQLtoJSON” function?

Gowtham_Rao · May 13, 2022, 12:15pm

That’s a good question.

Starting with version 3, CohortDiagnostics does NOT generate cohorts, so it does not need the JSON for that purpose.

CohortDiagnostics has diagnostics related to concept sets that require the JSON. The incidence rate computation also requires the JSON. Both require SQL that is generated using Circe.

If you are willing to turn those two diagnostics off, then you can provide any “dummy” JSON and you should get the other diagnostics.

I have not tested this out. Please let me know your experience.

OskarGauffin · May 13, 2022, 12:11pm

Thank you very much for answering, most helpful.

In terms of dummy JSON, we actually put together a JSON file manually, which at least makes it possible to run CohortDiagnostics. But we felt uncertain about how it is used in relation to the SQL: does the JSON affect anything, in case we messed up constructing it?

Is it correct to say that as long as we’re turning off the incidence and concept set-tabs, we can provide whatever JSON that makes CohortDiagnostics run, and the output will be based on the SQL and not the JSON?

Perfectly OK to advise that we should go through the code and turn off the incidence and concept set calculations as well; just asking in case we could save ourselves that little endeavor.

Gowtham_Rao · May 13, 2022, 12:31pm

It’s not a simple answer, but in theory you can do what you are trying to do.

Note: We require the input to CohortDiagnostics to be an object called ‘cohortDefinitionSet’. This object is defined by the OHDSI/CohortGenerator package. In addition, we require that object to have the fields json, cohortId, cohortName and sql. The reason is that cohortJson, as generated by the OHDSI Circe library, is parsed by CohortDiagnostics and its companion DiagnosticsExplorer Shiny app.

The check is performed here: https://github.com/OHDSI/CohortDiagnostics/blob/5f4d80e9f4210ffaf2ac7d4ab2b26102e4987c58/R/RunDiagnostics.R#L256

    # (excerpt starts partway through the preceding assertion)
        "cohortTable",
        "cohortInclusionTable",
        "cohortInclusionResultTable",
        "cohortInclusionStatsTable",
        "cohortSummaryStatsTable",
        "cohortCensorStatsTable"
      ),
      add = errorMessage
    )
    checkmate::assertDataFrame(cohortDefinitionSet, add = errorMessage)
    checkmate::assertNames(names(cohortDefinitionSet),
      must.include = c(
        "json",
        "cohortId",
        "cohortName",
        "sql"
      ),
      add = errorMessage
    )

    cohortTable <- cohortTableNames$cohortTable
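To see what that check means for a hand-written cohort, here is a minimal base R sketch of the same column test. The helper `missingRequiredColumns` and the example data are hypothetical and not part of CohortDiagnostics; only the four column names come from the excerpt above.

```r
# Hypothetical helper (not part of CohortDiagnostics): report which of the
# four required columns are missing from a candidate cohortDefinitionSet.
requiredColumns <- c("json", "cohortId", "cohortName", "sql")

missingRequiredColumns <- function(cohortDefinitionSet) {
  stopifnot(is.data.frame(cohortDefinitionSet))
  setdiff(requiredColumns, names(cohortDefinitionSet))
}

# A cohort defined only by hand-written SQL: there is no json column yet.
sqlOnly <- data.frame(
  cohortId = 1,
  cohortName = "Hand-written cohort",
  sql = "-- hand-written cohort SQL goes here",
  stringsAsFactors = FALSE
)
print(missingRequiredColumns(sqlOnly))

# Adding a placeholder json column satisfies the column-name check.
withJson <- sqlOnly
withJson$json <- "{}"
print(missingRequiredColumns(withJson))
```

Note that this only mirrors the column-name assertion; it says nothing about whether the JSON content itself is usable downstream.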

The content of this object, including the JSON, is exported more or less as-is in the output, here: https://github.com/OHDSI/CohortDiagnostics/blob/5f4d80e9f4210ffaf2ac7d4ab2b26102e4987c58/R/RunDiagnostics.R#L435

The ‘DiagnosticsExplorer’ Shiny app parses the JSON to show the details of the cohort in the app.
The concept set diagnostics (runIncludedSourceConcepts, runOrphanConcepts, runBreakdownIndexEvents) require cohortJson because they extract information about the concept sets from it, as here:
https://github.com/OHDSI/CohortDiagnostics/blob/5f4d80e9f4210ffaf2ac7d4ab2b26102e4987c58/R/ConceptSets.R#L131
The incidence rate computation needs cohortJson because it tries to get the washoutPeriod from the JSON:
https://github.com/OHDSI/CohortDiagnostics/blob/5f4d80e9f4210ffaf2ac7d4ab2b26102e4987c58/R/IncidenceRates.R#L240

So you will have to skip all of these processes and provide a dummy JSON to avoid the error.
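As a rough sketch of that workaround, the setup could look like the following. Several things here are assumptions rather than anything confirmed in this thread: the placeholder `"{}"` as dummy JSON, the cohort names and SQL strings, and the argument name `runIncidenceRate` (the three concept set switches are the ones named above). The actual call is left commented out because it needs a database connection and has not been tested.

```r
# Assumption: "{}" stands in for the dummy JSON. The thread does not say what
# a working dummy looks like, so treat this value as a placeholder.
cohortDefinitionSet <- data.frame(
  cohortId = c(1, 2),
  cohortName = c("Hand-written cohort A", "Hand-written cohort B"),
  sql = c(
    "-- hand-written SQL for cohort A",
    "-- hand-written SQL for cohort B"
  ),
  json = "{}",
  stringsAsFactors = FALSE
)

# Diagnostics described above as depending on the JSON, all switched off.
# runIncidenceRate is an assumed argument name.
diagnosticSwitches <- list(
  runIncludedSourceConcepts = FALSE,
  runOrphanConcepts = FALSE,
  runBreakdownIndexEvents = FALSE,
  runIncidenceRate = FALSE
)

print(names(cohortDefinitionSet))
print(unlist(diagnosticSwitches))

# Not run: the switches would be passed along with your own connection and
# schema arguments, roughly like this.
# do.call(
#   CohortDiagnostics::executeDiagnostics,
#   c(list(cohortDefinitionSet = cohortDefinitionSet), diagnosticSwitches)
# )
```

Keeping the switches in one list makes it easy to see at a glance which diagnostics were disabled for a given run.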

Gowtham_Rao · May 13, 2022, 12:39pm

@OskarGauffin you are welcome to make a proposal for this functionality on the OHDSI/CohortDiagnostics issue tracker on GitHub.

I think it would be useful in many scenarios. For example - it is possible that we take as an input an instantiated cohort table i.e. a table with cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. Can we just point to that cohortTable and runCohortDiagnostics?

I think that would be valuable, as we can get a dashboard (the DiagnosticsExplorer Shiny app) that provides:

cohort counts
cohort overlap
visit context
cohort characterization and temporal characterization, including cohorts as features (new in version 3)
cohort time series (also new in version 3)
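To illustrate why an instantiated cohort table alone is enough for some of these, here is a small base R sketch that derives cohort counts and a subject-level overlap from a table with the four columns mentioned above. The data are made up, and this is not how CohortDiagnostics computes these diagnostics (the package runs against the database); it only shows that neither JSON nor SQL is involved.

```r
# Made-up instantiated cohort table with the four standard columns.
cohort <- data.frame(
  cohort_definition_id = c(1, 1, 1, 2, 2),
  subject_id = c(101, 102, 103, 102, 104),
  cohort_start_date = as.Date(c(
    "2020-01-01", "2020-02-15", "2020-03-10", "2020-02-20", "2020-05-05"
  )),
  cohort_end_date = as.Date(c(
    "2020-06-30", "2020-08-01", "2020-04-10", "2020-09-15", "2020-07-05"
  ))
)

# Cohort counts: number of distinct subjects per cohort.
cohortSubjects <- vapply(
  split(cohort$subject_id, cohort$cohort_definition_id),
  function(ids) length(unique(ids)),
  integer(1)
)
print(cohortSubjects)

# Cohort overlap: subjects that appear in both cohort 1 and cohort 2.
subjectsIn <- function(id) {
  unique(cohort$subject_id[cohort$cohort_definition_id == id])
}
overlap <- intersect(subjectsIn(1), subjectsIn(2))
print(overlap)
```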

@jpegilbert is the new maintainer of CohortDiagnostics. So he will have to consider this issue.

jpegilbert · May 16, 2022, 8:47pm

The CohortGenerator package requires JSON for several reasons. Firstly, the cohort will have additional metadata, such as inclusion rules, that is used to generate statistics and other base information about a cohort. The output exploration in Shiny also currently requires the definition.

As you mentioned before, it is possible to place “dummy JSON” in the cohort definition set parameter passed to CohortDiagnostics. However, although this is possible, the package was never designed to run with custom SQL, because this isn’t seen as good practice when phenotyping well-defined, reusable cohorts.

As an alternative to ATLAS, the Capr package can be used to generate cohort definitions in R. This lets you create cohorts that conform to the OHDSI JSON standard and can be easily exported as OHDSI SQL. This should be useful if the reason you’re using custom SQL is for templating purposes, for example.

Furthermore, if you don’t require the full dashboard for information about a cohort and just want quick characterization results, it is possible (and significantly faster) to run FeatureExtraction.

Gowtham_Rao · May 17, 2022, 12:26pm

@jpegilbert I think a use case we should consider supporting is “given an instantiated cohort table with the structure cohort_id, subject_id, cohort_start_date, cohort_end_date as the input, provide diagnostics on it.” From a technical perspective, even in the absence of cohort JSON and SQL, we can still do:

visit context
index event breakdown
cohort counts
all the characterization that relies on feature extraction
feature cohort characterization that relies on cohort relationship
cohort time series
cohort overlap

These are valuable diagnostics by themselves. Most projects that have scaled up in OHDSI have used template-based cohort definitions. I also think there is a use case where we can use previously generated cohorts as standard feature cohorts.

I am supportive of a technical solution for this. A simple way to do it is to allow the cohortDefinitionSet object to have empty cells for JSON and potentially SQL. For any cohort with empty values in these cells, we skip the concept set diagnostics, incidence rate, and cohort generation. We can put this logic in the executeDiagnostics segment.
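A minimal sketch of what that per-cohort skip logic might look like, in base R. Nothing here exists in CohortDiagnostics: `isBlank`, the `plan` data frame and its column names are hypothetical, and the mapping (JSON gates the concept set and incidence rate diagnostics, SQL gates cohort generation) is one reading of the proposal rather than an agreed design.

```r
# Hypothetical cohortDefinitionSet where some cohorts lack JSON and/or SQL.
cohortDefinitionSet <- data.frame(
  cohortId = c(1, 2, 3),
  cohortName = c("Circe cohort", "SQL-only cohort", "Table-only cohort"),
  sql = c("-- Circe-generated SQL", "-- hand-written SQL", NA),
  json = c('{"ConceptSets": []}', NA, ""),
  stringsAsFactors = FALSE
)

# Treat NA and whitespace-only strings as empty cells.
isBlank <- function(x) is.na(x) | trimws(x) == ""

hasJson <- !isBlank(cohortDefinitionSet$json)
hasSql <- !isBlank(cohortDefinitionSet$sql)

# Per-cohort plan: JSON-dependent steps run only where JSON is present,
# and generation runs only where SQL is present (assumed mapping).
plan <- data.frame(
  cohortId = cohortDefinitionSet$cohortId,
  runConceptSetDiagnostics = hasJson,
  runIncidenceRate = hasJson,
  runCohortGeneration = hasSql
)
print(plan)
```

Diagnostics that only need the instantiated cohort table, such as cohort counts, would stay enabled for every row regardless of these flags.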