Summary of output generated by CohortDiagnostics

peter.prinsen · July 5, 2023, 12:00pm

We have data in the OMOP-CDM and we are participating in studies. In order to participate in a study we need to do a data request. So far, we have been doing that based on the research protocol but I have noticed that in many studies we need to share CohortDiagnostics results. It seems CohortDiagnostics creates more (aggregated) data than is typically described in the protocol. So I need to add this to the data request. I was looking at the csv files that CohortDiagnostics creates and I was trying to make sense of them. A lot of it is clear but there are some files that I don’t understand. I am sure I am not the only one so my question is: Is there a description of the output somewhere (I have not been able to find it)?

Gowtham_Rao · July 5, 2023, 1:55pm

Thank you @peter.prinsen . I think you are touching on two ideas.

1. What is the output, i.e. the tables and their data structure?
2. What is it for, i.e. the idea of ‘Diagnostics’?

Here, 1 is just the data model of the output; 2 is how the data in the data model are joined together to make meaningful sense, e.g. the ‘Visit context’ diagnostics.

For 1 - the documentation is this. It should match the output files you see.

For 2 - the keystone is the idea of ‘diagnostics’. A study protocol could state which diagnostics it is running. By default, a large set of diagnostics is pre-selected in CohortDiagnostics, as shown here. A study may decide to change that, of course. How to interpret these diagnostics in a study is written here.

peter.prinsen · July 10, 2023, 9:42am

@Gowtham_Rao, thank you for the references and your quick response. This was very useful! The data model specification is great as an overview. It would help, though, if there was also a description of what each of the columns means (since I need to explain all that in the data request/IRB approval). Most are straightforward but I don’t understand the following columns:

cohort_inc_result: inclusion_rule_mask, mode_id
cohort_inc_stat: rule_sequence, person_count, person_total, gain_count, mode_id
cohort_summary_stats: base_count, final_count, mode_id

I was also wondering what information the vocabulary files contain (is_vocabulary_table = yes in the data model specification). It is not our complete vocabulary (the files are too small). Is it all the vocabulary information that is relevant for the diagnostics (in which case it does not contain personal data) or is it all the vocabulary information that is relevant for the diagnostics but only limited to what we use to code our data (in which case it does contain information about our data so I need to describe it)?

jpegilbert · July 10, 2023, 4:02pm

Hi @peter.prinsen, to try and answer your queries

I don’t understand the following columns:

cohort_inc_result: inclusion_rule_mask, mode_id

cohort_inc_stat: rule_sequence, person_count, person_total, gain_count, mode_id

cohort_summary_stats: base_count, final_count, mode_id

These tables are used by the CohortGenerator package for cohort inclusion rules. Each cohort definition can have any number of inclusion rules applied to the cohort. For example, you could restrict cohort entry to only those patients experiencing a drug, condition, visit type, etc.

When cohorts are generated, statistics are recorded for how many patients are gained or lost as a result of these rules.
The masking rules relate to each individual inclusion rule and allow the calculation of a full attrition table. Some of our newer packages actually display this attrition visualization and it should make its way into CohortDiagnostics in a future version. However, for now these are just displayed as raw numbers in the Cohort counts section of the Diagnostics Explorer shiny app.
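
To illustrate how an attrition table could be derived from per-mask counts, here is a minimal Python sketch. It is not taken from CohortGenerator or CohortDiagnostics. It assumes the mask is stored as an integer whose binary digits, read right to left, correspond to inclusion rules 1, 2, 3, and so on. The function name `attrition` and the counts in `example_counts` are hypothetical.

```python
def attrition(mask_counts, n_rules):
    """Persons remaining after applying rules 1..k in order, for k = 0..n_rules.

    mask_counts maps an inclusion rule mask (int) to a person count.
    Assumption: bit 0 of the mask is the 1st rule, bit 1 the 2nd, and so on.
    """
    remaining = []
    for k in range(n_rules + 1):
        required = (1 << k) - 1  # bits for the first k rules
        remaining.append(
            sum(c for m, c in mask_counts.items() if (m & required) == required)
        )
    return remaining


# Hypothetical counts per mask for a cohort with three inclusion rules.
example_counts = {0b000: 5, 0b001: 10, 0b011: 20, 0b100: 3, 0b111: 62}
attrition_counts = attrition(example_counts, 3)
print(attrition_counts)  # [100, 92, 82, 62]
```

The first entry is the count before any rule is applied; each later entry is the count after adding one more rule in sequence.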

I was also wondering what information the vocabulary files contain (is_vocabulary_table = yes in the data model specification).

The vocabulary tables in question are for concept set analysis. The concepts stored are those found in any concept sets used within the study, and the tables exported are therefore subsets of:

concept
concept_ancestor
concept_synonym
vocabulary (which is used to see which vocabulary version the CDM is on which can help with uncovering inconsistencies)

In principle these contain no patient-level information, as there will be only a single record if the concept actually occurs.

However, as you say this only relates to the concept sets in your study but this is really just a concept name field and meta-data for data collected elsewhere (e.g. there are already concept counts and

peter.prinsen · August 25, 2023, 10:05am

Hi @jpegilbert

Thank you for your answer, this is very helpful! I thought the study package we were running only had CohortDiagnostics results, but they include CohortGenerator results as well. To get permission to share the data I need to explain in a bit more detail how the files are generated. So I have a few follow-up questions about the columns in the files that I hoped you could answer:

In cohortIncResult.csv:

inclusionRuleMask: binary mask that indicates which inclusion rules are applied for this specific count (please let me know if this is not correct)
modeId: what does this mean?

In cohortIncStat.csv:

ruleSequence: ???
personCount: ???
gainCount: ???
personTotal: ???
modeId: ???

I get the feeling that this is something like count with all inclusion rules (personTotal) and count with a certain subset (subset = ruleSequence and count = personCount) and the difference between the two (gainCount) but maybe you could clarify?

In cohortSummaryStats.csv:

baseCount: ???
finalCount: ???
modeId: ???

There is also a cohortCensorStats.csv and cohortInclusion.csv file but these are empty for this particular study. Could you explain what they contain as well?

Thanks again for any help you could provide. We are running more and more studies for external partners so we are trying to standardize the data requests in our organization. Since CohortDiagnostics (and CohortGenerator) are packages that are often run I am creating documents explaining the output they generate for the people that handle the data requests. For CohortDiagnostics I was able to do most of that from the output files and the package code. I am having a harder time understanding the output of CohortGenerator and the code of that package.

peter.prinsen · September 5, 2023, 3:19pm

I did some experimenting with CohortGenerator and Atlas and I think I answered my own questions. I am posting the answers below for others with the same questions:

In cohortIncResult.csv:

inclusionRuleMask:
binary representation of inclusion rule passing/failing:
0 = failed
1 = passed
For example, if there are three inclusion rules then 100 means passing of the 3rd inclusion rule (as defined in cohortInclusion.csv) and failing of the 1st and 2nd.
modeId:
0 = all events: all cohort entry events for a person are included in the calculation
1 = best event: only one cohort entry event for each person is used in the calculation, the so-called ‘best event’. If a person has multiple cohort entry events then the best event is the event that satisfies the most inclusion rules. If there is a tie then from those that tied, the one that matches the earliest inclusion rule is picked. If there is still a tie then the earliest event is chosen from those that are tied.
See circe-be/src/main/resources/resources/cohortdefinition/sql/generateCohort.sql
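
The two definitions above can be sketched in Python. This is only a reading of the description in this post, not a port of generateCohort.sql. It assumes the mask is an integer whose lowest bit is the 1st inclusion rule, and that "matches the earliest inclusion rule" means the lowest-numbered rule that an event passes. The function names and the example events are hypothetical.

```python
from datetime import date


def decode_mask(mask, n_rules):
    """Return pass/fail per rule; index 0 is the 1st inclusion rule."""
    return [bool((mask >> i) & 1) for i in range(n_rules)]


def best_event(events):
    """Pick one event for a person, using the tie-break order described above.

    events: list of (start_date, mask) tuples for a single person.
    """
    def key(event):
        start, mask = event
        passed = bin(mask).count("1")  # number of rules satisfied
        # Position of the earliest rule passed; events passing no rule sort last.
        earliest = (mask & -mask).bit_length() if mask else float("inf")
        return (-passed, earliest, start)

    return min(events, key=key)


# 100 in binary: the 3rd rule passed, the 1st and 2nd failed.
flags = decode_mask(0b100, 3)

# Hypothetical cohort entry events for one person.
events = [
    (date(2020, 1, 1), 0b100),
    (date(2020, 3, 1), 0b011),
    (date(2020, 2, 1), 0b110),
    (date(2020, 5, 1), 0b011),
]
chosen = best_event(events)
print(flags)   # [False, False, True]
print(chosen)  # (datetime.date(2020, 3, 1), 3)
```

Three events pass two rules each; the two with mask 011 pass rule 1, and the earlier of those is chosen.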

In cohortIncStat.csv:

ruleSequence: specific inclusion rule as defined in cohortInclusion.csv
personCount: number of persons with the cohort entry event that pass the inclusion rule in ruleSequence
gainCount: number of persons in the cohort when all inclusion rules are applied except the one in ruleSequence minus the number of persons in the cohort when all inclusion rules are applied (including the one in ruleSequence)
personTotal: number of persons with the cohort entry event
modeId: see above
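
As a rough check of these definitions, the three counts can be recomputed from per-mask person counts for one cohort and one modeId. The sketch below is illustrative and not CohortGenerator code. It assumes integer masks with the lowest bit as the 1st rule, and it numbers rules from 1 for readability without assuming how ruleSequence is numbered in the actual files. The function name and counts are hypothetical.

```python
def inclusion_stats(mask_counts, n_rules):
    """Recompute personTotal, personCount and gainCount per inclusion rule.

    mask_counts maps an inclusion rule mask (int) to a person count.
    """
    full = (1 << n_rules) - 1                  # mask with every rule passed
    person_total = sum(mask_counts.values())   # everyone with an entry event
    passed_all = mask_counts.get(full, 0)
    rows = []
    for i in range(n_rules):
        bit = 1 << i
        # Persons who pass this rule, regardless of the other rules.
        person_count = sum(c for m, c in mask_counts.items() if m & bit)
        # Persons who pass every rule other than this one.
        without_rule = sum(c for m, c in mask_counts.items() if (m | bit) == full)
        rows.append({
            "rule": i + 1,  # 1-based here; numbering in the real files not assumed
            "person_count": person_count,
            "gain_count": without_rule - passed_all,
            "person_total": person_total,
        })
    return rows


# Hypothetical counts per mask for a cohort with three inclusion rules.
example_counts = {0b000: 5, 0b001: 10, 0b011: 20, 0b100: 3, 0b111: 62}
stats = inclusion_stats(example_counts, 3)
for row in stats:
    print(row)
```

In this example only rule 3 has a non-zero gainCount (20): dropping it would admit the 20 persons who pass rules 1 and 2 but fail rule 3.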

In cohortSummaryStats.csv:

baseCount: number of persons with the cohort entry event
finalCount: number of persons with the cohort entry event that pass all the inclusion rules
modeId: see above
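
Under the same assumptions (integer masks, one cohort, one modeId), baseCount and finalCount follow directly from the per-mask counts. This is a minimal sketch with a hypothetical function name and made-up numbers, not the package's own code.

```python
def summary_stats(mask_counts, n_rules):
    """Return (base_count, final_count) from per-mask person counts."""
    full = (1 << n_rules) - 1                # mask with every rule passed
    base_count = sum(mask_counts.values())   # all persons with an entry event
    final_count = mask_counts.get(full, 0)   # persons passing all rules
    return base_count, final_count


# Hypothetical counts per mask for a cohort with three inclusion rules.
base_count, final_count = summary_stats(
    {0b000: 5, 0b001: 10, 0b011: 20, 0b100: 3, 0b111: 62}, 3
)
print(base_count, final_count)  # 100 62
```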

Gowtham_Rao · September 5, 2023, 7:41pm

@peter.prinsen I did a similar post here yesterday.