How to debug cohort generation for Truven Marketscan Medicare data?

jaan · February 28, 2022, 9:17pm

Expected behavior

more than 1 person in a large cohort

Actual behavior

1 person in a cohort

Steps to reproduce behavior

Use Atlas to generate this cohort:
delivery-hospitalizations.txt
On the Truven Marketscan Medicare data source:

Note that there are 100k+ people in the other data sources, but only a single person in Medicare.

How would you debug this?

Any help would be appreciated. @toekneesunshine and I looked at the exported SQL but it is messy, and all the concept sets are compressed into a single line, so something like a stack trace or binary search to comment out parts of the query will be difficult. Has anyone debugged errors like this that are data source dependent?
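One way to approximate the "binary search" idea without editing the exported SQL is an attrition report: apply the criteria one at a time and record how many people remain after each step, so the step where the count collapses stands out. Below is a minimal sketch of that idea on in-memory toy data. The `attrition` helper, the rule names, and the records are all hypothetical and are not taken from the actual cohort definition or from MarketScan.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Person = Dict[str, object]
Rule = Tuple[str, Callable[[Person], bool]]


def attrition(people: Iterable[Person], rules: List[Rule]) -> List[Tuple[str, int]]:
    """Apply rules cumulatively and record how many people survive each one."""
    remaining = list(people)
    report = [("start", len(remaining))]
    for name, keep in rules:
        remaining = [p for p in remaining if keep(p)]
        report.append((name, len(remaining)))
    return report


# Hypothetical toy records, not real claims data.
people = [
    {"id": 1, "age": 30, "sex": "F", "inpatient": True},
    {"id": 2, "age": 70, "sex": "F", "inpatient": True},
    {"id": 3, "age": 68, "sex": "M", "inpatient": False},
    {"id": 4, "age": 72, "sex": "F", "inpatient": False},
]

# Hypothetical rules; the real cohort criteria may look quite different.
rules = [
    ("female", lambda p: p["sex"] == "F"),
    ("age 12-55", lambda p: 12 <= p["age"] <= 55),
    ("inpatient visit", lambda p: p["inpatient"]),
]

report = attrition(people, rules)
for name, n in report:
    print(f"{name}: {n}")
```

Running the same report against each data source and comparing the rows side by side would show which criterion behaves differently in the Medicare data.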

Alternatively, is there a point person at Truven to ask whether this data is preprocessed differently from the Multi-State Medicaid data, which might explain why only a single person ends up in this cohort?

Thanks!
Jaan

cc @karthik, @t_abdul_basser, @noemie

Patrick_Ryan · March 1, 2022, 2:46am

Hi @jaan, it looks like your cohort is looking for pregnant women. Note that the IBM MarketScan Medicare database only contains retirees who opt for supplemental Medicare coverage, so almost everyone in it is over 65 years old, and we wouldn't expect to see pregnant women in there.

jaan · March 1, 2022, 2:18pm

Thanks @Patrick_Ryan – that’s good to know!

That is very confusing because there is another cohort generated on the same data with 26k pregnancy-related outcomes (severe maternal morbidity).

Cohort for severe maternal morbidity with delivery hospitalization:
severe-maternal-morbidity-json.txt

There are 26k people who match these criteria on the Truven Medicare data.

Any advice on how to compute the diff between these two cohorts to find what the bug is?
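Since both cohort definitions can be exported as JSON, one possible approach is to flatten each document into path/value pairs and compare the two maps. The sketch below assumes nothing about the real Atlas export schema: the `flatten` and `diff_definitions` helpers and the toy keys (`entry`, `concept_set`, `visit`, `rules`) are hypothetical stand-ins for the two exported files.

```python
import json
from typing import Any, Dict, List


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts/lists into a flat {path: leaf_value} mapping."""
    flat: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}/{index}"))
    else:
        flat[prefix] = node  # leaf: string, number, bool, or None
    return flat


def diff_definitions(a: Any, b: Any) -> Dict[str, List[str]]:
    """Report paths present in only one document, or with different values."""
    fa, fb = flatten(a), flatten(b)
    return {
        "only_in_a": sorted(set(fa) - set(fb)),
        "only_in_b": sorted(set(fb) - set(fa)),
        "changed": sorted(k for k in set(fa) & set(fb) if fa[k] != fb[k]),
    }


# Hypothetical toy documents standing in for the two exported definitions.
cohort_a = json.loads(
    '{"entry": {"concept_set": "delivery codes", "visit": "inpatient"}, "rules": ["female"]}'
)
cohort_b = json.loads(
    '{"entry": {"concept_set": "procedure indicators"}, "rules": ["female"]}'
)

result = diff_definitions(cohort_a, cohort_b)
for section, paths in result.items():
    print(section, paths)
```

Note that list items are compared by position, so reordered concept sets would show up as changes; sorting the lists before flattening may be needed for a real comparison.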

Patrick_Ryan · March 1, 2022, 2:35pm

Thanks for sharing your JSON, it makes it easy to diagnose what may be going on. From a quick look, I think it has to do with your first entry event: ‘SMM 17-21 procedure indicators’, which appears to have some codes that wouldn’t be exclusive to a pregnancy episode, and these procedures aren’t being restricted to an inpatient visit or required to co-occur with a condition indicating pregnancy.

jaan · March 15, 2022, 8:49pm

Thanks @Patrick_Ryan – why would these codes be different across Truven Medicare, Multi-State Medicaid, CCAE and CUMC data?

Patrick_Ryan · March 16, 2022, 12:43pm

MDCD, MDCR, and CCAE will likely have similar diagnosis codes, but the impact on pregnancy-related codes will be quite different (since MDCR generally shouldn’t have pregnancies).

CUIMC will have different codes, given that it's EHR data and not restricted to ICD-based billing codes.

jaan · March 16, 2022, 12:50pm

Thank you @Patrick_Ryan, that is very helpful.

Would a histogram of codes in the cohort definition across MDCD, MDCR, CCAE, and CUIMC help understand whether the pregnancy-related codes or the lack of ICD-based billing codes in some data sources is the issue?
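As a rough sketch of what I mean by a histogram: count how often each code from the cohort definition occurs in each data source, then compare the counts side by side. The `code_histogram` helper, the code labels, and the rows below are made up for illustration; in practice the rows would come from a query against each source.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple


def code_histogram(rows: Iterable[Tuple[str, str]]) -> DefaultDict[str, Counter]:
    """Count occurrences of each code per data source.

    rows: (source_name, code) pairs, one per matching record.
    Returns {code: Counter({source_name: count})}.
    """
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for source, code in rows:
        counts[code][source] += 1
    return counts


# Hypothetical rows, not real query results.
rows = [
    ("MDCD", "code_A"),
    ("MDCD", "code_A"),
    ("CCAE", "code_A"),
    ("MDCR", "code_B"),
    ("CCAE", "code_B"),
]

hist = code_histogram(rows)
sources = ["CCAE", "MDCD", "MDCR"]
for code in sorted(hist):
    # A Counter returns 0 for sources where the code never appears.
    line = ", ".join(f"{s}={hist[code][s]}" for s in sources)
    print(f"{code}: {line}")
```

A code that is common in MDCD and CCAE but absent in MDCR (or the reverse) would point to which part of the definition drives the difference.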

jaan · March 16, 2022, 12:51pm

I am also wondering whether looking at the ETL pipeline for these data sources, to see how the data is loaded into Atlas, could help us understand why MDCR only has a single patient compared to the others.