
Duplicate Diagnosis [THEMIS WG3]

One of the items that came out of the OHDSI Symposium Themis Working Group meeting was how to handle the same diagnosis coming across multiple times on the same claim or visit.

We would like to recommend the following, not as a rule but as a recommendation. We propose adding this statement to the conventions section of the CONDITION_OCCURRENCE page on the CDM wiki.

In claims data, multiple records for the same diagnosis often exist with different priorities (e.g. ICD9 250 is the primary diagnosis as well as the 5th diagnosis; this can happen due to coding practices or because multiple claims associated with one visit are out of sync in terms of diagnosis priority). It is recommended, not required, to select the highest priority for the condition and eliminate the duplicate (e.g. keep ICD9 250 as a primary diagnosis instead of the 5th diagnosis).
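To make the proposed statement concrete, here is a minimal sketch of "keep the highest priority and drop the duplicate". It is not taken from any existing ETL; the `Diagnosis` tuple, its field names, and the sample rows are assumptions made up for illustration, with position 1 standing for the primary diagnosis.

```python
from typing import Iterable, NamedTuple


class Diagnosis(NamedTuple):
    visit_id: int   # hypothetical visit or claim identifier
    code: str       # source diagnosis code, e.g. "250"
    position: int   # 1 = primary diagnosis, 5 = 5th diagnosis


def keep_highest_priority(rows: Iterable[Diagnosis]) -> list[Diagnosis]:
    """Keep one row per (visit_id, code): the one with the lowest position."""
    best: dict[tuple[int, str], Diagnosis] = {}
    for row in rows:
        key = (row.visit_id, row.code)
        if key not in best or row.position < best[key].position:
            best[key] = row
    return sorted(best.values(), key=lambda r: (r.visit_id, r.position))


rows = [
    Diagnosis(1, "250", 1),
    Diagnosis(1, "401", 2),
    Diagnosis(1, "250", 5),  # same code again as the 5th diagnosis
]
deduped = keep_highest_priority(rows)
```

In this example `deduped` keeps ICD9 250 once, as the primary diagnosis, and leaves the unrelated code 401 untouched.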

We are interested in your feedback and thoughts.

The same topic came up with procedures in THEMIS WG 1. In data that lack hour and minute granularity, what do you do with 2 instances of an event (e.g., resuscitation) on the same date? Do you keep two identical rows (if they share all possible values, even the same provider_id) or do you collapse them? We came up with some similar assumptions for procedures.

What benefit would deduplicating provide? The technical performance gain should be minimal, because I don’t think this is a highly prevalent problem. As for a methodological gain, I don’t see any important reason why we would want to count the total number of times a person has a diagnosis code; I do see value in the number of claim/visit records with the diagnosis code.

What are the risks of having a duplicated diagnosis code for the same record that this convention would avoid?


@Vojtech_Huser but I would think of procedures differently. You can have many of the same procedure in the same day. When you are getting chemo, you may have multiple port flushes in a day, and those could be individual procedures with a charge associated with them. However, getting diagnosed with lymphoma twice in one day doesn’t mean you have more lymphoma.


We wanted to pose this as a recommendation not as a rule. However you pose a good question here . . . I think this came up during THEMIS as someone looking for a recommendation of what should be done in this situation. Several sites mentioned they do this and some of our data vendors do this as well.

@mvanzandt or @jenniferduryea did you mention that your CDMs do this? I know we do it but I feel we mostly do it to clean up the data and remove inconsistencies (e.g. diagnosis is PDX and DX5 in same visit).

Looking forward to more conversation on this . . . please keep the thoughts and questions coming . . .

If we could standardize on deduplication, then we could have a standardized meaning when we say: have at least 3 diagnoses after exposure. Without deduplication, those 3 could all occur on the same day/visit. If we dedupe to one single diagnosis per day or visit, then we can rely on counting diagnoses that occurred on different days or visits, for example.
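The counting problem can be shown with a small sketch. The dates and the helper name `distinct_diagnosis_days` are invented for illustration; the point is only the difference between counting rows and counting distinct days.

```python
from datetime import date


def distinct_diagnosis_days(diagnosis_dates, exposure_date):
    """Count distinct dates carrying the diagnosis strictly after exposure."""
    return len({d for d in diagnosis_dates if d > exposure_date})


# Hypothetical records for one person and one diagnosis code:
# three rows on the same day, then one row a month later.
diagnosis_dates = [
    date(2017, 3, 1),
    date(2017, 3, 1),
    date(2017, 3, 1),
    date(2017, 4, 2),
]
exposure = date(2017, 1, 15)

raw_rows = sum(1 for d in diagnosis_dates if d > exposure)
distinct_days = distinct_diagnosis_days(diagnosis_dates, exposure)
```

Here `raw_rows` is 4, so "at least 3 diagnoses after exposure" is satisfied by a single day of duplicated rows, while `distinct_days` is 2 and the criterion fails. The same idea applies to distinct visits by collecting visit identifiers instead of dates.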

What is the example dataset that you found duplicate diagnoses in?

I have looked at Marketscan and Optum and find that I only run into issues with duplicate diagnoses when I’m ETL’ing from a table that “summarizes” multiple claims together on one line. So you will get duplicate diagnoses because of the “summarization” process by the data vendor. For example, in Marketscan, I only see duplicate diagnosis codes in the Admissions table, which creates one record for each patient’s inpatient stay. A patient’s inpatient stay could contain multiple claims records. So when Truven summarizes the claims into one record in the Admissions table, the same diagnosis code (actually coming from two different claims) can end up twice on that one record. This duplicate diagnosis issue is a construct created by the data vendor, not what is actually being reported on the medical claim itself. Physicians/hospitals cannot put duplicate diagnosis codes on their claims (FYI).

We ETL the data at the most granular level, then look at the summary tables for extra diagnoses that we might have missed and assign them to the facility claim.

@ericaVoss your example of pdx and dx5 is only relevant for facility claims in the S table for Marketscan. Truven takes the primary diagnosis (dx1) for the main facility claim in the S table by case_ID and applies that same pdx value to all claims with the same case_id. The pdx is an imputed variable. That pdx diagnosis actually came from one claim - the facility claim. We do not ETL the pdx for the physician claims. So that also decreases the duplication to minimal if not non-existent.


Agreed, this doesn’t happen very often, and it is the scenario you describe: either the details do not match each other, or the details and the summary do not match. I think we initially noticed it in Truven but it also occurred in Optum. Additionally, Optum Extended is now preprocessing the data to eliminate this problem further; if the claim details do not agree, they select the order from the first line.

This came up in THEMIS as something people wanted to have a recommendation on. This isn’t meant to be a rule. Given the feedback above, maybe we should soften it?

In claims, multiple records for the same diagnosis can exist with different priorities (e.g. ICD9 250 is the primary diagnosis as well as the 5th diagnosis). A CDM ETL can choose to keep all diagnoses or to eliminate duplicates by keeping the highest-priority one (e.g. keep ICD9 250 as a primary diagnosis and eliminate the duplicate 5th diagnosis).

In our EHR data we also have multiple records for the same or slightly different diagnosis. “chief complaint”, “billing diagnosis”, and “encounter diagnosis” are the main types of provenance for current conditions.
We are going to keep all the records for our research purposes. I believe the best way to differentiate the different sources for condition_concept_ids is through the condition_type_concept_id field. Chief complaint is represented as concept_id = 42894222 “EHR Chief Complaint”. However, billing diagnosis and encounter diagnosis do not have a matching standard concept in the condition type concept domain.

I propose 2 more condition_type_concept_ids be created.

  1. billing diagnosis
  2. encounter diagnosis

I’m not very familiar with claims data, but would the type_concept_id field help differentiate the different sources for the claims data? There are values for the multiple different positions : Outpatient detail - 5th position, Carrier claim detail - 1st position, Inpatient detail - primary, etc.

In our claims dataset, we have multiple records that have the same diagnosis codes. The way our claims data comes in, each record is one gigantic line for each line item on the claim. This results in the same diagnosis being repeated many times, both across positions and across records. We do have the positions in our dataset as well; however, we’ve been told that if the same diagnosis appears in multiple positions, we should take the first position. The reason is that, for many of our analyses, we then don’t have to de-duplicate the same condition codes during the analysis phase.

We also have other EMR datasets that have duplicated diagnosis codes, and in those datasets there is no way to tell whether a diagnosis is primary or secondary.

@MPhilofsky - we use the concept type to help us distinguish between different diagnosis types, like admitting, primary, and secondary.

@ericaVoss For procedures, the rule is based on a combination of fields. If there is a duplicate, then we increment the quantity by 1, so we are not losing any information. The combination of fields is person_id, date, procedure code, provider, visit_occurrence_id, and visit_detail_id. If this combination is not met, then we leave them as separate records.
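A rough sketch of that procedure rule, using the six fields listed above as the matching key. This is an illustration rather than the actual ETL code: the `ProcedureKey` type, the sample values, and the assumption that every source row starts with a quantity of 1 are all hypothetical.

```python
from collections import Counter
from datetime import date
from typing import NamedTuple, Optional


class ProcedureKey(NamedTuple):
    person_id: int
    procedure_date: date
    procedure_code: str
    provider_id: Optional[int]
    visit_occurrence_id: Optional[int]
    visit_detail_id: Optional[int]


def collapse_procedures(rows):
    """Merge rows with an identical key; the count becomes the quantity.

    Assumes each source row represents a quantity of 1.
    """
    return dict(Counter(rows))


day = date(2017, 3, 1)
rows = [
    ProcedureKey(1, day, "PORT_FLUSH", 10, 100, 1000),
    ProcedureKey(1, day, "PORT_FLUSH", 10, 100, 1000),  # full match: merged
    ProcedureKey(1, day, "PORT_FLUSH", 20, 100, 1000),  # other provider: kept apart
]
collapsed = collapse_procedures(rows)
```

The two fully matching rows become one record with quantity 2, and the row from the other provider stays a separate record with quantity 1, so the total number of procedures is still recoverable.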

@mvanzandt if you are referring to pharmetrics dataset, the reason you have multiple records with the same diagnosis codes in the same positions is because all of those lines originate from one claim (i.e. the rows have the same claimno value). Most of the time, facility claims are split up this way, where one claim is represented by multiple lines (one per revenue code). When representing this information in the CDM, you should combine those rows together, since they all came from one claim. There shouldn’t be any duplicate diagnoses (ex: diag1 and diag5 have the same values).

Now you run into duplicate diagnosis codes when you are combining professional and facility claims into one admission (i.e. all claim lines are using the same conf_num) because doctors are reporting the same diagnosis that the hospital (facility) is reporting, for billing purposes. In this instance, I think we have something similar to what @MPhilofsky is referring to for her EHR data where you can have the same diagnosis reported for different reasons and professionals. For this claim example, you are getting the same diagnosis code reported by the hospital/facility and one reported by the physician. I would actually keep both of these codes, even though the source code is the same, because the codes come from different sources (one from a facility claim and one from a physician claim). There are multiple published articles that show different algorithms employed for diagnosis reported on inpatient facility claims vs. professional claims (most notably Klabunde et al 2000). Diagnoses from physician claims are treated differently than diagnoses from inpatient facility claims. To show the difference between these two condition source values, I would assign a type_concept_id, as @MPhilofsky suggested (and which we employ locally as well).

@jenniferduryea You are correct that the way our claims dataset works, it’s one record for each line item on one claim.

“There shouldn’t be any duplicate diagnoses (ex: diag1 and diag5 have the same values).” The problem is that there are. This is why we take the first position. So if you do have diag1 and diag5 with the same value, we take diag1, thus removing duplicates.

@MPhilofsky - good example, I agree this is not a situation where you want to collapse.


@mvanzandt - Very nice, slightly more sophisticated than ours because you increment the quantity, but otherwise I think we are doing something very similar.


Getting back to the THEMIS recommendation, I’m hearing arguments for both “keep all” and “eliminate duplicates”. However, I’m not sure of the best way to describe “eliminate duplicates” without causing confusion about when it is appropriate.

I am very concerned about de-duplicating diagnoses, especially in claims data, because it seems that people would get very confused as to when to de-duplicate. If you are truly recreating the claim, you should never have the issue where the same diagnosis code is referenced on multiple diagnosis positions (i.e. diag1 and diag5 are both icd9 250.00). I know that large datasets could contain weird data, so I’m not saying it’s an impossibility. But, the possibility of this happening is so rare, it’s negligible.

When people talk about duplicate diagnoses, I’m concerned they are referring to cases where diagnoses are represented on different claims but refer to the same visit_occurrence_id. And that is a problem, because the diagnoses are not exact duplicates. The code comes from a different claim (and hence, a different provider) and should have a different type_concept_id. Diagnoses coming from different providers carry different certainty (i.e. inpatient facility diagnoses have more weight than physician claim diagnoses). This is similar to @MPhilofsky’s example of having the chief complaint, billing diagnosis, and encounter diagnosis refer to the same visit_occurrence_id and have the same source code. They are not duplicates. Claims data has the same issue, but people have to be very aware of the type of claim the diagnosis comes from to see they are not the same.

I suggest you keep all diagnoses in and use type_concept_ids to filter out what you want.

So, could we say a ‘unique diagnosis’ is a composite of the diagnosis code (condition_concept_id) and the condition_type_concept_id, and de-dupe based on that? I think I understand that it’s supposed to be rare, but sometimes I feel like I’m living in the corner-cases.

-Chris

@Chris_Knoll It gets even more confusing because the data providers make this mistake too, and propagate it in the data they sell.
I should be careful not to blame data providers specifically. But somewhere in the path there is a misunderstanding and users get the data with the errors already in place.


If we are using type_concept_ids 38000183-45756855, then that could work. But then I pose another question to the group - what if you have two different condition_occurrence.associated_provider_id records reporting the condition records with the same condition_concept_id and condition_type_concept_id? Do you want to preserve who the reporting physicians are? Another question is, what if the condition_start_date/condition_end_dates are different between the two condition_occurrence records (with the same concept_id and type_concept_id). Sure, these conditions will have the same visit_occurrence_id, but the associated dates for conditions that arise during a long inpatient stay might be relevant to analysis. And that would be reported on associated physician claims.

I would only de-dupe, if all variables (i.e. values in the table columns) in the condition_occurrence records are the same - except for the condition_occurrence_id. Which does seem like a rare occurrence - especially if using type_concept_ids 38000183-45756855. It doesn’t seem worth the effort to try to de-dup.
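As a sketch of that strict rule, the function below drops a record only when every column except condition_occurrence_id matches an earlier record. The column names follow the CDM loosely and the sample values (including the condition_concept_id 111) are hypothetical; 38000183 is used only as an opaque type value quoted from this thread.

```python
from datetime import date


def drop_exact_duplicates(records):
    """Keep the first of any records identical in all columns except the id."""
    seen = set()
    kept = []
    for rec in records:
        # Fingerprint = every (column, value) pair apart from the surrogate key.
        fingerprint = tuple(
            sorted((k, v) for k, v in rec.items() if k != "condition_occurrence_id")
        )
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(rec)
    return kept


base = {
    "condition_concept_id": 111,            # hypothetical concept
    "condition_type_concept_id": 38000183,  # opaque value from the thread
    "condition_start_date": date(2017, 3, 1),
    "provider_id": 10,
    "visit_occurrence_id": 100,
}
records = [
    {"condition_occurrence_id": 1, **base},
    {"condition_occurrence_id": 2, **base},                       # exact duplicate
    {"condition_occurrence_id": 3, **base, "provider_id": 20},    # different provider
]
kept = drop_exact_duplicates(records)
```

Only record 2 is removed; record 3 survives because its provider differs, which preserves who reported the condition.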

I would say that we’d want to know the provider because in some cases the provider specialty (like dermatologist for skin conditions) is considered when selecting the diagnosis. So to maintain that, you’d have rows with the same diagnosis code and type, but with a different provider. We could say this isn’t a dupe in this case, and add provider ID to the ‘distinguishing factors’.

I always thought of a diagnosis as a ‘moment in time’ and not something that would have a duration, but the CDM calls for a start and end of the condition. So maybe it’s up to the ETL to collapse these things into a continuous duration of the condition, but you run into issues when multiple providers are in the mix (there’s only 1 provider per condition occurrence record).

I think you bring up a lot of great points that probably are only a consideration during the design of a specific study/analysis, so perhaps it makes sense to leave the granular details in the tables and let the researcher decide how to ‘roll-up’ multiple records into a single event.

Part of my interest in this conversation is how CIRCE cohort expressions are built. On one hand, I want to try to enforce some rules so that the cohort criteria have expected results. On the other hand, there isn’t a one-size-fits-all solution for all cases. What I’m leaning towards, based on your feedback, @jenniferduryea, is that the criteria should support settings that allow a ‘group by’ on the selected condition records based on just the start/end, provider, condition_type, etc., and leave it to the researcher to decide what constitutes a distinct diagnosis for their research question. The devil is in the details, tho, so I don’t have a specific implementation in mind, but this information is very helpful.

-Chris
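One way to picture the ‘group by’ idea is a function where the researcher passes in the fields that define a distinct diagnosis. This is only a sketch and not how CIRCE works or will work; the function name, the dictionary layout, and the sample values are assumptions, and the two type concept ids are used purely as opaque values that appear in this thread.

```python
from collections import defaultdict
from datetime import date


def group_condition_records(records, group_by):
    """Group records by the chosen fields; each group is one 'distinct diagnosis'."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in group_by)
        groups[key].append(rec)
    return dict(groups)


day = date(2017, 3, 1)
records = [
    {"condition_concept_id": 111, "condition_type_concept_id": 38000183,
     "provider_id": 10, "condition_start_date": day},
    {"condition_concept_id": 111, "condition_type_concept_id": 38000183,
     "provider_id": 20, "condition_start_date": day},
    {"condition_concept_id": 111, "condition_type_concept_id": 42894222,
     "provider_id": 10, "condition_start_date": day},
]

by_concept = group_condition_records(records, ["condition_concept_id"])
by_type = group_condition_records(
    records, ["condition_concept_id", "condition_type_concept_id"])
by_provider = group_condition_records(
    records, ["condition_concept_id", "condition_type_concept_id", "provider_id"])
```

The same three rows count as 1, 2, or 3 distinct diagnoses depending on the fields chosen, which is exactly the decision being left to the study design.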


I cannot agree MORE with this statement. YES to this!

US health claims are mostly like this (https://github.com/OHDSI/Vocabulary-v5.0/issues/156). Of course, data providers may transform the data into any form they want. Inpatient vs. outpatient indicators are derived information, calculated by the data providers.

Instead of condition_type, I think we should focus on enhancing visit_type_concept_id. For US claims, I think the Themis convention for condition_type_concept_id should build on top of visit_type_concept_id and only say whether the condition is an admitting dx, primary dx, or secondary dx. The actual position (2nd, 3rd, etc.) is not important and mostly arbitrary. The relative numeric rank among secondary dx does not represent acuity or importance of the diagnosis. Don’t mix place of service with condition_type.

I don’t see value in keeping duplicates among secondary dx, but I think primary, secondary, and admitting dx should not be deduplicated across each other. Duplicates may be a data quality consideration for Achilles Heel. Leave deduplication to the analytic side, not the ETL side.

Recommended enhancements to visit_type_concept_id and reducing the complexity of condition_type_concept_id


I’m not following you, @Gowtham_Rao. This is one visit from our EHR record. How do you suggest we enhance the visit_type_concept_id concept set to differentiate between billing and encounter diagnoses?
