Duplicate Diagnosis [THEMIS WG3]

In our claims dataset, we have multiple records that carry the same diagnosis codes. The way our claims data comes in, each record is one gigantic line for each line item on the claim. This results in the same diagnosis being repeated many times, both across positions and across records. We do have the positions in our dataset as well; however, we’ve been told that if the same diagnosis appears in multiple positions, we should take the first position. The reason is that, for many of our analyses, we then don’t have to de-duplicate the same condition codes during the analysis phase.
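
To make the "take the first position" rule concrete, here is a minimal sketch in Python. It assumes a claim line is a dict with diagnosis columns named diag1 through diag5; the function name and the number of positions are illustrative assumptions, not taken from any real dataset layout. It only illustrates the local rule described above.

```python
def first_position_diagnoses(claim_line,
                             positions=("diag1", "diag2", "diag3", "diag4", "diag5")):
    """Return (position, code) pairs, keeping each code only at its earliest position."""
    seen = set()
    kept = []
    for pos in positions:
        code = claim_line.get(pos)
        # Skip empty positions and codes already kept from an earlier position.
        if not code or code in seen:
            continue
        seen.add(code)
        kept.append((pos, code))
    return kept
```

For a line where diag1 and diag5 hold the same code, only the diag1 entry survives.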

We also have other EMR datasets that have duplicated diagnosis codes, and in those datasets there is no way to tell whether a diagnosis is primary or secondary.

@MPhilofsky - we use the concept type to help us distinguish between different diagnosis types, like admitting, primary, and secondary.

@ericaVoss For procedures, the rule is based on a combination of fields. If there is a duplicate, we increment the quantity by 1, so we are not losing any information. The combination of fields is person_id, date, procedure code, provider, visit_occurrence_id, and visit_detail_id. If two records do not match on this full combination, we leave them as separate records.
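
A rough sketch of that procedure rule in Python, under a few assumptions: rows are plain dicts, the column names in KEY_FIELDS are my guess at how "date", "procedure code", and "provider" map to columns, and every source row starts with a quantity of 1. None of this is the actual ETL code.

```python
# Fields treated as the duplicate-detection key (column names are assumptions).
KEY_FIELDS = ("person_id", "procedure_date", "procedure_source_value",
              "provider_id", "visit_occurrence_id", "visit_detail_id")


def collapse_procedures(rows):
    """Merge rows that match on every key field, counting repeats in `quantity`."""
    merged = {}
    for row in rows:
        key = tuple(row.get(field) for field in KEY_FIELDS)
        if key in merged:
            # Same combination seen before: bump the quantity instead of adding a row.
            merged[key]["quantity"] += 1
        else:
            # First time this combination appears (assumes a starting quantity of 1).
            merged[key] = dict(row, quantity=1)
    return list(merged.values())
```

Rows that differ in any key field, for example a different provider, stay as separate records.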

@mvanzandt if you are referring to pharmetrics dataset, the reason you have multiple records with the same diagnosis codes in the same positions is because all of those lines originate from one claim (i.e. the rows have the same claimno value). Most of the time, facility claims are split up this way, where one claim is represented by multiple lines (one per revenue code). When representing this information in the CDM, you should combine those rows together, since they all came from one claim. There shouldn’t be any duplicate diagnoses (ex: diag1 and diag5 have the same values).

Now you run into duplicate diagnosis codes when you are combining professional and facility claims into one admission (i.e. all claim lines are using the same conf_num) because doctors are reporting the same diagnosis that the hospital (facility) is reporting, for billing purposes. In this instance, I think we have something similar to what @MPhilofsky is referring to for her EHR data where you can have the same diagnosis reported for different reasons and professionals. For this claim example, you are getting the same diagnosis code reported by the hospital/facility and one reported by the physician. I would actually keep both of these codes, even though the source code is the same, because the codes come from different sources (one from a facility claim and one from a physician claim). There are multiple published articles that show different algorithms employed for diagnosis reported on inpatient facility claims vs. professional claims (most notably Klabunde et al 2000). Diagnoses from physician claims are treated differently than diagnoses from inpatient facility claims. To show the difference between these two condition source values, I would assign a type_concept_id, as @MPhilofsky suggested (and which we employ locally as well).

@jenniferduryea You are correct that, the way our claims dataset works, it’s one record for each line item on a claim.

"There shouldn’t be any duplicate diagnoses (ex: diag1 and diag5 have the same values)." The problem is that there are. This is why we take the first position: if diag1 and diag5 have the same value, we take diag1, thus removing duplicates.

@MPhilofsky - good example, I agree this is not a situation where you want to collapse.


@mvanzandt - Very nice. Incrementing the quantity is slightly more sophisticated than our approach, but otherwise I think we are doing something very similar.


Getting back to the THEMIS recommendation: I’m hearing arguments for both “keep all” and “eliminate duplicates”. However, I’m not sure of the best way to describe “eliminate duplicates” without causing confusion about when it is appropriate.

I am very concerned about de-duplicating diagnoses, especially in claims data, because it seems that people would get very confused as to when to de-duplicate. If you are truly recreating the claim, you should never have the issue where the same diagnosis code is referenced on multiple diagnosis positions (i.e. diag1 and diag5 are both icd9 250.00). I know that large datasets could contain weird data, so I’m not saying it’s an impossibility. But, the possibility of this happening is so rare, it’s negligible.

When people talk about duplicate diagnoses, I’m concerned they are referring to diagnoses that appear on different claims but refer to the same visit_occurrence_id. That is a problem, because those diagnoses are not exact duplicates. The code comes from a different claim (and hence a different provider) and should have a different type_concept_id. Diagnoses coming from different providers carry different certainty (i.e. inpatient facility diagnoses have more weight than physician claim diagnoses). This is similar to @MPhilofsky’s example of having the chief complaint, billing diagnosis, and encounter diagnosis refer to the same visit_occurrence_id and have the same source code. They are not duplicates. Claims data has the same issue, but people have to be very aware of the type of claim the diagnosis comes from to see that they are not the same.

I suggest you keep all diagnoses in and use type_concept_ids to filter out what you want.

So, could we say a ‘unique diagnosis’ is a composite of the diagnosis code (condition_concept_id) and the condition_type_concept_id, and de-dupe based on that? I understand that it’s supposed to be rare, but sometimes I feel like I’m living in the corner cases.

-Chris

@Chris_Knoll It gets even more confusing because the data providers make this mistake too, and propagate it in the data they sell.
I should be careful not to blame data providers specifically. But somewhere in the path there is a misunderstanding and users get the data with the errors already in place.

If we are using type_concept_ids 38000183-45756855, then that could work. But then I pose another question to the group - what if you have two different condition_occurrence.associated_provider_id records reporting the condition records with the same condition_concept_id and condition_type_concept_id? Do you want to preserve who the reporting physicians are? Another question is, what if the condition_start_date/condition_end_dates are different between the two condition_occurrence records (with the same concept_id and type_concept_id). Sure, these conditions will have the same visit_occurrence_id, but the associated dates for conditions that arise during a long inpatient stay might be relevant to analysis. And that would be reported on associated physician claims.

I would only de-dupe if all variables (i.e. values in the table columns) in the condition_occurrence records are the same, except for the condition_occurrence_id. That does seem like a rare occurrence, especially if using type_concept_ids 38000183-45756855. It doesn’t seem worth the effort to try to de-dupe.
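
For anyone who does want this strictest form of de-duplication, a minimal Python sketch follows. It assumes each condition_occurrence record is a dict of hashable column values; the function name is hypothetical. Two records count as duplicates only when every column except condition_occurrence_id matches.

```python
def drop_exact_duplicates(records, id_field="condition_occurrence_id"):
    """Keep the first of any records that are identical in every column except the id."""
    seen = set()
    kept = []
    for rec in records:
        # Signature of the record with the surrogate id left out.
        signature = frozenset((col, val) for col, val in rec.items() if col != id_field)
        if signature in seen:
            continue
        seen.add(signature)
        kept.append(rec)
    return kept
```

Any difference in provider, dates, or type_concept_id keeps both records, which matches the point that such rows are not true duplicates.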

I would say that we’d want to know the provider, because in some cases the provider specialty (like dermatologist for skin conditions) is considered when selecting the diagnosis. To maintain that, you’d have rows with the same diagnosis code and type but different providers. We could say this isn’t a dupe in this case, and add provider ID to the ‘distinguishing factors’.

I always thought of a diagnosis as a ‘moment in time’ and not something that would have a duration, but the CDM calls for a start and end of the condition. So maybe it’s up to the ETL to collapse these things into a continuous duration of the condition, but you run into issues when multiple providers are in the mix (there’s only 1 provider per condition occurrence record).

I think you bring up a lot of great points that probably are only a consideration during the design of a specific study/analysis, so perhaps it makes sense to leave the granular details in the tables and let the researcher decide how to ‘roll-up’ multiple records into a single event.

Part of my interest in this conversation is how CIRCE cohort expressions are built. On one hand, I want to try to enforce some rules so that the cohort criteria have expected results. On the other hand, there isn’t a one-size-fits-all solution for all cases. What I’m leaning toward, based on your feedback, @jenniferduryea, is that the condition records selected by the criteria should support settings that allow a ‘group by’ based on the start/end, provider, condition_type, etc., leaving it to the researcher to decide what constitutes a distinct diagnosis for their research question. The devil is in the details, tho, so I don’t have a specific implementation in mind, but this information is very helpful.
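
To illustrate the idea of a researcher-chosen ‘group by’, here is a small Python sketch. It is not how CIRCE works or will work; the function name and the dict-based records are assumptions made purely for illustration.

```python
from collections import defaultdict


def group_conditions(records, group_by):
    """Group condition records by whichever columns the researcher treats as identifying."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(col) for col in group_by)
        groups[key].append(rec)
    return dict(groups)
```

Grouping on (condition_concept_id, condition_type_concept_id) treats two providers reporting the same diagnosis as one event, while adding provider_id to group_by keeps them distinct. The choice stays with the analysis rather than the ETL.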

-Chris

I cannot agree MORE with this statement. YES to this!

US health claims are mostly like this: https://github.com/OHDSI/Vocabulary-v5.0/issues/156 Of course, data providers may transform data into any form they want. Inpatient vs outpatient indicators are derived information, obtained through calculation by data providers.

Instead of condition_type, I think we should focus on enhancing visit_type_concept_id. For US claims, I think the Themis convention for condition_type_concept_id should be an increment on top of visit_type_concept_id, saying whether the condition is an admitting dx, primary dx, or secondary dx. The actual position (2nd, 3rd, etc.) is not important and mostly arbitrary. The relative numeric rank among secondary dx does not represent acuity or importance of diagnosis. Don’t mix place of service with condition_type.

I don’t see value in keeping duplicates among secondary dx, but I think primary, secondary, and admitting dx should not be deduplicated against one another. Duplicates may be a data quality consideration for Achilles Heel. Leave deduplicating to the analytic side, not the ETL side.

Recommended enhancements to visit_type_concept_id and reducing the complexity of condition_type_concept_id

I’m not following you, @Gowtham_Rao. This is one visit from our EHR record. How do you suggest we enhance the visit_type_concept_id concept set to differentiate between billing and encounter diagnoses?

Looks like I really stepped in it with this THEMIS recommendation. :sweat_smile: Great points made.

While I haven’t discussed this with THEMIS WG3, I may propose to back out this recommendation. I think @jenniferduryea has highlighted for me that this recommendation might lead to more harm than good, and @MPhilofsky’s example is a shining example of where this recommendation is dangerous. I think @Gowtham_Rao says it best here . . .

I’m still interested in if others have feedback and I will take this thread to our next THEMIS WG3 meeting.

We need two visit_type_concept_id’s that are descendants of 44818518 Visit derived from EHR record http://www.ohdsi.org/web/atlas/#/concept/44818518

Visit derived from EHR billing record
Visit derived from EHR encounter record

Note: hierarchy among _type_concept_id
I have updated the issue request here https://github.com/OHDSI/Vocabulary-v5.0/issues/156

@Chris_Knoll this makes https://github.com/OHDSI/Atlas/issues/521 important, because we are adding descendants/hierarchy to _type_concept_id

@Gowtham_Rao,

We don’t want to create two visit records from one encounter. We only have one visit. We want to leave the distinction as close to the source as possible. We want to distinguish the source of the condition, either it came from a billing table or it came from an encounter table.

Yup - the construct of the visit is complex. Generally a visit is defined as the unique combination of person_id, visit_start_date, care_site_id.

What I have seen consistently re-deliberated is whether this definition of a visit should be handled at ETL time or at analytic time. I believe it should be at analytic time, and the ETL should keep as much provenance to the source data as possible, preserving record-level referential integrity. Otherwise, the assumptions made during ETL will propagate to all downstream analyses and make it difficult to generalize the findings, as the assumptions are not overtly stated.

In your particular case, I don’t know the answer, because it depends on how the source system is handling the data. If it is data from two different source systems for the same person (i.e. billing system and encounter system), then I think it should be two different records in visit_occurrence, because that will allow for lineage to the source.

I agree with this completely. However, I know that others have clearly stated that OMOP is an analysis data model, where these decisions should be baked into the ETL. It would be good to get a definitive opinion on this.

On a related note, there needs to be an unambiguous definition of a visit. You could argue that provider needs to be part of the definition. I believe THEMIS is working on this?

Friends:

Apart from the Visit problem: Did you end up with a compromise with the deduping? Or a recommendation? Did you collect USE CASES?

@Christian_Reich - I think we landed on not making any recommendation about deduping, as the potential for harm is greater than the small gains you could make by cleaning up the duplication.

Let me be more clear about the conclusion to this thread:

We will not be making a recommendation to deduplicate diagnoses. We believe the opportunity for error is higher than what would be gained.
