Overlap/duplications between network sites

felixw · November 9, 2021, 1:25pm

Dear Community,

I have a question regarding duplicate data between sites: If, in a network study, different sites have duplicate data, how can we deal with that?
By overlap I mean, for example, that two sites hold data that partially refer to the same individuals.
Since this duplication would clearly distort data analysis, are there (1) analyses of the potential impact of such duplication and (2) statistical methods to work around this issue?

Thank you very much

Felix

Christian_Reich · November 10, 2021, 11:10am

@felixw:

I agree with the problem description.

Solution: This can only be overcome by linking data assets together through identifiable information (stored outside the OMOP CDM). Since this is done rarely, and as far as I know only between large commercial databases (Optum claims to Optum EHR, IQVIA claims to IQVIA EHR and Hospital Charge Data Master), we really don’t know the impact in real studies. A patient being in two databases may or may not have an effect on the result of a study, depending on how the cohorts are defined and how data capture differs between the databases. It is a worthwhile exercise, but not easy.
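
To illustrate the linking idea, here is a minimal sketch, assuming both sites hold a common identifier and can agree on a secret key out of band. The function names, the key, and the sample identifiers are all hypothetical, and this is not how any of the databases named above actually link their data. Each site turns its identifiers into keyed hashes, and only the hashes are compared to count the overlap.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, shared_key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256) of a normalized identifier."""
    # Normalize so that trivial formatting differences do not hide a match.
    normalized = identifier.strip().lower()
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def overlap_tokens(site_a_ids, site_b_ids, shared_key: bytes) -> set:
    """Return the set of tokens that occur at both sites."""
    tokens_a = {pseudonymize(i, shared_key) for i in site_a_ids}
    tokens_b = {pseudonymize(i, shared_key) for i in site_b_ids}
    return tokens_a & tokens_b


# Hypothetical identifiers, for illustration only.
site_a = ["ID-001", "ID-002", "ID-003"]
site_b = ["id-003 ", "ID-004"]
shared = overlap_tokens(site_a, site_b, b"hypothetical-secret")
print(f"{len(shared)} individual(s) appear at both sites")
```

Exact matching like this only works when both sites record the same identifier. Without one, linkage has to fall back on comparing other attributes, which is considerably harder.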

felixw · November 12, 2021, 6:28am

@Christian_Reich Thank you for the answer.
I realise that it is complicated. I would be specifically interested to know how the EHDEN community deals with this in practice. Is it simply ignored and has it never been a point of discussion, or is it at least considered for each study, in relation to the specific datasets and research questions, whether there could be an impact and what that impact might be? If, for example, two institutions from the same city provide data for a study, then a biasing effect would not be unlikely, depending on the specific analysis. It seems improbable to me that this has never been an issue or has never been discussed, for example in publications. Or have such constellations simply never occurred so far, so that it really has never been an issue?

Christian_Reich · November 12, 2021, 12:14pm

Oh boy. In Europe, the problem is immediately much larger, because you’d need the consent of the data owners (the patients) for the data to be passed on for linking purposes. If there is no passing on and the data are already in the same place, you can do it, though. If identifiable information is missing, people have done linking by comparing the actual anonymized data. But nothing is easy.

The only lasting solution I can see is to create patient-driven repositories of data. The patients can go to the institutions and demand their data, or demand their data to be placed somewhere according to GDPR. But then - you need those patient associations to do that.

prasser · November 14, 2021, 9:04pm

@Christian_Reich Thanks a lot for your answers. I’m jumping in because I think you’re talking a little bit past each other. We (Felix is on my team) are fully aware of the challenges surrounding record linkage, particularly in Europe (EHDEN was a mistake; we meant to write OHDSI).

Our question is not about how the problem can be solved, but whether the issue and its consequences have already been discussed in the OHDSI community and to what extent.

For example, I could imagine that there might have been critical questions from reviewers on this topic in publications. We looked at some OHDSI publications but could not find any coverage of the topic, for example, in the discussion sections. Another possibility would be that thought is given to this when questions or statistical methods are being selected as part of study planning in OHDSI.

Could you provide us with a little info on this or point us to further material on the topic?