Worldwide datathon with French hospitals dataset

Adrien_PARROT · July 12, 2019, 9:47am

Do you want to work with a real-life OMOP dataset?
The InterCHU group of French hospitals is launching a worldwide 2-month datathon!
Starting from a synthetic dataset in OMOP format (1000 patients from SynPUF), you will create an algorithm to manage duplicate patients, which is then tested on real-life datasets from partner hospitals.
Here is the website: interchu.frama.io/website
Here is the Twitter account: https://twitter.com/interchu

SCYou · July 12, 2019, 3:02pm

Wow, this is awesome!! @Adrien_PARROT

SELVA_MUTHU_KUMARAN · July 23, 2019, 2:51am

Hello

I am not sure whether I got the task right. You want us to write an algorithm/code that can help identify duplicate patients in a dataset? Is that all? It looks like I am missing something here.

Adrien_PARROT · July 23, 2019, 11:58am

Hi.
Yes that’s all! You think it’s too easy?
Hope to see you on our forum to register you.

belay · July 24, 2019, 8:18am

Great challenge! Thanks for sharing. I hope a combination of exact and probabilistic matching will do the job.
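
To make the idea concrete, here is a minimal sketch of combining an exact check with a fuzzy one. It is not the datathon's method: the record fields (`name`, `birth_date`), the weights and the threshold are all assumptions for illustration, and the weighted score is only a simple stand-in for a real probabilistic linkage model.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Return 1.0 for an exact match, otherwise a weighted fuzzy score."""
    if rec_a == rec_b:  # exact matching step
        return 1.0
    name = name_similarity(rec_a["name"], rec_b["name"])
    dob = 1.0 if rec_a["birth_date"] == rec_b["birth_date"] else 0.0
    # Hypothetical weights; a real model would estimate them from data.
    return 0.6 * name + 0.4 * dob


def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Flag a pair as a likely duplicate when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold


# Made-up records, for illustration only.
a = {"name": "Jean Dupont", "birth_date": "1973-02-01"}
b = {"name": "Jean Dupond", "birth_date": "1973-02-01"}
c = {"name": "Marie Curie", "birth_date": "1980-05-05"}
print(is_probable_duplicate(a, b), is_probable_duplicate(a, c))
```

The threshold controls the trade-off between missed duplicates and false merges, so it would need tuning on labelled pairs.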

SELVA_MUTHU_KUMARAN · July 24, 2019, 8:56am

Hi,

Can you provide an example scenario? Usually patients have an ID to identify them. But if you are going to use the name, can you give us an example of what you actually expect, so that it is easier for us to understand and code? I did look at your forum and register, but I still need some examples to understand.

Dymshyts · July 24, 2019, 10:08am

See
the case is: “the patient is not well identified (name misprint, human error…)”
So it can be, let’s say
Selva Muthu Kumaran Sathappan
and
Silva Muthu Kumaran Satapan
sorry for misspelling your name
Personally, I am used to it: my name can be Dmitry, Dimitry, Dmytry, Dima, Dmytro (the last one is the name in my passport)
and then the system treats these as two separate patients.
And then you deal with de-identified data, so you can’t run string matching on names.

At least that’s how I understand the case from its description; I need to see the actual data though.

Adrien_PARROT · July 24, 2019, 12:12pm

@SELVA_MUTHU_KUMARAN
Sorry it was not clear.

@Dymshyts
Thanks. You’re absolutely right. But there is an additional difficulty for analysts. We use a synthetic dataset (SynPUF) and there are no duplicate patients in it. It only allows participants who do not know OMOP to discover the model more easily.

I will give you another example.
Sometimes the day and month are swapped. For instance, the date of birth is recorded as 02/01/1973 while the patient was born on 01/02/1973. As a result, several patient records can be created when there is only one patient.
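
A check for this kind of error could look like the sketch below. It only illustrates the idea; the function names are hypothetical and not part of any datathon code.

```python
from datetime import date
from typing import Optional


def swap_day_month(d: date) -> Optional[date]:
    """Return d with day and month swapped, or None if that is not a valid date."""
    try:
        return date(d.year, d.day, d.month)
    except ValueError:  # e.g. day 25 cannot become a month
        return None


def birth_dates_compatible(a: date, b: date) -> bool:
    """True if the dates are equal or differ only by a day/month swap."""
    return a == b or swap_day_month(a) == b


# The example above: 1 February 1973 vs 2 January 1973.
print(birth_dates_compatible(date(1973, 2, 1), date(1973, 1, 2)))
```

Two records with compatible birth dates are only candidates; other fields would still have to agree before merging them.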

As @Dymshyts says, some databases do not contain identifying information (this is the case for the centralized French healthcare system called SNDS). We need to invent a way to find duplicate patients from procedures or diagnoses.
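
One possible starting point, given only as a sketch: represent each patient as a set of (concept, date) events and compare patients with Jaccard similarity. The integer codes, the dates and the 0.8 threshold below are invented for the example, and comparing every pair is only practical on small datasets.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_duplicates(patients: dict, threshold: float = 0.8) -> list:
    """Return (id_a, id_b, score) for every pair whose event sets overlap enough."""
    ids = sorted(patients)
    pairs = []
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            score = jaccard(patients[p], patients[q])
            if score >= threshold:
                pairs.append((p, q, score))
    return pairs


# Toy data: person_id -> set of (made-up concept code, event date).
patients = {
    1: {(101, "2019-01-05"), (202, "2019-01-05"), (303, "2019-03-10"),
        (404, "2019-04-01"), (505, "2019-05-01")},
    2: {(101, "2019-01-05"), (202, "2019-01-05"), (303, "2019-03-10"),
        (404, "2019-04-01")},
    3: {(606, "2018-11-20"), (707, "2018-12-02")},
}
print(candidate_duplicates(patients))
```

Rare events are more informative than common ones, so weighting events by how infrequent they are would likely improve on this plain overlap.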

Christian_Reich · July 26, 2019, 1:16pm

This is the key, @Adrien_PARROT. And I only realized that when I was told by the folks from Lyon. This is matching of patients not by identifiable information (name, DOB, some person ID like the Social Security Number, telephone number), but by the clinical data. This is actually cool. Are you looking for some probabilistic method, or a heuristic? Are you going to use the concept hierarchy to help you with that?

Adrien_PARROT · July 26, 2019, 2:44pm

Yes, there are two challenges in one:

identify duplicate patients using identifiable information (hospitals)
identify duplicate patients without identifiable information (SNDS)

Actually, the participants will choose their method, and let the best win! It is possible to use the concept hierarchy.
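
As an illustration of how the hierarchy could help, the sketch below rolls each code up to its parent before comparing two patients, so that two sibling codes for the same condition count as a match. The `parent_of` mapping and all codes are toy values; in an OMOP database such relationships would come from the vocabulary tables (for example `concept_ancestor`), which this example does not query.

```python
def roll_up(concept_id: int, parent_of: dict, levels: int = 1) -> int:
    """Climb up to `levels` steps in the hierarchy, stopping at the top."""
    for _ in range(levels):
        if concept_id not in parent_of:
            break
        concept_id = parent_of[concept_id]
    return concept_id


def rolled_up_codes(codes: set, parent_of: dict, levels: int = 1) -> set:
    """Replace every code by its ancestor `levels` steps above it."""
    return {roll_up(c, parent_of, levels) for c in codes}


# Toy hierarchy: codes 11 and 12 are both children of 1; 21 is a child of 2.
parent_of = {11: 1, 12: 1, 21: 2}
patient_a = {11, 21}
patient_b = {12, 21}

print(patient_a & patient_b)  # overlap on the raw codes
print(rolled_up_codes(patient_a, parent_of) == rolled_up_codes(patient_b, parent_of))
```

Rolling up too far makes unrelated patients look alike, so the number of levels is a parameter to tune.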

Christian_Reich · July 26, 2019, 2:49pm

The best should win? Where did you get that silly idea from?

This is cool. Please make this transparent and Github it, so people understand the challenges and the solutions.

Adrien_PARROT · July 28, 2019, 7:29am

The best should win, but there is nothing to win. Only glory.
I think this is transparent on the website and on the git.
Is that OK with you?