OHDSI Home | Forums | Wiki | Github

Worldwide datathon with french hospitals dataset

(Adrien Parrot) #1

Do you want to work with real-life OMOP dataset ?
InterCHU group of French hospitals is launching a wordwide 2-month datathon!
From synthetic dataset in OMOP format (1000 patients from synPUF) you will create an algo to manage duplicate patients that is then tested on real-life dataset from partner hospitals.
Here the website : interchu.frama.io/website
Here the twiiter account : https://twitter.com/interchu

(Seng Chan You) #2

Wow, this is awesome!! @Adrien_PARROT

(Selva Muthu Kumaran Sathappan) #3


I am not sure whether I got the task right. You want us to write an algorithm/code that can help us identify duplicate patient from a dataset? Is that all? Looks like I am missing something here.

(Adrien Parrot) #4

Yes that’s all! You think it’s too easy?
Hope to see you on our forum to register you.

(belay birlie) #5

Great challenge! Thanks for sharing. I hope, a combination of exact and probabilistic matching will do the job.

(Selva Muthu Kumaran Sathappan) #6


Can you provide an example scenario? Usually patients have ID to identify them. But if you are going to use Name, can you provide us an example on what you actually expect so that it would be easy to understand and code for us? I did refer your forum and register but still require some examples to understand

(Dmytry Dymshyts) #7

the case is: “the patient is not well identified (name misprint, human error…)”
So it can be, let’s say
Selva Muthu Kumaran Sathappan
Silva Muthu Kumaran Satapan
sorry for misspelling your name :slight_smile:
Personaly, I used to it, my name can be Dmitry, Dimitry, Dmytry, Dima, Dmytro (the last one is my name in passport)
and then the system treats these as two separate patients.
And then you deal with de-identified data, so you can’t run some string matching by names.

At least it’s how I understand the case from its description, need to see the actual data though.

(Adrien Parrot) #8

Sorry it is not clear.

Thanks. You’re absolutly right. But there is an additional difficulty for analysts. We use a synthetic data set (SynPUF) and there are no duplicate patient in this dataset. It only allows participants who do not know OMOP to discover the model more easily.

I will give you another example.
Sometimes the dates and months are reversed. Thus the date of birth is written at 02/01/1973 while the patient was born on 01/02/1973. Potentially several patients can be generated when there is only one.

As @Dymshyts says, some databases do not have the identifying information present (this is the case for the centralized French healthcare system called SNDS). We need to invent a way to find duplicate patients with procedures or diagnoses.

(Christian Reich) #9

This is the key, @Adrien_PARROT. And I only realized that when I was told by the folks from Lyon. This is matching of patients not by identifiable information (name, dob, some person ID like the Social Security Number), telephone number), but by the clinical data. This is actually cool. Are you looking for some probabilistic method, or heuristic? Are you going to use the concept hierarchy to help you with that?

(Adrien Parrot) #10

Yes there are two challenges in one

  • identity duplicate patients by identifiable information (hospitals)
  • identify duplicate patients not by identifiable information (SNDS)

Actually the participants will choose their method and let the best win !!! :slight_smile: It is possible to use concept hierarchy.

(Christian Reich) #11

The best should win? Where did you get that silly idea from? :slight_smile:

This is cool. Please make this transparent and Github it, so people understand the challenges and the solutions.

(Adrien Parrot) #12

The best should win but there are nothing to win. Only glory :slight_smile:
I think this is transparent on the website and on the git.
Is it good with you?