How do you represent test patients in the CDM at your site?

Vojtech_Huser · December 24, 2014, 2:51pm

It is always helpful when a query produces data that are
guaranteed to be free of protected health information. Many EHR-based
(non-claims) warehouses contain test patients, and some have a way of tagging
them. Test patients are also useful for seeing how facts entered on an EHR
screen end up in the warehouse (that is, for seeing the effect of the ETL on
the data).

Perhaps the CDM should allow tagging certain patients (person_ids) as test patients.

A query that returns only dummy test-patient data
at every CDM site could be useful.

We should not put meaning into an ID, but one very simple option is that
person_id = 0 is always a test patient (we already use a similar trick once,
with vocabulary_id = 0). I do not think this approach is the best, since it would allow only one test patient.

A more flexible way might be to define a “test-patient” location in the LOCATION table (perhaps with
location_id=0) and assign this location to test patients. This would allow multiple test patients (we have many unreal, dummy test patients in our NIH CC BTRIS warehouse). In the CDM, the PERSON table
has a location_id column that refers to the LOCATION table.
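
To make the location-based idea concrete, here is a minimal Python sketch using an in-memory SQLite database. Only the link from person.location_id to the LOCATION table comes from the CDM; the constant TEST_LOCATION_ID, the sample rows, and the helper function names are assumptions for illustration, and both tables are trimmed to a couple of columns.

```python
import sqlite3

# Assumed convention from this post, not an official CDM rule.
TEST_LOCATION_ID = 0


def build_demo_db():
    """Create a trimmed, in-memory stand-in for the LOCATION and PERSON tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE location (
            location_id INTEGER PRIMARY KEY,
            location_source_value TEXT
        );
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            location_id INTEGER REFERENCES location (location_id)
        );
        INSERT INTO location VALUES (0, 'Test-Patient'), (1, 'Some real location');
        INSERT INTO person VALUES (101, 1), (102, 0), (103, 0), (104, NULL);
        """
    )
    return conn


def list_test_persons(conn):
    """Return person_ids tagged with the test-patient location."""
    rows = conn.execute(
        "SELECT person_id FROM person WHERE location_id = ? ORDER BY person_id",
        (TEST_LOCATION_ID,),
    )
    return [person_id for (person_id,) in rows]


def list_real_persons(conn):
    """Return person_ids not tagged as test patients.

    A missing location_id is handled explicitly, because
    `location_id <> 0` is not true for NULL in SQL.
    """
    rows = conn.execute(
        "SELECT person_id FROM person "
        "WHERE location_id IS NULL OR location_id <> ? ORDER BY person_id",
        (TEST_LOCATION_ID,),
    )
    return [person_id for (person_id,) in rows]
```

With the four sample rows, list_test_persons returns 102 and 103, while list_real_persons returns 101 and 104. The explicit NULL check matters: a filter written only as location_id <> 0 would silently drop real patients who have no location recorded.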

(The “cleanest” approach would obviously be a test-patient flag in the schema (PERSON table), but that requires a schema change rather than just a new “CDM convention rule”.)

Do you currently allow test patients into your CDM data view?
Do you see value in keeping them in the dataset?
Do you think we should have the same way of tagging test patients across sites?

jon_duke · December 24, 2014, 5:23pm

Vojtech,

Excellent point to raise, as we certainly do have test patients in our dataset that are not cleanly defined. I think the optimal solution is to use a convention, as you’ve said, rather than a schema change. While I’m not in love with the idea of using Location, it is probably the most benign approach (compared with coded concepts such as gender / race / ethnicity). Ideally it would, as you say, use a standard integer value such as location_id=0 so that it doesn’t matter if people call the location Test-Patient or Test Location or Test Whatever.

Certainly all kinds of magic things could be done with fact_relationship as well, but that seems like it might introduce more complexity to queries than necessary. @Christian_Reich could chime in on that.

So in short, I think your reasoning is solid and recommendation is sensible.

Jon

Christian_Reich · December 24, 2014, 6:09pm

Friends:

Why do you need to have your test patient(s) mixed with the real? Wouldn’t it make more sense to have them in a separate database called “test”, where all patients are fake?

Christian

Patrick_Ryan · December 24, 2014, 8:25pm

I agree with Christian: I don’t think we should allow ‘fake’ data into the real database that is going to be used for research. The standardized analytics we develop will not expect to have to filter out ‘invalid’ patients, and it seems unreasonable to demand that all analysts check for bogus records before they conduct an analysis. I would propose the community-wide convention that if a source database contains ‘invalid’ patients, they be removed during the ETL into the OMOP CDM instance.

Cheers,

Patrick

jon_duke · December 24, 2014, 9:46pm

The numbers are small in any case, but I am certain that every source database (even commercial ones) contains test patients. Electronic health data are generated by interfaces that are tested at some point by developers who are not thinking about secondary use: new lab interfaces coming online, EMR training patients, pharmacies testing their claims systems, whatever it might be.

I am all good with a convention of quarantining into a separate CDM or just removing identifiable test patients, but I think we need to accept the reality that some fractional amount of fake data lives within every observational dataset. Some are easily identifiable and some are not, so that is one more opportunity for the Heroes of OHDSI (should that be in someone’s wheelhouse).
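
The quarantine-or-remove convention could look roughly like the following Python sketch. The row layout, the source_test_flag and last_name fields, and the ZZTEST naming rule are all hypothetical; each site would substitute whatever rule actually identifies its test patients.

```python
def looks_like_test_patient(row):
    """Hypothetical site rule: an explicit source flag or a reserved name prefix."""
    flagged = bool(row.get("source_test_flag"))
    reserved_name = (row.get("last_name") or "").upper().startswith("ZZTEST")
    return flagged or reserved_name


def split_persons(person_rows, is_test=looks_like_test_patient):
    """Partition source persons into (real, quarantined) lists."""
    real, quarantined = [], []
    for row in person_rows:
        (quarantined if is_test(row) else real).append(row)
    return real, quarantined


def keep_events_for(event_rows, person_rows):
    """Keep only event rows whose person_id belongs to the given persons."""
    person_ids = {row["person_id"] for row in person_rows}
    return [event for event in event_rows if event["person_id"] in person_ids]
```

The real list would feed the research CDM, and the quarantined list could either be dropped or loaded into a separate test CDM. Event tables need the same treatment as PERSON, which is what keep_events_for is meant to suggest. Test patients that no rule can recognize will still slip through.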

Christian_Reich · December 28, 2014, 12:50pm

Gents:

At the end of the day, this is ETL land. Only there can you identify test patients. There are 3 scenarios:

1. You can identify them and want to remove them: do that. No CDM involvement. That’s what I would do if I did an ETL.
2. You can identify them and don’t want to remove them for whatever reason.
   If there is only one, the convention could indeed be to assign person_id=0 to that patient. But that wouldn’t work if there are many, and in those cases, you can assign a block of pre-specified person_ids. However, that cannot be a general CDM convention, as the content of the person_id field is up to the ETL makers (as long as it is an integer).
   I am not sure we could use the FACT_RELATIONSHIP table. What other record would you link it to, and in which table? There is no place other than the PERSON table to store general information about a patient, and “fake” or “real” have nowhere to go there. We have no person_concept_id field to put anything about the patient in. Do we need that? For any other use case than the fakes?
3. You cannot identify them. Too bad. Nothing we can do. Hopefully, they are few among a great many real patients.

Let us know, in particular, if there is a need for any other general patient characterization that we would want to standardize.

Happy New Year!

wstephens · December 29, 2014, 5:27pm

This mix of test and real patients is a concern and I agree with @Christian_Reich that it is an ETL issue. This is why we typically require a definitive list of patient identifiers (or SQL that defines the cohort) that our ETL process leverages as the initial inclusion criteria. This makes the research organization the responsible party for knowing which patients are to be included in the conversion to OMOP.

Bill
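
An inclusion-list step of the kind described above might be sketched in Python as follows. The patient_id field name and the function itself are assumptions for illustration, not a description of any actual ETL process.

```python
def apply_inclusion_list(source_rows, included_ids):
    """Keep only source patients named on the inclusion list.

    Returns (kept_rows, missing_ids), where missing_ids are listed
    identifiers that never appeared in the source data.
    """
    included = set(included_ids)
    kept = [row for row in source_rows if row["patient_id"] in included]
    seen = {row["patient_id"] for row in source_rows}
    missing = sorted(included - seen)
    return kept, missing
```

Because anything not on the list is excluded by default, test patients never need to be recognized by the ETL itself. Reporting the listed identifiers that were not found gives the research organization a way to check its own list.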