OHDSI Home | Forums | Wiki | Github

How to handle missing dates in all CDM tables?

Hello,

I am interested to know from OHDSI veterans here on how they handle missing dates under different domain tables?

For example, we have millions of measurement data, but I see we don’t have measurement date for 10000+ records. Not sure whether it is negligible enough

Similarly, for drug tables as well. We don’t have drug exposure start /end dates for drugs

However they are not major but definitely something which often creates some issues during Achilles execution etc

Is there any logic that you people follow to fill in the missing dates under different domain tables?

For example, visit end date column has a logic based on patient visit type. Similarly do we have any logic to fill missing dates for drug, measurement and condition domains?

Can help us with this?

2 Likes

If you know a measurement was performed at a visit, you could potentially infer the measurement date from the visit it was captured at. Possibly you could do the same with drug exposures.

From a CDM perspective, start dates are critical. if you don’t have a start date, you can’t create the patient journey through your healthcare system. If you want to say that you can only do non-temporal type of analysis (like a logistic regression can be done, but not a cox hazard since cox depends on survival time, and without dates, you have no time), then I suppose it is possible that you could define all your observations to appear on exactly all the same dates, and then do custom analysis on the presence of a feature and the presence of an outcome in the same person…but would strongly discourage trying to ‘mix’ patients with dates with patients without dates. If you don’t have a minimum viable data for a patient, I’d exclude them.

All Atlas tools are heavily dependent on start dates, so if you don’t have them, you’ll be severely limited.

End dates, on the other hand, are a little more forgiving. In many cases you can just default the end date to be start_date + 1 day. In some of our tools, you can override an event’s end date by offsetting the event’s start date…so you could define an exposure cohort as looking at the drug exposure start_date, and assume a 30d persistence window of the exposure by using a 30d offset for the cohort exit. For Incidence and Characterization, we’re usually looking at things that start relative to cohort index (which is the cohort start date).

Whatever decision you make, you should declare those assumptions and imputations in any evidence you produce from the data.

1 Like

Hi @SELVA_MUTHU_KUMARAN,

You can use @Chris_Knoll 's logic above, but generally it is best to toss these records or create a record in the Observation table for ‘past medical history’ or ‘past surgical history’ for the Condition or Procedure. You know the Person had it in the past, you just don’t know the day, so it might be useful criteria for a study. You could use the date of the Visit or the date of the data extract for the Observation start date.

Measurements are tricky because the Measurement doesn’t always occur on the same day as the Visit. Many times Persons have the Measurement done a week before the Visit, so the Provider can review results and assign a course of treatment. Other times the Measurement is done on the day of the Visit. And sometimes the Measurement is taken days after the Visit because the Person needs to do certain things (like not eat for 12hrs) before a Measurement can be taken. I say toss the dateless Measurement records. The timeline and order for clinical events are very important.

If an end date can be NULL (i.e. condition_end_date) and your data does not contain an end date, then leave it NULL.

Now for drug end_dates, see the conversation here. If the drug record is an “administered drug” record, then the end date = start date. There may possibly be very few cases where this is not true (example: iv infusion started before midnight and ending after midnight), but it’s such a small percentage that it won’t affect analysis. If the assumption concerns you, then toss out all iv infusion records without an end date. For other drug end dates you could derive them by start date + days supply = end date or other calculations using the start date (has to be populated) and the quantity, dose, sig, etc.

Condition and Device end_dates are not required, so leave them NULL if you don’t have an end date. Measurement and Procedure do not have end dates.

1 Like

Hello @MPhilofsky,

Another quick question related to this. not related to dates but related to visit_occurrence_ids

Are all lab/condition/drug records always have an associated visit? I understand ideally we expect it to be tied to a visit.

Because I see in our source data, there are lab/drug/condition records which stand isolated from our visit records.

Meaning patient A has visited hospital as outpatient on Jan 1st 2010 and left on the same day. Similarly he has 5 different outpatient visits on different days (ex: Jan 1st, Feb 21st, Mar 23rd, Apr 25th, Jun 26th) and he has corresponding lab/drugs/conditions data for these 5 visits…

But if I dig further his lab records, I see that he has records associated with 10 unique visit_ids (visit_occurrence_ids) implying 10 visits to hospital. So it’s like we have additional lab information for 5 more visits which aren’t recorded in the visit table at all. And these visit times are away from the recorded 5 visits by more number of days like 2 months, 3 months etc

Have you encountered in your experience where records outside/not related to visit_ids in the visit table are present in other domain tables like lab, drugs, conditions etc.?

If yes, how do you handle such scenarios? Can you share some suggestions on how to address this?

@Chris_Knoll - Another question related to this w.rt to Atlas part.

Does Atlas/OHDSI tools consider visit_occurrence_ids while running the queries in back end?

Meaning, for the records which are outside (by more number of days like 2 months, 3 months) of recorded visit, I am planning to leave the visit_ids as null (in domain tables like measurement, condition, drug_exposure). For rest of the records, I will anyway have the visit_ids from visit_occurrence table because they are present in lab as well as in visit table (as given by source data).

So will ATLAS while running cohort generation (with visit_occurrence domain in criteria), I guess it will consider visit_occurrence_id to join records as well.

If yes, then do you recommend to drop these records (which are present in lab with a visit_id but not present in visit table) ?

I have reached out to data source owners as well but just wanted to check and know from you veterans whether it is common issue, how do you address this while data transformation or it’s unusual thing happening only with our data?

The only place we look for visit_occurrence_id is when you look for a correlated criteria (where you say Having at lest X of {domain observation data} between X days before and Y days after), and you specify that the associated observation should occur ‘at the same visit’.

Otherwise, we’ll just consider the visit dates in the correlated criteria, and ignore anythign about the visit_occurrence_id in the domain tables (condition, drug, procedure)

1 Like

Hello @SELVA_MUTHU_KUMARAN,

No, sometimes a lab Measurement, Drug dispense, or other clinical event happens outside of the Person-Provider Visit. And that’s ok.

I am not aware of a use case that would require all clinical events to be linked to a Visit. So, we don’t worry about the lack of a visit_occurrence_id in a clinical event table. Do your researchers have a use case for this?

1 Like

@MPhilofsky - No use case as such. I wasn’t sure whether it is usual to have clinical data not tied to any visit (in data source). So I was trying to learn on how to address this issue. useful to know your suggestions on this.

@Chris_Knoll - I believe you meant the below one. Thanks for taking time to help me on this

image

In our project we are missing required date field like measurement_date, visit_start_date, condition_start_date, procedure_date, etc. Which method is very efficient and effective to handle this problem? We have came up with few ways mentioned below. Could you guys please help me?

  1. Ignore these such records.
  2. Use the default date (1800-01-01).
  3. Find max date from other tables like procedure_occurrence, condition_occurrence, measurement, device_exposure, etc.

If your procedure/condition/measurement/device record has a FK to a visit/encounter record, then use the date of the visit/encounter as a good proxy for when the procedure/condition/measurement/device event occurred. This can be done bi-directionally, so if the visit/encounter record has a FK to a procedure/condition/measurement/device record, then use the date of the procedure/condition/measurement/device record for the visit date. IF not, ignore the records and don’t bring them into the CDM.

Never do this. Nobody on this earth was alive in 1800 and we sure weren’t recording their clinical events in some type of electronic storage. Most research studies are looking for what clinical event happens before or after X event. Accurate dates are very important.

Don’t do this. Accurate dates are important.

I understand the importance of knowing temporal sequence, but struggle with the idea of dumping records with missing event dates - the information may be be better used than ignored in some situations. Say I want a cohort of new users of treatment Y who are naive to treatment X, or don’t have indication Z. I could take a conservative approach and exclude patients with a prescription for drug X, or condition Z even though the relevant event date is not known.

In our UK primary care EHR database, we define the Observation Period by patients’ registration start and end dates, but the EHR record can have a lot of medical history with event dates stretching back many years before the start of registration. This information does sometimes get used in the way described above. So back to the missing dates, would it (ever) be appropriate to impute a date - for example 1 day before the Observation Period start date - so that the investigator has the option to use that information should they wish (i.e. by including information outside of the Observation Period). Or would such an approach risk introducing a lot of confusion for limited / undefined benefit?

I think you can safely use the clinical events with missing dates as exclusion criteria only. As in “exclude any and all persons with X at any time”. With X being the clinical event missing the date.

If you believe your data are accurate and the clinical event which lacks a date really did happen, then you could put these data into the Observation table with other “medical history” data. The date of the Observation record is the date the data were recorded. So, this would be the date of the ETL. The observation_concept_id = 1340204, History of event. And the observation.value_as_concept_id is the clinical event which lacks the date. Make sure you document all these decisions in an easily readable specification document, so the end users know you included these data and where and how to find them

@MPhilofsky do you know if these recommendations were ever discussed further and/or recorded as conventions anywhere? if no (and if you agree it’s THEMIS-worthy), i’d be happy to “sponsor” the discussion. let me know :slight_smile:

[I checked CDM docs and all I find is this note on observation.observation_start_date: “For some observations the ETL may need to make a choice as to which date to choose.” Also don’t see anything on this topic in the THEMIS repo.]

Hello @katy-sadowski,

Yes, please sponsor this as a Themis issue. More conventions, less ambiguity! Ping me if you have any questions on the Themis process.

:tada: will do! thanks!

t