Timing of time

hripcsa · September 30, 2016, 1:23am

What is the expected timing of adding time to the CDM? Here is what I believe is the proposal:

Keep all _DATE fields as is, still in DATE format.

AND

Add _TIME fields to all places where there are _DATE fields, and make those _TIME fields be of type DATETIME, meaning it contains the date and time.

The _DATE fields will remain REQUIRED non-null fields.

The _TIME fields will be OPTIONAL fields, since many sources do not contain times and we do not want people to inappropriately default to 12:00 or something else.

Also, note, this proposal was submitted simultaneously with the proposal to make all tables have _START_DATE and END_DATE fields for all domains. This means PROCEDURE_OCCURRENCE, MEASUREMENT, OBSERVATION will be changing from having only 1 date to having 2 dates, and both would required (even if in some systems END_DATE = START_DATE).

bailey · September 30, 2016, 2:46am

FWIW, when we added these for PEDSnet, we did decide to default to midnight and require values whenever the date was known, since having fields with consistent timestamp semantics was a greater benefit than not being able to ascertain whether the event actually happened at midnight to the second.

The impact was greatest for person, where there was otherwise no date or datetime field for date of birth. That made computing ages expensive, which was a major handicap for a pediatic network.

hripcsa · September 30, 2016, 1:32pm

Charles, are you saying that in general you kept the DATE column and added a TIMESTAMP column but made the latter required (and that requirement is what differed from the proposal)? Or that you converted all the DATES to TIMESTAMP without adding a new column? I am just trying to figure out if the proposal would have met your needs.

George

bailey · September 30, 2016, 3:03pm

That’s essentially it. The driving use case for is was the need to compute age efficiently, with the secondary use case that sub-date resolution was important for some analyses.

What we ended up doing to solve both was add the *_time fields across the board with timestamp semantics. We left *_date fields untouched because we didn’t want to break the existing model.

I haven’t seen the new proposal in detail, but from your prior message it seems like it’d work for us; we’d just layer on the use of defaults. Can you point me to the details so I can check?

Christian_Reich · September 30, 2016, 5:34pm

Friends:

Can one of you with a strong idea put the proposal in? I made a new slot at #27:

mgkahn · September 30, 2016, 5:55pm

My tweak of George’s original proposal is to call each of the optional new fields X_DT rather than X_TIME so that it is clear that these fields actually how a datetime value whereas _DATE is just a date. The label _TIME could imply that it holds just the time component of a datetime.

rimma · September 30, 2016, 6:45pm

I have an alternative proposal. We did it at NYC CDRN. The idea is to have one date/time field in each table that has date/time data plus an additional field that contains a concept indicating time precision (e.g. ‘month’, ‘minute’). For example, if only year and month were known, the date/time would contain ‘1989/12/01’ and the precision field would contain ‘month’. This design would provide unambiguous information about date/time precision, support data normalization to the same precision, provide foundation for uniform queries.

Would appreciate your feedback.

Chris_Knoll · September 30, 2016, 7:24pm

I’d like to vote for a single field to store both the date and time. It simplifies comparing dates and finding offsets when it’s all in a single field vs. when you have to use 2 separate fields together (unless someone can explain the failure in my thinking)…

Consider the following set of time indexes split into 2 columns: one for date one for time:

eventId      date           time
1            1/1/2010       6:30 AM
2            1/2/2010       3:30 AM
3            1/3/2010       9:30 AM

If you want to find all the events after 1/2/2010 6:00 AM. you can’t simply say 'where eventDateTime > 1/2/2010 6:00AM, you have to say ‘where eventDate > 1/2/2010 or (eventDate = 1/2/2010 and eventTime > 6:00 am)’. It gets even more complicated if you want to try to find the difference in time (say, in hours) between the two dates (exercise left to the reader). To make it easier for myself, i’d always just create a surrogate dateTime value out of the date + time columns in order to do any meaningful work, and if i’m forced to do that everywhere I want to use the columns, wouldn’t it be better to just bundle the date + time together into one field?

-Chris

Christian_Reich · September 30, 2016, 7:34pm

@Chris_Knoll:

The by far largest use case is to calculate differences in dates. Usually these differences are 30 days, 60 days or 180 days, essentially the washout periods and temporal cohort logic. Sometimes they are years (when it’s about age). Right now, all the tools assume date arithmetic to work like this: Date + integer (assumed days) = date.

Will that all break with your idea? Can all the SQLs and SASs of the world do it right?

Sigfried_Gold · September 30, 2016, 7:59pm

Casting my vote for Rimma’s proposal – which is also consistent with what
Chris wants. I think it gives the most power and shouldn’t require changes
in existing code. Maybe someone can let me know if I’m wrong, but I believe
date arithmetic and most other functions can work interchangeably between
date and datetime fields in most RDBMSs.

hripcsa · September 30, 2016, 7:59pm

I think either _TIME or _DATETIME. They may misunderstand DT. George

Patrick_Ryan · September 30, 2016, 8:19pm

Unfortunately @Sigfried_Gold, you are wrong. It turns out different RDBMSs
do handle _DATE and _DATETIME datatypes differently, and the functions that
are used for date and datetime arithmetic are different as well.
Independent from the discussion of the new CDM modification (for which I
will cast my vote for new fields name _TIME and to be of type ‘datetime’),
we still have an outstanding issue that SQLRender cannot handle, which is
how to gracefully handle time-based calculations across environments.

bailey · September 30, 2016, 9:09pm

@Chris_Knoll I agree it’s important that the fields, whatever they’re named, have datetime/timestamp semantics. As @Christian_Reich mentioned, most intervals will be more than 24 hours, and the ability to do computations using a single field will be well worth the redundant date info. (I’ll bet it doesn’t even widen the storage on some RDBMS.)

I’ve no problem with taking @rimma 's approach, either, as long as the semantics of the value are consistently datetime/timestamp. For elements more precise than the source data, did you default to 0 (e.g. ‘2015-01-01’ with ‘day’ resolution was stored as ‘2015-06-01T00:00:00’?

As an aside, after some back-and-forth discussion and experience, we ended up standardizing on local timezone for these fields. That of course means we can’t correctly construct sequences of events that occurred in care_sites in different timezones, but we agreed that’s an edge case.

hripcsa · October 1, 2016, 2:17pm

So here is where I think we are:

We want not to have to change code, at least at this point, and we are pretty sure that a switch from DATE to DATETIME will in fact require changes. This affects @rimma’s proposal and pushes us to keep the *_DATE fields.
No one seems to want a pure TIME field. So whatever we do, it will be DATETIME (AKA timestamp).
On naming, we want something short, but I think DT is too obtuse. So I guess I am ok with staying with *_TIME, but I could be convinced to *_DATETIME .
We could still adopt @rimma’s proposal to add a granularity field for the new DATETIME field. So it would be *_DATE, *_TIME (timestamp), and *_TIMEGRAN or *_TIMEPREC or something. And we could plan for a slow migration to drop the *_DATE fields in the long run. We did get a query yesterday in the implementers group from @JASimmons about year and month granularities. The question is whether the complication is worth the benefit. Does this mean we would want to come up with a semantics for manipulating such timestamps, mainly carrying the precision forward in the calculations? That might be more work than it is worth (e.g., the answer to, “did event XX occur before date DD,” could be MAYBE). Or we could just treat it as if the event occurred at the start of the interval.
This does not address @bailey’s original concern. It sounds like the *_DATE fields were too slow, and that is why he added timestamps. In most RDBMSs I know, DATE is implemented as an integer internally and should be lightning fast to calculate age to the day. I wonder if they got implemented as strings or something.

(I just realized I never answered @bailey with the details. Those were the details, as far as we have gotten.)

George

hripcsa · October 1, 2016, 2:26pm

Oops. Never mind #5. I see now. It was specifically the PERSON table with split out year, month, and day. Thanks.

Patrick_Ryan · October 1, 2016, 2:41pm

I do not like the idea of having to look up granularity in a separate field
in order to make use of data in a different field. From an analytics
perspective, that causes tremendous inefficiencies, as I have to do a
lookup and case statement everytime I want to do date arithmetic (which is
all the time). In the 13 databases I regularly work with, we only receive
dates and I do not expect that we’d ever get times. So while I can
support the addition of optional datetime fields for other use cases like
George and Charlie have laid out, I do not think these trump the more
expected use cases the community already supports or should result in
retiring the existing _date fields throughout the cdm.

hripcsa · October 1, 2016, 2:44pm

I added a time proposal in the slot at #27 http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm .

Christian_Reich · October 1, 2016, 2:49pm

Thanks, @hripcsa.

I am with @Patrick_Ryan. The perfect is the enemy of the good, and the good is 98% of the situations/use cases. We need a solution that is in addition, not instead of the current situation.

hripcsa · October 1, 2016, 2:50pm

@Christian_Reich, looks like #15 and #27 need to be merged. #15 states the problem, and #27 describes the details.

bailey · October 1, 2016, 3:20pm

@Patrick_Ryan That’s fine with me. Just to articulate the other side of @hripcsa 's #1: if we were looking for minimal change, we could solve the analytic problems with two changes:

Add a birth_date field to person., with appropriate conventions for defaulting (e.g. if you only know year of birth, store year-01-01 or somesuch.
Change the semantics of all existing *_date from DATE to DATETIME.

The upsides of this approach are that we’d get efficient calculation of intervals, sub-date precision when we needed it, and no added space consumption for the (common?) case that we’re just dealing with dates. The practical downside is that existing code that relies on a_date - b_date yielding an integer will break, since most RDBMS I’ve encountered use some interval type for the difference in datetimes, which may or may not (looking at you, Postgres) quietly cast to an integer when needed. The informational downside is that we won’t know what the precision of dates are within the data model. (There’s also the fact that DATETIME types have a shorter span of actual dates they can cover than a DATE type, but I don’t see that as a problem for OHDSI; by the time it’s a problem native types will probably be wider. )

When we extended the CDM for PEDSnet, we went with a separate set of *_time fields because we thought that the change in date math would break code that assumed it was dealing with DATE types. That’s still a real consideration, but I can see an argument that the pain of a flag day now is worth the gain of having a single place for temporal information going forward

To be clear about @rimma 's proposal, the *_datetime fields will have datetime semantics, so you don’t actually have to consult the precision field to do date math; it’s just there as advisory metadata if you want it. By biggest reservation is that it consumes row-level storage for something that’s likely constant across large swathes of data, so may be more appropriate for a metadata table. The question is how often you’d want it while doing, versus how often you want to record it so the user will know about it.

To @hripcsa 's other points:

(markdown bait)
Agree
I’m not as worried about _dt. I agree it’s not as clear on first reading, but it’s fine once you know it, so we’re talking about a one-time entry cost. And it’s a much lower cognitive barrier than x_concept_id vs x_type_concept_id vs x_source_concept_id, for instance.
I like the idea of deprecating and eventually removing the _date fields, since it sounds like the only reason to separate them is backwards compatibility. (Well, we also need to teach SqlRender when to use extract(days from a - b) or similar, but that’s manageable. I do wonder whether the precision fields might be candidates for extending cdm_version, though that’s got its own set of edge cases. Minimally, it gets us to something like PCORnet’s “harvest” table, which looks like a lot of bookkeeping, though it’s only once per instance, and if most of it’s optional, can be used only when needed.