Proposal for Cost Table Adjustment

Klaus · October 24, 2016, 3:55pm

We have the design principle that all domain related data tables are person centric where for each record the person_id and a date are captured at a minimum.

We deviate from this design principle in the COST table. The COST table refers to the id in the data tables. Even though a record in a data table is uniquely identified by the id in that table an additional person_id and ideally a date would be beneficial:

Maybe an additional person_id will make the SQL statement in database
systems faster.
More importantly, a missing person_id prevents other
lightweight non-database systems to process all transaction tables
chunk wise.

The idea is to add the person_id and ideally the date of event to the COST table.
What do you think?
Klaus

DTorok · October 24, 2016, 4:59pm

Reason NOT to make the suggested change is that adding the person id and date will further de-normalize the OMOP CDM Schema. Normalization make it harder for the database to have contradictory information and by eliminating redundancy reduces the size of the dataset. I think the reason for making the suggested change should be more substantial if the change will de-normalize the cost table.

Chris_Knoll · October 24, 2016, 5:39pm

I’d like to voice support for adding person_id to the cost table. In certain database environments (specifically Massively Parallel Processing (MPP) that we see in redshift and MS PDW (and possibly HiveDB on Hadoop)) we want to be able to hash on a field of the table. By NOT having a person id in the cost table, we can’t hash the data such that the data for a given person can be co-located on the same cluster. By HAVING this person_id field, we can define the table as hashed on person_id.

-Chris

DTorok · October 24, 2016, 7:38pm

Current rules of the CDM allow adding columns to a sites tables if they
have a particular need. So In certain database environments (specifically
Massively Parallel Processing (MPP) like redshift and MS PDW (and possibly
HiveDB on Hadoop)) add the Person Id so that you can hash on the field. Do
not see a problem, but do not think adding the column to the ‘standard’ CDM
is a good idea.

jenniferduryea · October 24, 2016, 8:13pm

Personally, I do not have an issue with adding or not adding the person_id field to the Cost table. Whatever makes it easier is my motto

However, I have a question about @Klaus’s proposal to add an event_date to the Cost table. Will this event_date be the date of when the cost was actually made? Payments can be made months after the actual procedure/service. But I’m not sure why this information is important, unless you are studying medical billing practices.

Or is this event_date the same date value as the referenced record ID from the cost_event_id? If it is the latter, should you include a start and end date (similar to the visit_occurrence or observation records) in this request? Will it ever be different?

Chris_Knoll · October 24, 2016, 9:27pm

Hi, that’s a good idea. The problem is that I don’t think that the query optimizer will leverage a hashing column if the column isn’t applied in the query. So if i’m trying to calculate an average cost of something by person from the cost table, i’d have to go to the drug_exposure table, join back to cost (on the domain=“drug”) and the drug_exposure_id (neither of these are hashed) which I am pretty sure will cause a sort of ‘shuffle’ move of data from the cost table (which can only be hashed on a cost_id) UNLESS i also can say where cost.person_id = drug_exposure.person_id.

I could do this specifically at my site, and then write queries to perform that join, but then I can’t put that logic in any of our OHDSI tools because it’s not a standard column, so I’m left not able to use it anyways (we only want to write queries that can be shared in a network).

-Chris

Klaus · October 27, 2016, 10:04am

Hi Jennifer,

Ideally I would like to have the same event that is used in the data table, for example drug_exposure_start_date in the case of the drug_exposure table.

Best,
Klaus

Klaus · October 27, 2016, 10:23am

@DTorok:
I see your valid point.
However, if we want to focus on normalization, we wouldn’t design the cost table that way. We would rather use a cost_type and one cost_value instead of many COST entries, which are partly used in databases outside the U.S.

Having that massive de-normalized COST table an additional person_id would not add much size but would rather enable us to process huge datasets without the need of spending cost expensive hardware and software resources for building up huge databases for analyzing the data.

Gowtham_Rao · March 9, 2017, 2:42am

@klaus @chris_knoll I support the idea is putting person_id for multiple reasons including the performance in database appliances that use hashing and to continue the person centric convention of every table.

We should make this a proposal that is a topic for the F2F. We really like the cost table, but the absence of the person_id makes it difficult to use.

Is it true that OHDSI applications like Atlas don’t use the cost table because of the absence of person_id? Currently it is not possible to use Atlas to build a cohort who have had visits with amount > 100,000$s for example.

Gowtham_Rao · March 9, 2017, 2:56am

I think the actual service dates e.g. procedure date, visit date or drug dispensation date should be in their respective tables as is.

I do however see value for two dates in this table. The date billed and date paid. These are important information for health plan actuarial analysis that use it to estimate IBNR. Incurred but not reported - Wikipedia

Happy to make a proposal in a different thread to request this. There are a lot of use cases for this date enhancement from health economics and actuarial standpoint.

Chris_Knoll · March 9, 2017, 4:37am

No, the tools could use cost ID, it would just have to join to the appropriate domain table to figure out who the cost is associated with. It just hasn’t been implemented yet because there hasn’t been a defined analytical use case for it and that sort of thing drives the priorities.

Gowtham_Rao · March 9, 2017, 11:14am

Thank you @chris_knoll and group

In addition to what you already mentioned above, what are the reasons to add person_id to cost table?

query efficiency and ease of use from analyst point of view
hashing in appliances that distribute based on person_id
general convention of representing person_id in OMOP tables

What else? Would like to propose adding person_id at F2F.

Chris_Knoll · March 9, 2017, 7:36pm

I think all of those are reason enough. I think for it to get incorporated into the actual CDM schema, we just need to drum up enough support for it and push it through the WG.

Gowtham_Rao · March 10, 2017, 3:02pm

Yup - see proposal

http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm:add_person_to_cost

Please make changes/improve

Gowtham_Rao · March 10, 2017, 10:19pm

@DTorok
when you say normalized cost table - are you refering to something like this?

Normalized cost table

How would the table look like?

Christian_Reich · March 11, 2017, 3:00pm

@Gowtham_Rao

Added Klaus’ last name. Other than that - pretty clean. The only other caveat is the possibility of database integrity conflict: The person_id passed through the event table might be different from the one in the COST table. We need to add it to the constraints (slow), a warning in the description or data quality tools like ACHILLES HEELS, which should watch that kind of thing. Otherwise it will happen somewhere.

Can you add that to the proposals? I think it’s the right thing to do? I hate those strings. Same in FACT_RELATIONSHIP. Which I also hate.

Gowtham_Rao · March 11, 2017, 4:33pm

Will add normalized table to cost table proposal. Agree, this is probably how it should be

Can we make cost it an agenda item at F2F next week?

Christian_Reich · March 11, 2017, 4:40pm

You got it.

Klaus · March 13, 2017, 11:48am

Thanks for supporting this.
Can we make one of the datetime fields mandatory to fully satisfy our design principle of having a person_id and a date?

Gowtham_Rao · March 13, 2017, 11:58am

I would say that majority of secondary data sources will not have financial transaction dates. If we make it mandatory, then a default date will need to be forced thru some form of imputation.

Why is date mandatory?