We have the design principle that all domain related data tables are person centric where for each record the person_id and a date are captured at a minimum.
We deviate from this design principle in the COST table. The COST table refers to the id in the data tables. Even though a record in a data table is uniquely identified by the id in that table an additional person_id and ideally a date would be beneficial:
Maybe an additional person_id will make the SQL statement in database
systems faster.
More importantly, a missing person_id prevents other
lightweight non-database systems to process all transaction tables
chunk wise.
The idea is to add the person_id and ideally the date of event to the COST table.
What do you think?
Klaus
Reason NOT to make the suggested change is that adding the person id and date will further de-normalize the OMOP CDM Schema. Normalization make it harder for the database to have contradictory information and by eliminating redundancy reduces the size of the dataset. I think the reason for making the suggested change should be more substantial if the change will de-normalize the cost table.
Iād like to voice support for adding person_id to the cost table. In certain database environments (specifically Massively Parallel Processing (MPP) that we see in redshift and MS PDW (and possibly HiveDB on Hadoop)) we want to be able to hash on a field of the table. By NOT having a person id in the cost table, we canāt hash the data such that the data for a given person can be co-located on the same cluster. By HAVING this person_id field, we can define the table as hashed on person_id.
Current rules of the CDM allow adding columns to a sites tables if they
have a particular need. So In certain database environments (specifically
Massively Parallel Processing (MPP) like redshift and MS PDW (and possibly
HiveDB on Hadoop)) add the Person Id so that you can hash on the field. Do
not see a problem, but do not think adding the column to the āstandardā CDM
is a good idea.
Personally, I do not have an issue with adding or not adding the person_id field to the Cost table. Whatever makes it easier is my motto
However, I have a question about @Klausās proposal to add an event_date to the Cost table. Will this event_date be the date of when the cost was actually made? Payments can be made months after the actual procedure/service. But Iām not sure why this information is important, unless you are studying medical billing practices.
Or is this event_date the same date value as the referenced record ID from the cost_event_id? If it is the latter, should you include a start and end date (similar to the visit_occurrence or observation records) in this request? Will it ever be different?
Hi, thatās a good idea. The problem is that I donāt think that the query optimizer will leverage a hashing column if the column isnāt applied in the query. So if iām trying to calculate an average cost of something by person from the cost table, iād have to go to the drug_exposure table, join back to cost (on the domain=ādrugā) and the drug_exposure_id (neither of these are hashed) which I am pretty sure will cause a sort of āshuffleā move of data from the cost table (which can only be hashed on a cost_id) UNLESS i also can say where cost.person_id = drug_exposure.person_id.
I could do this specifically at my site, and then write queries to perform that join, but then I canāt put that logic in any of our OHDSI tools because itās not a standard column, so Iām left not able to use it anyways (we only want to write queries that can be shared in a network).
Ideally I would like to have the same event that is used in the data table, for example drug_exposure_start_date in the case of the drug_exposure table.
@DTorok:
I see your valid point.
However, if we want to focus on normalization, we wouldnāt design the cost table that way. We would rather use a cost_type and one cost_value instead of many COST entries, which are partly used in databases outside the U.S.
Having that massive de-normalized COST table an additional person_id would not add much size but would rather enable us to process huge datasets without the need of spending cost expensive hardware and software resources for building up huge databases for analyzing the data.
@klaus@chris_knoll I support the idea is putting person_id for multiple reasons including the performance in database appliances that use hashing and to continue the person centric convention of every table.
We should make this a proposal that is a topic for the F2F. We really like the cost table, but the absence of the person_id makes it difficult to use.
Is it true that OHDSI applications like Atlas donāt use the cost table because of the absence of person_id? Currently it is not possible to use Atlas to build a cohort who have had visits with amount > 100,000$s for example.
I think the actual service dates e.g. procedure date, visit date or drug dispensation date should be in their respective tables as is.
I do however see value for two dates in this table. The date billed and date paid. These are important information for health plan actuarial analysis that use it to estimate IBNR. Incurred but not reported - Wikipedia
Happy to make a proposal in a different thread to request this. There are a lot of use cases for this date enhancement from health economics and actuarial standpoint.
No, the tools could use cost ID, it would just have to join to the appropriate domain table to figure out who the cost is associated with. It just hasnāt been implemented yet because there hasnāt been a defined analytical use case for it and that sort of thing drives the priorities.
I think all of those are reason enough. I think for it to get incorporated into the actual CDM schema, we just need to drum up enough support for it and push it through the WG.
Added Klausā last name. Other than that - pretty clean. The only other caveat is the possibility of database integrity conflict: The person_id passed through the event table might be different from the one in the COST table. We need to add it to the constraints (slow), a warning in the description or data quality tools like ACHILLES HEELS, which should watch that kind of thing. Otherwise it will happen somewhere.
Can you add that to the proposals? I think itās the right thing to do? I hate those strings. Same in FACT_RELATIONSHIP. Which I also hate.
I would say that majority of secondary data sources will not have financial transaction dates. If we make it mandatory, then a default date will need to be forced thru some form of imputation.