
Proposal to add a new table - Dimension Attribute table into CDM

I would like to bring this topic to the community to discuss.

The source data for my recent OMOP conversion is a data warehouse from one of the largest EHR vendors in the US. It includes care site attributes such as hospital bed size range (200-299, 300-499, 500+, etc.), teaching hospital indicator, and rural-urban indicator. I asked them how these data elements are used, and they provided me with use cases. I also think attributes describing care site, location, and provider can be very useful in healthcare research. Some example attributes are listed below; many of them have been used in the research literature:

PROVIDER

  • Years in practice

  • Years practicing in current setting

  • Doctor’s rating (WebMD, Healthgrades etc.)

  • Language spoken

  • Job title (director, department chairman etc.)

  • NPI equivalent in other countries

CARE_SITE

  • Hospital bed size range (200-299, 300-499, 500+ etc.)

  • Teaching hospital indicator

  • Rural-urban indicator

  • Practice type (solo practice, episodic care practice/walk-in clinic etc.)

  • Total number of clinical and administrative staff

  • Number of patient visits per week in primary practice

  • Focused practice scope (yes/no)

LOCATION

  • Rurality (metropolitan, suburban, rural etc.)

  • Accessibility to health care

  • Population density

  • Air Quality Index (AQI)

  • Average temperature

  • Annual days of sunlight exposure (ASE)

  • Region

  • International address format (District, Province, Region etc.)

  • Latitude

  • Longitude

  • Altitude (height above sea level)

And the list goes on and on. Currently our person-centric OMOP CDM does not support these data elements. One choice is to expand the current dimension tables by adding all these attributes to their respective tables, i.e., Provider, Care_site and Location. But that is disruptive and also not economical, since in many cases the source data does not provide this information. So I am proposing to add a new row-based table, called Dimension_Attribute (or any name you want to call it), as a comprehensive way to include these attributes when the source data does provide them. It also preserves the original dimension table structure and is backward compatible. The table would have the following columns:

  • Domain_id - This will have one of the 3 values: Provider, Care_Site or Location

  • Dimension_id - This is the same id number as in their respective dimension tables, i.e., Provider_id, Care_Site_id or Location_id

  • Attribute_concept_id - Concept_id for the attribute, e.g., 55556666 for Air Quality Index (AQI)

  • Attribute_source_value - Source value for the attribute, e.g., Air Quality Index

  • Value_concept_id - Concept_id for the value of the attribute, e.g., 22223333 for AQI 0-50; 33334444 for AQI 51-100; 44445555 for AQI 201-300 etc.

  • Value_source_value - Source data value for the attribute, e.g., AQI 35, AQI 95, AQI 268 etc.

  • Attribute_dt - The date when the attribute value was reported. It is optional.
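
As a rough illustration of the column list above, here is a minimal sketch in Python using the standard-library sqlite3 module. The column types, the CHECK constraint, the in-memory database, and the sample row (including the location id and the date) are assumptions for illustration only. The concept ids are the placeholder numbers from the examples above, not real OMOP concepts.

```python
import sqlite3

# Hypothetical DDL for the proposed table; column types are assumptions.
DDL = """
CREATE TABLE dimension_attribute (
    domain_id              TEXT    NOT NULL
        CHECK (domain_id IN ('Provider', 'Care_Site', 'Location')),
    dimension_id           INTEGER NOT NULL,  -- provider_id, care_site_id or location_id
    attribute_concept_id   INTEGER NOT NULL,
    attribute_source_value TEXT,
    value_concept_id       INTEGER,
    value_source_value     TEXT,
    attribute_dt           TEXT               -- optional, ISO date string
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Placeholder row: location 1 reported AQI 35 (made-up id and date).
conn.execute(
    "INSERT INTO dimension_attribute VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Location", 1, 55556666, "Air Quality Index", 22223333, "AQI 35", "2018-01-01"),
)

# Look up every attribute value recorded for that location.
rows = conn.execute(
    "SELECT value_source_value FROM dimension_attribute "
    "WHERE domain_id = 'Location' AND dimension_id = 1"
).fetchall()
print(rows)
```

Adding a new kind of attribute in this layout means inserting rows, not altering the table.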

For a detailed table structure and examples, please see the spreadsheet attached.

Dimension attribute.xlsx (16.7 KB)

Since the table is row-based rather than column-based, attributes can be added as needed without changing the table structure. As a general rule, the table should mostly contain values provided by the source data, not derived values. However, if absolutely needed, such as in the case of spatial epidemiology (Geospaital functionality in Atlas: integration of AEGIS · Issue #649 · OHDSI/WebAPI · GitHub), this table can accommodate derived values as well.

I am not sure if this has been proposed before. If so, I apologize for not acknowledging it; I was not able to find it. Please feel free to give your opinion on this.

Thanks,

@QI_omop:

Good stuff. We should think about these. But let’s tidy things up a little bit.

Can you share, please?

Obviously, in longitudinal data this won’t work, because it will change. Every year it will change by a year. So, if anything we would have date_of_graduation or something.

Not sure. Those tend to be the things we call “evidence”, and we would want to produce. What WebMD etc. have is called “hearsay” or “rumor”.

Can you conceptualize that? Do you have a good list?

Yeah, this is a bad term. What would you call it?

Sounds like something that belongs to Location. Plus, it’s not data, it’s reference. Because it doesn’t change in the time frames we are talking here.

That’s Visit in our lingo now. Used to be Place of Service. And it is not an attribute of the Care Site, because one Care Site can have many different constellations.

Sounds like a read-out, not reference data.

Not sure what that is. Focused on what? Opposed to what? “All over the place”?

Not sure what that is, and probably read-out, rather than input.

Not sure if that has anything to do with longitudinal patient data.

What’s that?

We got that.

Again, not sure belongs here.

Well, bring it on. We tend to not like EAV structures like that. Instead, you might just make a proposal to add:

PROVIDER

  • valid_start_date
  • job_title_concept_id

CARE_SITE

  • size_concept_id
  • teaching_indicator

LOCATION

  • rurality_concept_id
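
A minimal sketch of what those explicit columns might look like, in Python with the standard-library sqlite3 module. The tables are simplified stand-ins that hold only a key plus the suggested columns, and the column types are assumptions, not part of any CDM specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for the CDM tables: only the key and the suggested columns.
conn.executescript("""
CREATE TABLE provider (
    provider_id          INTEGER PRIMARY KEY,
    valid_start_date     TEXT,
    job_title_concept_id INTEGER
);
CREATE TABLE care_site (
    care_site_id       INTEGER PRIMARY KEY,
    size_concept_id    INTEGER,
    teaching_indicator TEXT
);
CREATE TABLE location (
    location_id         INTEGER PRIMARY KEY,
    rurality_concept_id INTEGER
);
""")

# The available attributes are visible from the schema alone, without reading any rows.
care_site_columns = [row[1] for row in conn.execute("PRAGMA table_info(care_site)")]
print(care_site_columns)
```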

@Christian_Reich

Regarding the hospital attribute use case, I will ask the client's permission to post it. As for the other attributes, we can discuss in detail what their definitions are and how they have been used in clinical research.

But the main point I want to bring up is that by adding this table (Dimension_Attribute) to the CDM, we have a holistic approach to all the dimension attribute issues listed out there, such as the ones below:

Region_concept_id - hotly debated right now.

Specialty & Add Clinical Title #49

Multiple Addresses per Provider #48

These are just a few examples. I am sure in the future people will want to add more dimension attributes to the existing dimension tables. By adding this new table, we provide a solution to all.

@QI_omop:

Very well understood. And the conversation is not new. Early on, during OMOP times, we had lengthy discussions on how much we want to be EAV (allowing any concept to be represented any number of times; i2b2 does it that way) versus explicit fields with detailed definitions. We decided on the latter. The former has the advantage of greater flexibility, as you properly pointed out. But there is a huge cost: before you can run an analytic, you have to query the data to infer the “local actual model”, so to speak.
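
To make that cost concrete, here is a minimal sketch in Python with the standard-library sqlite3 module. The table is a cut-down version of the proposed Dimension_Attribute table (concept id columns omitted), and the rows are made up for illustration. The analytic first has to discover which attributes this particular dataset contains, then pivot them, and it still has to cope with sites that lack an attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dimension_attribute ("
    "domain_id TEXT, dimension_id INTEGER, "
    "attribute_source_value TEXT, value_source_value TEXT)"
)

# Made-up rows: care site 1 reports two attributes, care site 2 only one.
conn.executemany(
    "INSERT INTO dimension_attribute VALUES (?, ?, ?, ?)",
    [
        ("Care_Site", 1, "Teaching hospital indicator", "Y"),
        ("Care_Site", 1, "Hospital bed size range", "200-299"),
        ("Care_Site", 2, "Hospital bed size range", "500+"),
    ],
)

# Step 1: discover which attributes this dataset actually contains.
available = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT attribute_source_value FROM dimension_attribute "
        "WHERE domain_id = 'Care_Site' ORDER BY attribute_source_value"
    )
]

# Step 2: pivot rows into one record per care site; missing attributes stay None.
sites = {}
for dim_id, attr, value in conn.execute(
    "SELECT dimension_id, attribute_source_value, value_source_value "
    "FROM dimension_attribute WHERE domain_id = 'Care_Site'"
):
    sites.setdefault(dim_id, dict.fromkeys(available))[attr] = value

print(available)
print(sites)
```

With explicit columns, step 1 disappears: the analytic knows from the table definition which fields exist in every dataset.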

Let’s take your example: Multiple addresses per provider. For the ETLer it would be a huge relief not to have to figure out which location to pick per provider, since the providers legitimately can have more than one Care Site. For example, the surgeon can work in an office or an ambulatory surgery, but can also be an attendant in a hospital ward. For the analytic, this is a huge mess. It is impossible to figure out at analysis time what to do with those 3 locations. In most cases, they are very close to each other anyway, so the usual use case of distance between the patient’s home and the provider location will result in three almost identical distances. In other words, this detail is useless. Therefore, we decided to remove the location from the provider altogether. This will put more pressure on figuring out the right location for the Care Site, but I think this will result in an improved CDM dataset.

Happy to engage more in this debate.
