Proposal to add a new table - Dimension Attribute table into CDM

QI_omop · April 13, 2019, 1:15am

I would like to bring this topic to the community to discuss.

The source data for my recent OMOP conversion is a data warehouse from one of the largest EHR vendors in the US. It includes care site attributes such as hospital bed size range (200-299, 300-499, 500+, etc.), a teaching hospital indicator, and a rural-urban indicator. I asked the vendor how these data elements are used, and they provided me with use cases. I also think attributes describing care sites, locations, and providers can be very useful in healthcare research. Some example attributes are listed below; many of them have been used in the research literature:

PROVIDER

Years in practice
Years practicing in current setting
Doctor’s rating (WebMD, Healthgrades etc.)
Language spoken
Job title (director, department chairman etc.)
NPI equivalent in other countries

CARE_SITE

Hospital bed size range (200-299, 300-499, 500+ etc.)
Teaching hospital indicator
Rural-urban indicator
Practice type (solo practice, episodic care practice/walk-in clinic etc.)
Total number of clinical and administrative staff
Number of patient visits per week in primary practice
Focused practice scope (yes/no)

LOCATION

Rurality (metropolitan, suburban, rural etc.)
Accessibility to health care
Population density
Air Quality Index (AQI)
Average temperature
Annual days of sunlight exposure (ASE)
Region
International address format (District, Province, Region etc.)
Latitude
Longitude
Altitude (height above sea level)

And the list goes on. Currently, our person-centric OMOP CDM does not support these data elements. One option is to expand the current dimension tables by adding all of these attributes as columns to their respective tables, i.e., the Provider, Care_site and Location tables. But that is disruptive, and also not economical, since in many cases the source data do not provide this information. So I am proposing a new row-based table, called Dimension_Attribute (or any name you want to call it), as a comprehensive way to include these attributes when the source data do provide them. It preserves the original dimension table structure and is backward compatible. The table would have the following columns:

Domain_id - This will have one of the 3 values: Provider, Care_Site or Location
Dimension_id - This is the same id number as in their respective dimension tables, i.e., Provider_id, Care_Site_id or Location_id
Attribute_concept_id - Concept_id for the attribute, e.g., 55556666 for Air Quality Index (AQI)
Attribute_source_value - Source value for the attribute, e.g., Air Quality Index
Value_concept_id – Concept_id for the value of the attribute, e.g., 22223333 for AQI 0-50; 33334444 for AQI 51-100; 44445555 for AQI 201-300 etc.
Value_source_value – Source data value for the attribute, e.g., AQI 35, AQI 95, AQI 268 etc.
Attribute_dt – This is the date when attribute value was reported. It is optional.
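
As a rough illustration of the column list above (not taken from the attached spreadsheet), the table could be sketched as follows using Python's built-in `sqlite3` module. The column types, the `CHECK` constraint, the location ids and the date are assumptions made for the example; the concept ids are the placeholder values quoted above, not real OMOP concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical DDL: column names follow the proposal, types are assumed.
conn.execute("""
    CREATE TABLE dimension_attribute (
        domain_id              TEXT    NOT NULL
            CHECK (domain_id IN ('Provider', 'Care_Site', 'Location')),
        dimension_id           INTEGER NOT NULL,  -- provider_id, care_site_id or location_id
        attribute_concept_id   INTEGER NOT NULL,
        attribute_source_value TEXT,
        value_concept_id       INTEGER,
        value_source_value     TEXT,
        attribute_dt           TEXT               -- optional reporting date
    )
""")

# Three made-up locations carrying the AQI attribute from the example values.
rows = [
    ("Location", 1, 55556666, "Air Quality Index", 22223333, "AQI 35", "2019-01-15"),
    ("Location", 2, 55556666, "Air Quality Index", 33334444, "AQI 95", None),
    ("Location", 3, 55556666, "Air Quality Index", 44445555, "AQI 268", None),
]
conn.executemany(
    "INSERT INTO dimension_attribute VALUES (?, ?, ?, ?, ?, ?, ?)", rows
)

# Look up the AQI band recorded for each location.
aqi_by_location = conn.execute(
    """
    SELECT dimension_id, value_concept_id, value_source_value
    FROM dimension_attribute
    WHERE domain_id = 'Location' AND attribute_concept_id = 55556666
    ORDER BY dimension_id
    """
).fetchall()
```

Adding a new attribute here means inserting rows with a new `attribute_concept_id`; no column is added to Provider, Care_Site or Location.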

For a detailed table structure and examples, please see the spreadsheet attached.

Dimension attribute.xlsx (16.7 KB)

Since the table is row-based rather than column-based, attributes can be added as needed without changing the table structure. As a general rule, the table should mostly contain values provided by the source data asset, not derived values. However, if absolutely needed, as in the case of spatial epidemiology (Geospatial functionality in Atlas: integration of AEGIS · Issue #649 · OHDSI/WebAPI · GitHub), this table can accommodate derived values as well.

I am not sure if this has been proposed before. If it has, I apologize for not acknowledging it; I was not able to find it. Please feel free to give your opinion on this.

Thanks,

Christian_Reich · April 13, 2019, 9:55pm

QI_omop:

Good stuff. We should think about these. But let’s tidy things up a little bit.

Can you share, please?

Obviously, in longitudinal data this won’t work, because it will change. Every year it will change by a year. So, if anything we would have date_of_graduation or something.

Not sure. Those tend to be the things we call “evidence”, and we would want to produce. What WebMD etc. have is called “hearsay” or “rumor”.

Can you conceptualize that? Do you have a good list?

Yeah, this is a bad term. What would you call it?

Sounds like something that belongs to Location. Plus, it’s not data, it’s reference. Because it doesn’t change in the time frames we are talking here.

That’s Visit in our lingo now. Used to be Place of Service. And it is not an attribute of the Care Site, because one Care Site can have many different constellations.

Sounds like a read-out, not reference data.

Not sure what that is. Focused on what? Opposed to what? “All over the place”?

Not sure what that is, and probably read-out, rather than input.

Not sure if that has anything to do with longitudinal patient data.

What’s that?

We got that.

Again, not sure belongs here.

Well, bring it on. We tend to not like EAV structures like that. Instead, you might just make a proposal to add:

PROVIDER

valid_start_date
job_title_concept_id

CARE_SITE

size_concept_id
teaching_indicator

LOCATION

rurality_concept_id
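
To make the explicit-column alternative concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only `size_concept_id` and `teaching_indicator` come from the list above; the pared-down `care_site` table, the column types, the site names and the concept id `90000001` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Pared-down stand-in for the CARE_SITE table (the real table has more columns).
conn.execute(
    "CREATE TABLE care_site (care_site_id INTEGER PRIMARY KEY, care_site_name TEXT)"
)

# The suggested additions become ordinary, explicitly defined columns.
conn.execute("ALTER TABLE care_site ADD COLUMN size_concept_id INTEGER")
conn.execute("ALTER TABLE care_site ADD COLUMN teaching_indicator TEXT")

conn.executemany(
    "INSERT INTO care_site VALUES (?, ?, ?, ?)",
    [
        (1, "Hospital A", 90000001, "Y"),  # 90000001 is a made-up concept id
        (2, "Clinic B", None, "N"),        # size unknown in the source
    ],
)

# An analysis can filter on the column directly, with no lookup of which
# attributes happen to be populated.
teaching_sites = conn.execute(
    "SELECT care_site_id FROM care_site WHERE teaching_indicator = 'Y'"
).fetchall()

# The available attributes are visible in the schema itself.
columns = [row[1] for row in conn.execute("PRAGMA table_info(care_site)")]
```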

QI_omop · April 19, 2019, 3:39pm

@Christian_Reich

Regarding the hospital attribute use case, I will ask the client’s permission to post it. As for the other attributes, we can discuss in detail what their definitions are and how they have been used in clinical research.

But the main point I want to bring up is that by adding this table (Dimension_Attribute) to the CDM, we have a holistic approach to all the dimension attribute issues listed out there, such as the ones below:

Region_concept_id - hotly debated right now.

github.com/OHDSI/CommonDataModel

Location table changes (add region_concept_id)

opened 06:28PM - 13 Mar 19 UTC

closed 04:32PM - 01 May 23 UTC

pavgra

Accepted

## Location table changes (add region_concept_id)

**Proposal Owner:** Pavel Grafkin, Gowtham Rao

**Discussion:** https://github.com/OHDSI/CommonDataModel/issues/220, https://github.com/OHDSI/WebAPI/issues/649

**Proposal overview:**

- Add `region_concept_id` column to `location` table

**Description**

We would like to use administrative areas as part of cohort entry event, cohort inclusion criteria, enhance cohort characterization using the areas, create heatmaps and do clustering based on the areas.

To achieve this, as a first step, Geo Vocabularies were proposed (https://github.com/OHDSI/WebAPI/issues/649#issuecomment-440757591) and implemented (https://github.com/OHDSI/Vocabulary-v5.0/pull/207). The geo vocabularies represent ontologies of administrative areas (e.g. Country -> State -> County -> Township).

The second necessary step is linkage of records in `location` table to the administrative areas. Even though the relations of geo concepts and locations represent derived information, not all of OHDSI supported DBs have geo capabilities (and therefore we cannot compute the relations between a location and geo concepts inside DB - https://github.com/OHDSI/WebAPI/issues/649#issuecomment-440757591) plus the computation of the relations is pretty compute intensive to do it on-demand, therefore there is a need of pre-calculation and physical reference storage. We need to store only a link from a location to its lower-level administrative area (concept_id). Other, higher level administrative areas, can be retrieved via `concept_ancestor`.

**Location to administrative area pre-calculation** *(using Postgres)*

1. Load a CSV with concept_id to polygon relations (https://github.com/OHDSI/Vocabulary-v5.0/pull/207#issuecomment-469859598) into the table:

```
CREATE TABLE area_polygon (
  concept_id INTEGER,
  polygon VARCHAR
);
```

(where `polygon` stores GeoJSON)

2. Precalculate relations between LOCATION and lowest level areas:

```
CREATE INDEX area_polygon_geom ON area_polygon USING GIST (ST_SetSrid(ST_GeomFromGeoJSON(area_polygon.polygon), 4326));
CREATE INDEX location_point ON location USING GIST (ST_SetSrid(ST_MakePoint(longitude, latitude), 4326));

WITH lowest_level_areas AS (
  SELECT area.*
  FROM area_polygon area
  WHERE NOT EXISTS (
    SELECT *
    FROM concept_ancestor
    WHERE ancestor_concept_id = area.concept_id AND min_levels_of_separation > 0
  )
),
loc_area AS (
  SELECT l.location_id, area.concept_id
  FROM location l
  JOIN lowest_level_areas area
    ON ST_CONTAINS(ST_SetSrid(ST_GeomFromGeoJSON(area.polygon), 4326), ST_SetSrid(ST_MakePoint(l.longitude, l.latitude), 4326))
)
UPDATE location
SET region_concept_id = la.concept_id
FROM loc_area la
WHERE location.location_id = la.location_id;
```

**Proposed implementation** https://github.com/OHDSI/CommonDataModel/pull/251

Specialty & Add Clinical Title #49

github.com/OHDSI/Themis

Specialty & Add Clinical Title

opened 03:44PM - 27 Nov 18 UTC

closed 04:50PM - 05 Jul 23 UTC

ericaVoss

| TYPE | NOTES |
| -- | -- |
| ITEM | Provider specialty not well captured in the Vocabulary. |
| FORUM POST | http://forums.ohdsi.org/t/provider-specialty-code-set-clean-up/3888 <br> http://forums.ohdsi.org/t/care-sites-and-specialty-specialty-code-clean-up/4538/4 <br> http://forums.ohdsi.org/t/new-comprehensive-hierarchy-for-providers-visits-and-place-of-service-specialty-care-site/5633 |
| SOLUTION | Incorporate NUCC, ABMS, HES Specialty, and Specialty-CMS. The order of preference should be OMOP->Specialty->Place of Service->NUCC->ABMS->HES Specialty->UB04 |
| NEXT STEPS | Ultimately the Vocabulary team will work on. Double check this has been implemented in the Vocabulary. Maybe we need to remove duplicates. |

Multiple Addresses per Provider #48
https://github.com/OHDSI/Themis/issues/48

These are just a few examples. I am sure in the future people will want to add more dimension attributes to the existing dimension tables. By adding this new table, we provide a solution to all.

Christian_Reich · April 19, 2019, 9:55pm

@QI_omop:

Very well understood. And the conversation is not new. Early on, during OMOP times, we had lengthy discussions on how much we want to be EAV (allowing any concept to be represented any number of times; i2b2 does it that way) versus having explicit fields with detailed definitions. We decided on the latter. The former has the advantage of greater flexibility, as you rightly pointed out. But there is a huge cost: before you can run an analytic, you have to query the data to infer the “local actual model”, so to speak.
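
A small sketch of that cost, using Python's built-in `sqlite3` module: with an EAV table, the analysis first has to discover which attributes a given dataset actually contains before it can shape the data. The table is a cut-down version of the proposed Dimension_Attribute, and the attribute concept ids `1001` (bed size range) and `1002` (teaching indicator) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dimension_attribute (
        domain_id            TEXT,
        dimension_id         INTEGER,
        attribute_concept_id INTEGER,
        value_source_value   TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO dimension_attribute VALUES (?, ?, ?, ?)",
    [
        ("Care_Site", 1, 1001, "300-499"),  # hypothetical: bed size range
        ("Care_Site", 1, 1002, "Y"),        # hypothetical: teaching indicator
        ("Care_Site", 2, 1002, "N"),        # site 2 has no bed size recorded
    ],
)

# Step 1: infer the "local actual model", i.e. which attributes exist here.
present = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT attribute_concept_id
        FROM dimension_attribute
        WHERE domain_id = 'Care_Site'
        ORDER BY attribute_concept_id
        """
    )
]

# Step 2: only now can the rows be pivoted into one record per care site.
pivot = {}
for dimension_id, concept_id, value in conn.execute(
    """
    SELECT dimension_id, attribute_concept_id, value_source_value
    FROM dimension_attribute
    WHERE domain_id = 'Care_Site'
    """
):
    pivot.setdefault(dimension_id, dict.fromkeys(present))[concept_id] = value
```

Step 1 can return a different attribute list at every site, so an analysis written against one dataset may not run unchanged against another. With explicit columns that step disappears.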

Let’s take your example: Multiple addresses per provider. For the ETLer it would be a huge relief not to have to figure out which location to pick per provider, since providers can legitimately have more than one Care Site. For example, a surgeon can work in an office or an ambulatory surgery, but can also be an attending in a hospital ward. For the analytic, this is a huge mess. It is impossible to figure out at analysis time what to do with those 3 locations. In most cases, they are very close to each other anyway, so the usual use case of distance between the patient’s home and the provider location will result in three almost identical distances. In other words, this detail is useless. Therefore, we decided to remove the location from the provider altogether. This will put more pressure on figuring out the right location for the Care Site, but I think this will result in an improved CDM dataset.
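
The “almost identical distances” point can be illustrated with a short great-circle calculation. All coordinates below are made up: one patient home and three provider sites within roughly a kilometre or two of each other. The haversine formula with a mean Earth radius of 6371 km is an assumption of this sketch, not something the CDM prescribes.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical coordinates, chosen only for illustration.
patient_home = (40.000, -75.000)
provider_sites = {
    "office": (40.700, -74.000),
    "ambulatory surgery": (40.710, -74.010),
    "hospital ward": (40.705, -74.005),
}

distances = {
    name: haversine_km(*patient_home, *coords)
    for name, coords in provider_sites.items()
}

# Difference between the farthest and nearest site, as seen from the patient.
spread_km = max(distances.values()) - min(distances.values())
```

The spread can never exceed the distance between the sites themselves, so for sites this close together the three patient-to-provider distances differ by a negligible amount relative to the trip length.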

Happy to engage more in this debate.