
Newbie Vocabulary import

@Christian_Reich
Well, you wouldn’t have Domain_id, Vocabulary_id, or Concept_class_id in the Concepts table. The relational integrity would be maintained by the Junction table(s). You don’t add a domain into the concepts table, you add it to the concept_domain table, and you can’t add a record to the concept_domain table unless there is a related record in the domain table.

It adds a BUNCH of work to correct all the existing queries, but it’s the right way to model it.
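To sketch the idea (this DDL is illustrative only; the table and constraint names are hypothetical, not part of the CDM spec):

CREATE TABLE omop.concept_domain (
  concept_id INTEGER NOT NULL,
  domain_id  VARCHAR(20) NOT NULL,
  CONSTRAINT fpk_cd_concept FOREIGN KEY (concept_id) REFERENCES omop.concept (concept_id),
  CONSTRAINT fpk_cd_domain  FOREIGN KEY (domain_id)  REFERENCES omop.domain (domain_id)
);

-- A row can be inserted only after both the referenced concept row
-- and the referenced domain row already exist.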

But why not just load and then add the constraints? As you said, the data are coming from Athena (Pallas, really, where they live under the constraints) and you load them only once. After that, it’s read-only. Why would we want to do such a big surgery?

Because @Christian_Reich

  1. It’s the correct way to model M2M relationships

  2. The whole Remove Constraints, Load Data, Add Constraints method is a workaround for an incorrect data model.

  3. You should only modify your database structure when the data model changes. You shouldn’t have to change your database structure in order to load data. That’s like saying you need to take out a wall of your house every time you want to move in furniture. Much better to build the house correctly in the first place.

  4. The Drop/Add method of loading Vocabularies is fine for development, but it is not the right way to operationalize it. Eventually, I’ll want to do some sort of incremental load which recognizes the differences. I can’t do that with the current model.

  5. Right now, I have full control of our OMOP databases, which allows me to Alter tables. However, when it’s operationalized, our IS department is going to have to be involved if it requires structural changes. That’s going to be a huge issue if the Vocabulary is updated multiple times a year.

  6. In my experience, compromises in the data modeling have short-term advantages, but eventually, you’re going to regret it. I think that as OMOP grows in acceptance, these issues are going to resurface, and it would be better to fix it now rather than later.

Hey, Roger, I’m a little confused by this statement.
A domain has many concepts, but a concept does not have many domains. So it’s a Many-to-One relationship between concept and domain.

Your proposal of junction (JCT) tables implies an M2M relationship. But is that actually the case here?

@Chris_Knoll

You’re absolutely right. And maybe that’s the problem, because if you look at the diagram above, and at the SQL statements below:

ALTER TABLE omop.concept
ADD CONSTRAINT fpk_concept_domain
FOREIGN KEY (domain_id)
REFERENCES omop.domain (domain_id);

ALTER TABLE omop.domain
ADD CONSTRAINT fpk_domain_concept
FOREIGN KEY (domain_concept_id)
REFERENCES omop.concept (concept_id);

There is clearly an M2M relationship already physically present in the database.

Maybe the problem is not (as I thought) an incorrect modeling of an M2M relationship; it’s that it’s an incorrect model of a 1:M relationship. As I look at it, there is no reason to have domain_id in the concept table. The domain_id can be derived by joining the concept table to the domain table on concept_id = domain_concept_id.

The same is true of vocabulary_id and concept_class_id.

But that’s what happened in my house: They had to get the big Subzero fridge through the window using a crane, because it wouldn’t fit through the staircase. :slight_smile:

Well, you hit the nail on the head. That is the correct way. Two reasons it isn’t that way now:

  • We added concepts for Domains, Vocabularies, etc. only in V5, after the fact, because folks were complaining about an insufficient information model
  • We don’t like numeric identifiers for reference-type information like domain, vocabulary, concept class, or relationship_id. We had them in V4, and it makes interactive work with the vocab data very difficult (you have to constantly join, or know those IDs by heart)

Wait, wait. You better tell them now. The vocabulary is being updated all the time. You will have to reload, and that’s not an exception, but the norm.

Understood. The compromise has lived for quite some time, now, but there is no reason we would not improve it. Do you want to make a proposal to the CDM WG?

Hey, @roger.carlson,

I really appreciate the discussion we’re having on this, and I’m not making any claim that anyone is right or wrong here, I just wanted to make sure I understood what the model intended to represent, and what your proposal intended to represent.

Not necessarily; it would depend on the structure of the table that is referenced in the foreign key constraint (i.e., is it an entity table or a junction table?).

If we forget about the domain’s dependency back on concept and instead think about an entity ‘foo’ which domain depends on, with concept depending on domain, you wouldn’t have a problem if there were FKs as:

foo <- domain <- concept

where the arrow represents a FOREIGN KEY reference between the tables.

The physical data model would be that in the domain table there’s a foo_id column to represent that many domains have exactly one foo (many-to-one), and in the concept table there’s a domain_id column representing that many concepts have exactly one domain (another many-to-one). So, having a column on the domain and concept entities to capture the many-to-one relationship is, I think, proper. You’d agree? Loading this is easy: you load the foos, you load the domains, you load the concepts.
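In DDL, that chain would look something like this (foo is the made-up entity from above):

CREATE TABLE foo (
  foo_id INTEGER PRIMARY KEY
);

CREATE TABLE domain (
  domain_id VARCHAR(20) PRIMARY KEY,
  foo_id    INTEGER NOT NULL REFERENCES foo (foo_id)            -- many domains -> one foo
);

CREATE TABLE concept (
  concept_id INTEGER PRIMARY KEY,
  domain_id  VARCHAR(20) NOT NULL REFERENCES domain (domain_id) -- many concepts -> one domain
);

-- Load order follows the arrows: foo first, then domain, then concept.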

But things get wonky when we toss out the idea of a domain associated with foo and instead say domains are associated with concepts. That’s the Catch-22 you’re describing: you can’t define the concept without a domain, and you can’t define the domain without the concept.

What is needed is a way to define a concept independent of its dependency on domain, and a domain without its dependency on concept. To do this with the current model, you’d insert the data with NULL in the foreign key column (FKs don’t check nulls) and then do a second pass to associate the domains to their concepts via UPDATE. To do this with the model you propose (where you’ve removed the association between domain and concept from the core table), similar steps are required: insert the concept into concept, insert the domain into domain, and insert the association between concept and domain into the junction table. Your model’s form works better for bulk-load operations: you load concept, load domain, then load the concept_domain associations. And to make it a one-to-many relation, you would put a unique constraint on concept_id in the concept_domain_jct table so that a concept can appear only once in it, and therefore a concept can have an association to only a single domain.
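Sketched out (the concept_id value and ‘Drug’ are just example data, and the first case assumes domain_id is nullable during the load):

-- Current model: load with NULL in the FK column, then a second pass
INSERT INTO omop.concept (concept_id, concept_name, domain_id)
VALUES (12345, 'Some example concept', NULL);   -- FKs don't check NULLs

UPDATE omop.concept
SET domain_id = 'Drug'
WHERE concept_id = 12345;

-- Proposed model: restrict the junction table to 1:M
ALTER TABLE omop.concept_domain_jct
ADD CONSTRAINT uq_cdj_concept UNIQUE (concept_id);
-- each concept can appear only once, i.e. it maps to exactly one domain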

However, the junction form suffers a run-time penalty from the additional joins. The benefit of the FK in the current form is to assist with join performance: the query optimizer can make some judgments about which tables to join first if it knows that an FK exists between join columns. That’s what we have today with the FK between domain and concept. The downside is you need to either drop constraints as you load data, or load data with nulls and do a two-pass update to associate domains to concepts.

Yes, for the same wonky reason: the vocabulary entity associates back to concept, but you can’t have a concept without an association (one-to-many) to a vocabulary.

So, in conclusion: the circular foreign key dependency between domain and concept is awkward, but from the context of how one-to-many relationships should be modeled, I think it is proper. It’s just the added wrinkle that the one-to-many relationships depend on each other that makes it awkward to manipulate. Certain DBMSs (like Oracle) support the notion of ‘deferred constraints’, allowing you to update both tables in a transaction with the constraint checked afterward, as sketched below. SQL Server is not one of those platforms, so you’d drop your constraints before the load.
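On a platform that supports it (Oracle, and PostgreSQL as well), the declaration and load would look roughly like this:

ALTER TABLE omop.concept
ADD CONSTRAINT fpk_concept_domain
FOREIGN KEY (domain_id)
REFERENCES omop.domain (domain_id)
DEFERRABLE INITIALLY DEFERRED;

BEGIN;
-- insert into concept and domain here, in either order
COMMIT;  -- the deferred FK is checked only at commit time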

In your case, I don’t see reloading vocabulary tables as an ‘online’ operation: you can’t have queries running through the system while the vocab tables are being rebuilt (if you are dropping them and bulk loading). I feel you should go with the option of disabling the constraints during loading; this is not a complex operation in SQL Server, as sketched below. I’m sorry if that doesn’t match your vision, but refactoring those tables into a form that gives you an easier time bulk-loading, at the cost of additional joins and potential query-optimization loss, isn’t a trade-off I’d really favor.
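In SQL Server, the disable/re-enable cycle is just a couple of statements (using the constraint names from above):

ALTER TABLE omop.concept NOCHECK CONSTRAINT fpk_concept_domain;
ALTER TABLE omop.domain NOCHECK CONSTRAINT fpk_domain_concept;

-- bulk load the vocabulary tables here

ALTER TABLE omop.concept WITH CHECK CHECK CONSTRAINT fpk_concept_domain;
ALTER TABLE omop.domain WITH CHECK CHECK CONSTRAINT fpk_domain_concept;
-- WITH CHECK makes SQL Server re-validate the existing rows on re-enable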

Even though this is a cycle of table references, are there really cycles in this data? I presumed domain_concept_id is the concept of the domain, so it’s just going up the tree in a hierarchy. That the hierarchy is spread over two separate tables and bounces back and forth makes things a bit more fun, but hierarchies can be like that.

Perhaps, since this is a loading issue, what would be helpful is a vocabulary synchronization program which sees what’s there, reads in what is supposed to be there, and then, based upon the row-level dependencies, creates the necessary transaction (interleaving rows from both tables as necessary) so that the terminology tables match what they are declared to be. Anyway, I don’t see this as a hard problem.
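As a rough sketch of that idea in PostgreSQL, assuming staging tables (e.g. domain_stage) already loaded from the source files and FKs declared DEFERRABLE as mentioned earlier:

BEGIN;
SET CONSTRAINTS ALL DEFERRED;

-- upsert rows that are new or changed
INSERT INTO omop.domain (domain_id, domain_name, domain_concept_id)
SELECT domain_id, domain_name, domain_concept_id FROM domain_stage
ON CONFLICT (domain_id) DO UPDATE
SET domain_name = EXCLUDED.domain_name,
    domain_concept_id = EXCLUDED.domain_concept_id;

-- (same pattern for concept, vocabulary, concept_class, ...)

-- remove rows that are no longer declared
DELETE FROM omop.domain d
WHERE NOT EXISTS (SELECT 1 FROM domain_stage s WHERE s.domain_id = d.domain_id);

COMMIT;  -- all deferred FKs are checked here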

Currently, I have Vocabulary data that includes the dm+d terminology loaded in Postgres.
When I created the constraints, I got an error like this:
“(dm+d) is not present in xx_table.”
My plan is to import new vocabulary data from Athena to resolve that error, and what I’ve been considering is whether I also need to reload all the clinical data that references the concept table (condition_occurrence, drug_exposure, measurement, procedure_occurrence, etc.).
If the standardized vocabularies are replaced with new ones, do I have to reload everything in the CDM?

@Chris_Knoll @Christian_Reich,

Thanks for taking the time to address this with me. It’s very helpful that I can work through issues like this out loud. I’m not trying to position myself as an expert. I’ve been dealing with relational models for 25 years. I’ve only been involved with OMOP for a year. I ran into similar cognitive dissonance when I had to learn dimensional modeling. Some things are allowed (are even encouraged) in analytical processing that are not allowed in transactional processing.

What you’ve both said makes a lot of sense, even though it still offends my sensibilities. I’m glad to learn that there’s a rationale for it. Given that, I really don’t have a viable alternative model at this time.

There are 3 possible avenues I’m considering going forward:

  1. Continue to remove then re-add constraints, as I’ve been doing.

  2. Remove only some of the problematic constraints and load the data in a particular order and then re-add them.

  3. Permanently remove one of each of the circular foreign key constraints and rely on OHDSI to build the vocabularies correctly.

I’m not seeing a huge downside to #3. Can you think of any?


Hey, @roger.carlson, it’s always good to discuss these considerations with experienced professionals such as yourself, and I share your impression that it isn’t what I’d like to see in a data model. But at least there is some reasoning behind it, which is a relief.

#3 seems viable to me. There’s always the option to create a view conforming to the ‘official’ CDM vocabulary schema spec, built off an underlying data model which suits your own preferences. So, what would have to happen on your side is: you take the single concept.csv file and split it into the entity and junction tables that can have the constraints you favor, and then create the view that maps those junction tables into the form that the CDM vocabulary schema spec calls for. Then the standardized queries continue to work, and your loading operations don’t require any constraints to be dropped. You may also be able to test relative performance.
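For instance (the vocab schema and junction-table names here are hypothetical):

CREATE VIEW omop.concept AS
SELECT c.concept_id,
       c.concept_name,
       cd.domain_id,
       cv.vocabulary_id,
       cc.concept_class_id,
       c.standard_concept,
       c.concept_code,
       c.valid_start_date,
       c.valid_end_date,
       c.invalid_reason
FROM vocab.concept c
JOIN vocab.concept_domain_jct cd ON cd.concept_id = c.concept_id
JOIN vocab.concept_vocabulary_jct cv ON cv.concept_id = c.concept_id
JOIN vocab.concept_class_jct cc ON cc.concept_id = c.concept_id;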

But if you can live with #3, that’s a lot less work for you.

-Chris


Just to ask,

Does this still work?

I also wanted to have a vocabulary list in our local database, and it is also PostgreSQL. It seems what I should do is:

(1) Download (perhaps as many as possible) the vocabularies from Athena
(2) Follow the instructions above

And it is done (???)


The reason I ask is that I found another tutorial on the OHDSI wiki which builds the OMOP vocabulary list in Oracle 12. That tutorial is soooo long it makes me want to give up.

Compared to that Oracle 12 tutorial, the PostgreSQL one seems too good to be true. So much easier.

Hello @lychenus,

This is a) way out of date and b) not necessary, because the vocabularies are built by the OMOP Vocabulary team (led by @mik) and available for everyone to download. So you can just download files from Athena (requires an account, free to create) and load them into the DB of your choice.
Vocabularies in Athena are updated on a regular basis; you can track the updates here: Releases · OHDSI/Vocabulary-v5.0 · GitHub.
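For PostgreSQL, the load itself can be done with psql’s \copy; something like this (the Athena files are tab-delimited despite the .csv extension, and QUOTE E'\b' effectively disables quote handling):

\copy concept FROM 'CONCEPT.csv' WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b'
\copy domain FROM 'DOMAIN.csv' WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b'
-- ...and so on for the remaining vocabulary files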

Hi @lychenus!
the link to Athena that you have used is not the best one. The production server download page is here.
On this download page (you need to log in), pick the vocabularies that you need (in most cases it is a good idea to keep the ones that are marked as default when you enter the page). Athena will create a download bundle for you with the most current vocabularies.
Then use the VocabImport folder (here is the one for PostgreSQL) and run the import into your database using the scripts (or take them as inspiration for your own script). You can probably also use other import tools.
If you still have questions, don’t be shy, let us know. Maybe it is time we build an up-to-date vocabulary import description at some prominent place, and you could be the one giving us the right input.

… and thanks, @rookie_crewkie , for stepping in.

Btw, following the PostgreSQL tutorial, I am on to step #4:

Execute the script OMOP CDM postgresql pk indexes.txt to add the minimum set of indexes and primary keys we recommend.

Just interested: how long does it take to run all the queries at once? I tried to do that in pgAdmin 4, and the GUI just sort of ‘froze’ with the circle spinning. Do I have to use a CLI for that, and in your experience, how long does it take?

Or should I run the queries line by line?

I have tried some of them line by line. Some lines are instant, but some lines take 5 minutes to run.

Thanks
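(For reference, running the whole script from the command line rather than pgAdmin looks something like this; the user and database names are placeholders:)

psql -U ohdsi_admin -d omop -f "OMOP CDM postgresql pk indexes.txt"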

The VocabImport folder is no longer available in the OHDSI CDM GitHub repo. Can you help with that, please? I am trying to load CDM v5 vocabulary data into Postgres.

Can you explain where you are in the process? Have you created the vocabulary tables? Have you downloaded the vocabulary data from Athena? What is holding you up? What documentation are you looking at for loading the vocabulary?

I have already downloaded the vocabulary data from Athena in CSV format. I have created the tables in Postgres using scripts from this repository: https://github.com/OHDSI/CommonDataModel/blob/v5.4.0/inst/ddl/5.4/postgresql/OMOPCDM_postgresql_5.4_ddl.sql. I am getting errors like “missing data in column” when importing the data into the tables. I am not using any documentation.

Sorry to keep asking questions rather than providing answers, but can you give an example of the errors you are seeing when trying to load the vocabulary tables?
