
Why are there duplicate concept_ids in CONCEPT.csv?

(Proust Marcel) #1

I downloaded vocabularies from Athena and I am trying to understand the data model by studying the data in the CONCEPT.csv file.
As far as I know, the concept_id is supposed to be “a unique identifier for each Concept across all domains”, so why are there duplicate concept_ids for different concepts?

cat …/refer_all/CONCEPT_sort.csv | grep 73574 | awk '$1="73574" {print $0}'

And by the way, why is the CSV file not comma-separated? It is difficult to parse the data and insert it into my DB.
Thanks for your help.

(Proust Marcel) #2

It seems something is wrong with my grep syntax; I had better investigate further. Thanks

(Christian Reich) #3


Good name. But maybe not for a man full of action. :slight_smile:

Why don’t you look at the description or put it into a database using the DDL of the various flavors, rather than reverse engineering the table?

(Michael Kallfelz) #4

Welcome @Proust_Marcel !
You have probably found out by now that your finding of duplicate concept IDs was not entirely correct. The first one was indeed 73574, but the next one merely contained it as a substring: it was 973574.
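The difference between a substring match, which is what grep performs, and an exact match on the first column can be sketched in Python. The sample rows are made up for illustration (only the ids 73574 and 973574 come from this thread), and the helper names are hypothetical. Note also that in awk, `$1="73574"` is an assignment rather than a comparison (`==`), so it overwrites the first field of every line that grep lets through, which would make different concepts appear to share one id.

```python
# Hypothetical sample rows in a tab-delimited layout; only the two
# concept_ids come from this thread, the other fields are placeholders.
SAMPLE_ROWS = [
    "73574\tconcept A\tdomain X",
    "973574\tconcept B\tdomain Y",
]


def substring_matches(rows, needle):
    """Mimic `grep needle`: keep every row that contains the text anywhere."""
    return [row for row in rows if needle in row]


def exact_id_matches(rows, concept_id):
    """Keep only rows whose first tab-separated field equals concept_id."""
    return [row for row in rows if row.split("\t", 1)[0] == concept_id]


if __name__ == "__main__":
    # The substring search also picks up 973574.
    print(substring_matches(SAMPLE_ROWS, "73574"))
    # The exact comparison on the first field keeps only 73574.
    print(exact_id_matches(SAMPLE_ROWS, "73574"))
```

Comparing the whole first field, rather than searching the line, is what rules out ids that merely contain the digits being looked for.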
I go along with @Christian_Reich in suggesting to first explore the data model by using the documentation provided.
Whenever you find something in the vocabularies that smells fishy, cannot be explained by the existing documentation, and is not already mentioned in the vocabulary forum posts or as an open issue on GitHub, please do not hesitate to create a forum post (or a GitHub issue right away) to let us know!
Cheers ~ Mik

(Proust Marcel) #5

Thank you. Actually, I have read the documentation several times and need to read it more to understand it better. In our case, we are using MySQL, and some additional columns need to be added. I am trying to figure out the best way to use the OHDSI system.

Another question: the concept_ancestor table has more than 50,000,000 entries. Do I need to split it into several tables?

(Maxim Moinat) #6

Typically, all these records are contained in one table. If you are very interested in optimizing queries on the concept_ancestor table, you might consider partitioning it. I do not have an example of this though.

I agree, the .csv extension on files that are not comma-separated is confusing! The solution is simple though: the files should have the .tsv extension. @Christian_Reich @mik Is there any willingness in the vocab team to change this? Or is this something we are just stuck with so as not to break anything?
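Until the extension changes, the files can still be parsed by telling the reader that the delimiter is a tab. Below is a minimal sketch using Python's standard csv module; the function name, the two-column sample, and the choice to disable quote handling are assumptions for illustration, not something prescribed by the vocabulary team.

```python
import csv
import io


def read_tab_delimited(handle):
    """Yield each row as a dict keyed by the column names in the header line."""
    # QUOTE_NONE treats quote characters as ordinary data. This is an
    # assumption that suits files whose fields are not wrapped in quotes.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


if __name__ == "__main__":
    # Hypothetical sample standing in for a real vocabulary file.
    sample = io.StringIO(
        "concept_id\tconcept_name\n"
        '73574\tname with "quotes", and a comma\n'
    )
    for row in read_tab_delimited(sample):
        print(row["concept_id"], row["concept_name"])
```

For a real file, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`; the encoding is likewise an assumption to check against the actual download.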