Why are there duplicate concept_ids in CONCEPT.csv?

Proust_Marcel · December 23, 2020, 2:32am

I downloaded vocabularies from Athena and I am trying to understand the data model by studying the data in the CONCEPT.csv file.
As far as I know, the concept_id should be “A unique identifier for each Concept across all domains.” So why are there duplicate concept_ids for different concepts?

cat …/refer_all/CONCEPT_sort.csv | grep 73574 | awk '$1="73574" {print $0}'

And by the way, why is the CSV file not comma-separated? That makes it difficult to parse the data and insert it into my DB.
Thanks for your help.

Proust_Marcel · December 23, 2020, 3:08am

It seems something is wrong with my grep syntax; I’d better investigate further. Thanks

Christian_Reich · December 23, 2020, 12:30pm

@Proust_Marcel:

Good name. But maybe not for a man full of action.

Why don’t you look at the description or put it into a database using the DDL of the various flavors, rather than reverse engineering the table?

mik · December 23, 2020, 1:45pm

Welcome @Proust_Marcel !
By now you have probably found out that your finding of duplicate concept IDs was not entirely correct. The first one was indeed 73574, but the next one was 973574, which merely contains 73574 as a substring.
I go along with @Christian_Reich in suggesting to first explore the data model by using the documentation provided.
Whenever you find something in the vocabularies that smells fishy, cannot be explained by the existing documentation, and is not already mentioned in the vocabulary forum posts or as an open GitHub issue, please do not hesitate to create a forum post (or a GitHub issue right away) to let us know!
Cheers ~ Mik
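
To illustrate the substring issue, here is a minimal Python sketch. It assumes the file is tab-separated with concept_id in the first column; the function name and the sample rows are hypothetical and not taken from the real vocabulary. As a side note, in awk `$1="73574"` is an assignment rather than a comparison (`==`), so it overwrites the first field of every line that grep lets through, which can make different rows look like duplicates.

```python
import csv
import io

def rows_with_concept_id(lines, concept_id):
    """Return rows whose first tab-separated field equals concept_id exactly."""
    # QUOTE_NONE is an assumption: treat quote characters as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row and row[0] == str(concept_id)]

# Hypothetical sample rows (not real vocabulary content).
sample = "concept_id\tconcept_name\n73574\tExample A\n973574\tExample B\n"

# A substring search, like a plain grep, also matches 973574.
substring_hits = [line for line in sample.splitlines() if "73574" in line]

# An exact comparison on the first field matches only 73574.
exact_hits = rows_with_concept_id(io.StringIO(sample), 73574)

print(len(substring_hits), len(exact_hits))  # prints "2 1"
```

The same idea applies in the shell: compare the whole first field instead of searching for a substring anywhere in the line.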

Proust_Marcel · December 24, 2020, 10:52am

Thank you. Actually, I have read the documentation several times, and I need to read it more to understand it better. In my case, we are using MySQL, and some additional columns need to be added. I am trying to figure out the best way to use the OHDSI system.

Another question: the concept_ancestor table has more than 50 million entries. Do I need to split it into several tables?

MaximMoinat · December 24, 2020, 10:44am

Typically, all these records are contained in one table. If you are very interested in optimizing queries on the concept_ancestor table, you might consider partitioning it. I do not have an example of this, though.

I agree, it is confusing that the .csv files are not comma-separated! The solution is simple, though: the files should have the .tsv extension. @Christian_Reich @mik Is there any willingness in the vocab team to change this? Or is this something we are just stuck with so as not to break anything?
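
In the meantime, the files can be parsed by telling the reader to split on tabs. Below is a minimal Python sketch under several assumptions: the file is tab-separated with a header row containing concept_id and concept_name, quote characters are plain text, and sqlite3 is used only as a stand-in for a real database such as MySQL. The function name, table layout, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

def load_concepts(lines, conn):
    """Insert (concept_id, concept_name) pairs read from tab-separated lines."""
    # delimiter="\t" splits on tabs; QUOTE_NONE (an assumption) keeps quotes literal.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [(int(r["concept_id"]), r["concept_name"]) for r in reader]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept "
        "(concept_id INTEGER PRIMARY KEY, concept_name TEXT)"
    )
    conn.executemany("INSERT INTO concept VALUES (?, ?)", rows)
    return len(rows)

# Hypothetical sample rows; names with commas and quotes survive intact.
sample = (
    "concept_id\tconcept_name\n"
    '1\tHypothetical "quoted" name\n'
    "2\tAnother name, with a comma\n"
)

conn = sqlite3.connect(":memory:")
print(load_concepts(io.StringIO(sample), conn))  # prints 2
```

For a real file, the `io.StringIO(sample)` argument would be replaced by something like `open("CONCEPT.csv", newline="", encoding="utf-8")`, where the encoding is again an assumption.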