OHDSI Home | Forums | Wiki | Github

NDC Searching in Athena

I have an Rx claims dataset where I cannot match 75% of the 11-digit NDCs to the OMOP CDM when I try to use a simple join in SQL.

However, sometimes I search in Athena and find the NDC.

Here is one example where I searched ‘83730000626’ in Athena and found a concept with code ‘08373000626’. I’m confused by this example, because it’s not as simple as matching the first 9 digits in my search to the CDM concept.

Is this a matter of taking 0’s in selective places and placing them in other locations to improve the match? Is there an algorithm for this that Athena uses?

You should use the full 11 digit NDC for matching.
SELECT * from Concept where concept_code = ‘83730000626’

Thank you, that returns no results when I try that in SQL on the concept table. But it returns a result when I type it into Athena. My question is why.

The only possible reason is that your concept table is different from Athena. The above select works on my version of the concept table.
SELECT count(*)
FROM concept
WHERE vocabulary_id = ‘NDC’; Returns 1,079,362
What to you get?

I get 1,055,004 when I run that.

Did you run the query you suggested - SELECT * from Concept where concept_code = ‘83730000626’

What do you get?

My misunderstanding. There is no concept for the NDC code ‘83730000626’. I do not know how Athena works when there is no exact match, but Athena returned the best match it could find, which is the concept code ‘08373000626’. Someone more familiar with the Athena query is going to have to answer your question.

The github page tells you a bit about the search algorithms. Athena uses solr as the backbone for its search logic. There is a “Weight distribution” that follows this sequence: Concept name / Concept code / Synonyms / Concept ID.

Do you have the drug name associated with your NDC = ‘83730000626’? Does it match the drug name for NDC code = ‘08373000626’? If the names don’t match or you don’t have the name, then I wouldn’t assume these are the same drugs. NDC codes are notoriously dirty. You might want to talk with the person who provided the dataset.

Yeah. Athena doesn’t have the 83730000626. It tries to guess how you might have mistyped it (kind of like Google does with its “Do you mean…”)? We need to make the distinction between exact match and wild guess better. 083730000626, even though similar, is almost certainly something completely different.