Problem of the total number of observation in RxNorm vocabulary

Inho · December 16, 2021, 8:30am

Hello all,

I am checking RxNorm and RxNorm Extension vocabularies to compare with my list of drugs.
I got those lists from ATHENA and the site of ATHENA showed that the number of RxNorma and RxNorm Extension are 300,300 and 2,091,918, respectively.

I tried to import each CONCEPT.csv file in R, but it prints a different output.

> fileName <- "./CONCEPT.csv"
> RxNorm1 <- read.csv(fileName, header=TRUE, sep="\t")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
> RxNorm2 <- read.csv(fileName, header=TRUE)
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
> str(RxNorm1)
'data.frame':	86565 obs. of  10 variables:
 $ concept_id      : int  1146945 1146954 1147044 756315 756316 756317 756318 756319 756320 756321 ...
 $ concept_name    : chr  "concept.concept_id" "concept.invalid_reason" "observation_period.observation_period_id" "metadata.metadata_type_concept_id" ...
 $ domain_id       : chr  "Metadata" "Metadata" "Metadata" "Metadata" ...
 $ vocabulary_id   : chr  "CDM" "CDM" "CDM" "CDM" ...
 $ concept_class_id: chr  "Field" "Field" "Field" "Field" ...
 $ standard_concept: chr  "S" "S" "S" "S" ...
 $ concept_code    : chr  "CDM1" "CDM10" "CDM100" "CDM1000" ...
 $ valid_start_date: int  20141111 20141111 20141111 20210925 20210925 20210925 20210925 20210925 20210925 20210925 ...
 $ valid_end_date  : int  20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 ...
 $ invalid_reason  : chr  "" "" "" "" ...
> str(RxNorm2)
'data.frame':	86637 obs. of  1 variable:
 $ concept_id.concept_name.domain_id.vocabulary_id.concept_class_id.standard_concept.concept_code.valid_start_date.valid_end_date.invalid_reason: chr  "1146945\tconcept.concept_id\tMetadata\tCDM\tField\tS\tCDM1\t20141111\t20991231\t" "1146954\tconcept.invalid_reason\tMetadata\tCDM\tField\tS\tCDM10\t20141111\t20991231\t" "1147044\tobservation_period.observation_period_id\tMetadata\tCDM\tField\tS\tCDM100\t20141111\t20991231\t" "756315\tmetadata.metadata_type_concept_id\tMetadata\tCDM\tField\tS\tCDM1000\t20210925\t20991231\t" ...

RxNorm1 is executed with sep="\t", but RxNorm2 is not.
You can see the different numbers of its observation.
RxNorm1 : 86,565
RxNorm2 : 86,637

My questions are:

Why the number between ATHENA site and R result is different?
Why do I get the different numbers from the same raw data in R according to sep option?

And, I also want to know the right R code to import RxNorm data precisely.

Thank you in advance!

Eduard_Korchmar · December 16, 2021, 9:36am

Hello! The warning message “EOF within quoted string” could hint at the problem. It is possible that one of the characters is treated as end of file and cuts the imported file short. Could you please share what is the last entry in the dataframe?

Inho · December 17, 2021, 1:36am

@Eduard_Korchmar

Thank you for your answer.
I am a beginner in R, so I could not catch the meaning of the error message.
I searched how to modify my original code according to your answer, and I found the solution.
I added an option, quote="", then the error message did not be displayed.

But I still have curiosity why the numbers of observations in each data frame are different.

I attached the tail of the data frame.
In the case of no option(=RxNorm2), tab delimiter(\t) stays in the value, but it has more observations than the data of the tab option.

RxNorm1 seems perfect, but I cannot go to the next step due to the RxNorm2 problem.
Please let me know if I could not understand the vocabulary data.

I attached my result of R code as below:

> fileName <- "./CONCEPT.csv"
> RxNorm1 <- read.csv(fileName, header=TRUE, sep="\t", quote="")
> RxNorm2 <- read.csv(fileName, header=TRUE, quote="")

> str(RxNorm1)
'data.frame':	508112 obs. of  10 variables:
 $ concept_id      : int  1146945 1146954 1147044 756315 756316 756317 756318 756319 756320 756321 ...
 $ concept_name    : chr  "concept.concept_id" "concept.invalid_reason" "observation_period.observation_period_id" "metadata.metadata_type_concept_id" ...
 $ domain_id       : chr  "Metadata" "Metadata" "Metadata" "Metadata" ...
 $ vocabulary_id   : chr  "CDM" "CDM" "CDM" "CDM" ...
 $ concept_class_id: chr  "Field" "Field" "Field" "Field" ...
 $ standard_concept: chr  "S" "S" "S" "S" ...
 $ concept_code    : chr  "CDM1" "CDM10" "CDM100" "CDM1000" ...
 $ valid_start_date: int  20141111 20141111 20141111 20210925 20210925 20210925 20210925 20210925 20210925 20210925 ...
 $ valid_end_date  : int  20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 20991231 ...
 $ invalid_reason  : chr  "" "" "" "" ...

> str(RxNorm2)
'data.frame':	519476 obs. of  1 variable:
 $ concept_id.concept_name.domain_id.vocabulary_id.concept_class_id.standard_concept.concept_code.valid_start_date.valid_end_date.invalid_reason: chr  "1146945\tconcept.concept_id\tMetadata\tCDM\tField\tS\tCDM1\t20141111\t20991231\t" "1146954\tconcept.invalid_reason\tMetadata\tCDM\tField\tS\tCDM10\t20141111\t20991231\t" "1147044\tobservation_period.observation_period_id\tMetadata\tCDM\tField\tS\tCDM100\t20141111\t20991231\t" "756315\tmetadata.metadata_type_concept_id\tMetadata\tCDM\tField\tS\tCDM1000\t20210925\t20991231\t" ...

> tail(RxNorm1)
       concept_id                                                             concept_name domain_id vocabulary_id concept_class_id
508107      32916                      The Clinical Knowledgebase (The Jackson Laboratory)  Metadata    Vocabulary       Vocabulary
508108      32917                                NCI Thesaurus (National Cancer Institute)  Metadata    Vocabulary       Vocabulary
508109      32771      CAP electronic Cancer Checklists (College of American Pathologists)  Metadata    Vocabulary       Vocabulary
508110      32928 International Classification of Diseases, Tenth Revision, German Edition  Metadata    Vocabulary       Vocabulary
508111      32929                                    Diagnostic Modifiers of Cancer (OMOP)  Metadata    Vocabulary       Vocabulary
508112      32999                                            Oncology Knowledge Base (MSK)  Metadata    Vocabulary       Vocabulary
       standard_concept   concept_code valid_start_date valid_end_date invalid_reason
508107                  OMOP generated         19700101       20991231               
508108                  OMOP generated         19700101       20991231               
508109                  OMOP generated         19700101       20991231               
508110                  OMOP generated         19700101       20991231               
508111                  OMOP generated         19700101       20991231               
508112                  OMOP generated         19700101       20991231               

> tail(RxNorm2)
              concept_id.concept_name.domain_id.vocabulary_id.concept_class_id.standard_concept.concept_code.valid_start_date.valid_end_date.invalid_reason
519471 32771\tCAP electronic Cancer Checklists (College of American Pathologists)\tMetadata\tVocabulary\tVocabulary\t\tOMOP generated\t19700101\t20991231\t
519472                                                                                                      32928\tInternational Classification of Diseases
519473                                                                                                                                       Tenth Revision
519474                                                             German Edition\tMetadata\tVocabulary\tVocabulary\t\tOMOP generated\t19700101\t20991231\t
519475                               32929\tDiagnostic Modifiers of Cancer (OMOP)\tMetadata\tVocabulary\tVocabulary\t\tOMOP generated\t19700101\t20991231\t
519476                                       32999\tOncology Knowledge Base (MSK)\tMetadata\tVocabulary\tVocabulary\t\tOMOP generated\t19700101\t20991231\t

Eduard_Korchmar · December 17, 2021, 8:18am

According to R documentation, if no value is provided to sep parameter, R treats every flavor of whitespace character as separator, which by the way is really weird behaviour. It is not impossible that some concept names from different sources contain different whitespace characters naturally, as OMOP takes different vocabularies from many sources in. Best practice would be to always specify the separator.

In any case, there is still too few of rows to contain RxNorm or RxNorm Extension. How many rows are in the CONCEPT.CSV file? Did you include RxNorm and RxNorm Extension vocabularies in the download request on this page?

Inho · December 19, 2021, 11:31pm

@Eduard_Korchmar

Under the well-separated variable condition, I import that RxNorm has 508,112 obs and RxNormExtension has 2,299,730 obs.
I downloaded these datasets from the page you referred to. However, as I first asked the Athena page shows the total number of obs are 300,300 and 2,091,918 obs, respectively.

I still cannot know the genuine number of observations of RxNorm and RxNormExtension, but I have checked all of my drug datasets is included the vocabulary of RxNorm and RxNormExtension.

Some error is cleared, but I can say the problem has not been fully worked out.

zhuk · March 16, 2022, 7:05am

Hello, @Inho

Does the problem persist?