OHDSI Home | Forums | Wiki | Github

What is the recommended way to load vocabulary into Postgres?

Hi folks,

This may be very basic question but what is the recommended way to load vocabulary data from the csv files downloaded from athena.ohdsi.org into a postgres database?

I know there was a sql script created by Lee Evans using ‘copy from’ in 2014. Is there any other/better /recommended way to do it?

Thanks,
Ray Zohdsi

You can use the Data Description Language (DDL) scrips to create the tables correctly (https://github.com/OHDSI/CommonDataModel/tree/main#readme). From there, loading the vocabularies from Athena should be a mere matter of inserting rows on the tables… although the CSV files can be big.

The ETL-Synthea repo (https://github.com/OHDSI/ETL-Synthea) has a function to load those CSV files (LoadVocabFromCsv()), but I wonder if the same could be accomplished with import features from general-purpose SQL clients, e.g. Import/Export Data Dialog — pgAdmin 4 6.7 documentation.

Kind regards.

Thank you so much, Fabrício! This is what I have been looking for.

Ray

I use ETL-Synthea to import and got the warning message:

Warning messages:
1: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE,  :
  Found and resolved improper quoting out-of-sample. First healed line 853636: <<2614984        "y set" tubing for peritoneal dialysis  Device  HCPCS   HCPCS   S       A4719   20020101        20991231        >>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE,  :
  Found and resolved improper quoting out-of-sample. First healed line 115727: <<44833612       "ventilation" pneumonit 4180186>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
>

The function data.table::fread() is having trouble with quotes. Disabling them (quote = “”) should fix. The vocabulary files do not use quoting.

Sorry Fabricio,
I thought I knew how to use it but I don’t. How to pass the quote="" to disable them?

Thanks,
Ray Z.

Zohdsi,

Truth is, I confused myself in the answer too, sorry. I was talking about fread() instead of LoadVocabFromCsv().

What worked for me was to copy the code from LoadVocabFromCsv(). If you just enter this in the R console:

> ETLSyntheaBuilder::LoadVocabFromCsv

You get the code of the function:

function (connectionDetails, cdmSchema, vocabFileLoc, bulkLoad = FALSE)
{
csvList ← c(“concept.csv”, “vocabulary.csv”,
“concept_ancestor.csv”, “concept_relationship.csv”,
“relationship.csv”, “concept_synonym.csv”,
“domain.csv”, “concept_class.csv”, “drug_strength.csv”)
fileList ← list.files(vocabFileLoc)
fileList ← fileList[which(tolower(fileList) %in% csvList)]
conn ← DatabaseConnector::connect(connectionDetails)
for (csv in fileList) {
vocabTable ← data.table::fread(file = paste0(vocabFileLoc,
“/”, csv), stringsAsFactors = FALSE, header = TRUE,
sep = “\t”, na.strings = “”)
vocabTable ← as.data.frame(vocabTable)
if (tolower(csv) == “concept.csv” || tolower(csv) ==
“concept_relationship.csv” || tolower(csv) ==
“drug_strength.csv”) {
vocabTable$valid_start_date ← as.Date(as.character(vocabTable$valid_start_date),
“%Y%m%d”)
vocabTable$valid_end_date ← as.Date(as.character(vocabTable$valid_end_date),
“%Y%m%d”)
}
writeLines(paste0("Loading: ", csv))
suppressWarnings({
DatabaseConnector::insertTable(conn, tableName = paste0(cdmSchema,
“.”, strsplit(csv, “[.]”)[[1]][1]),
data = as.data.frame(vocabTable), dropTableIfExists = FALSE,
createTable = FALSE, useMppBulkLoad = bulkLoad,
progressBar = TRUE)
})
}
on.exit(DatabaseConnector::disconnect(conn))
}
<bytecode: 0x000001eacefcb118>
<environment: namespace:ETLSyntheaBuilder>

So you can copy the code to create your own function equal to LoadVocabFromCsv(), but with quote = "" in the call to data.table::fread().

Hope that works for you.

Kind regards.

This works too:

t