
Seeking Assistance: Loading Vocabularies from Athena in Bulk - Character Length Error

I’m currently facing an issue while attempting to bulk load two tables, concept and concept_synonym, downloaded from Athena, using the default pgAdmin data loader. Specifically, I’m encountering a “character too long” error, and I’m seeking your guidance and expertise to overcome this obstacle. Allow me to provide more details about the problem I’m experiencing.

When trying to load these tables, it appears that some rows (possibly from certain vocabularies) are not separated by tabs (\t) as expected. As a result, the data loader throws a “character too long” error, indicating that a value exceeds the maximum length allowed for its column.

I have conducted some initial research and investigation on my own, but I haven’t been able to find a suitable solution yet. Therefore, I’m turning to this knowledgeable and supportive community for any advice, insights, or alternative approaches you might have.



Hi Jason,

Welcome to the community! I faced this issue as well when I began my own experiments/development journey. I found that the standard database admin tools, such as SSMS for MS SQL Server and pgAdmin 4, did not handle these large vocabulary imports well at all.

So I turned to DBeaver (https://dbeaver.io/). It has an import wizard that lets you specify the delimiter as well as escape characters, and I used it successfully for some time.

However, the best tool for loading the vocabulary data is actually Apache Solr. You can also look for vocab loading scripts in the OHDSI GitHub repositories.
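If you do want to stay with plain Postgres, a commonly used workaround is to load the files with psql’s \copy, spelling out the tab delimiter and pointing QUOTE at a character that never appears in the data, so stray double quotes inside concept names and synonyms do not merge fields and trip the length check. Here is a minimal sketch, assuming the unzipped Athena files sit in the working directory and the CDM tables live in a schema called cdm (both assumptions, adjust to your setup):

    -- CONCEPT.csv and CONCEPT_SYNONYM.csv from Athena are tab-delimited;
    -- QUOTE E'\b' (backspace) picks a character that never occurs in the data,
    -- so embedded double quotes are loaded as ordinary text instead of
    -- starting a quoted field that swallows the rest of the line.
    \copy cdm.concept FROM 'CONCEPT.csv' WITH (FORMAT csv, DELIMITER E'\t', HEADER, QUOTE E'\b', NULL '')
    \copy cdm.concept_synonym FROM 'CONCEPT_SYNONYM.csv' WITH (FORMAT csv, DELIMITER E'\t', HEADER, QUOTE E'\b', NULL '')

The same idea (tab delimiter plus a quote character that cannot occur in the data) is what the DBeaver wizard settings above are doing for you.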

If you are familiar with Docker containers, please use Broadsea 3.0, which has an Apache Solr profile to load the vocab files; instructions are here: https://github.com/OHDSI/Broadsea

If you need further assistance, you can visit our Discord channel: https://discord.gg/xABFWShJYx

-S
