@dantohe - welcome to OHDSI!
With Truven MarkeScan, being one of the most popular data set, you have multiple choices in OHDSI:
- Use one of the open source ETL code sets, contributed by Janssen:
It is a great code base and ETL specs. You will be at the mercy of that fantastic team to share their code into OHDSI (which they do quite regularly). Based, on my knowledge, it is based on Microsoft SSIS platform but it might have changed.
- Use one of the commercial vendors (including Odysseus where I am at) that do it for living. They can do the Truven MarkeScan OMOP CDM conversion for you using the latest and greatest Truven MarketScan ETL scripts, including support for OMOP CDM v5.3, latest THEMIS business rules, Truven specs and OMOP CDM vocabularies etc… And btw, it is Hadoop / Spark based and can be easily plugged into any Hadoop based platform, including Amazon AWS EMR or Cloudera.
The Truven MarketScan data set, while being one of the most popular, is also one of the biggest, so not only some needs to know the details of OMOP CDM, OMOP Standardized Vocabs and various business rules (look up THEMIS) but also know how to handle a large data set like that, including keeping it up to date.
You get the idea, happy to discuss and provide more hints. Good luck!
@dantohe - welcome to OHDSI!
Truven, being one of the most popular data set, you have multiple choices in OHDSI:
- Use one of the open source code sets, contributed by the Janssen team:
It is a great code base and ETL specs. You will be at the mercy of that incredible team to share their code into OHDSI (which they do quite regularly). Based, on my knowledge, it is based on Microsoft SSIS platform.
- Use one of the commercial vendors, including Odysseus where I am at, to perform the Truven MarkeScan conversion for you using the latest and greatest Truven MarketScan ETL scripts, including support for OMOP CDM v5.3, latest THEMIS business rules, MarketScan specs and OMOP CDM vocabularies etc… And it is Hadoop / Spark based and can be easily plugged into any Hadoop based platform, including Amazon AWS EMR or Cloudera.
You get the idea - happy to discuss it.
Thanks a lot for the direction - we will evaluate both options.
Our current goal is to load the Truven dataset completely into CDM. This would allow us to use all the currently available tools for data analysis and such. We also need to make a cost analysis evaluation. Could you briefly summarize where the tools offered by Odysseus are overlapping with the ones offered freely by Janssen and where not?
Thanks a lot.
The free open source CDMBuilder tool that Greg mentioned is available here:
The CDMBuilder Documentation is here:
It’s a .NET application that works with AWS Redshift or MS SQL Server.
My company, LTS Computing, provides OMOP CDM ETL conversion services, including the Truven & CPRD datasets. We use pure SQL ETL scripts that can be adapted to any of the major database platforms on premise or in cloud.
The CDMBuilder tool can convert Truven CCAE, MDCR and MDCD. We’ve also developed SQL ETL to convert the Truven HPM (Health and Productivity Management) database if you need that.
Feel free to direct message me here, or use the contact form on our website if you would like additional information.
We also deploy the free open source OHDSI analytical tools (Atlas & OHDSI R packages) and provide OHDSI tool support services for clients.
your help is greatly appreciated.
So just to make sure that I understand (as I mentioned before I am new to both Truven and CDM): using the open source stuff you pointed to me will allow me to load raw Truven data (the way it was delivered to us after we purchased it) into a CDM database.
Please correct me if my assumption is not correct.
Basically, there are three aspects to every ETL:
- Capability / Skills to use it
- Service around it
Erica and her team create solid tools. For you, if you are Ok with running the .NET-based ETL and can do some tweaking on a receiving side so that you can adopt it to your specific Truven MarketScan data set (e.g. what supplemental data you have - HPM, Early View etc., how it is exposed etc…), making sure it give you an acceptable conversion/mapping coverage etc… - then you are all set to go. You also need to be at peace with whatever OMOP CDM version targeted in this tool (5.1?) as well as wether it reflects the latest and greatest THEMIS business rules as of the day you run it. You could also adopt the ETL spec to the technology of your choice, upgrade it yourself and contribute it back to OHDSI.
The ETL tools that Odysseus offers are also SQL based, but can also easily run on Hadoop/Spark (Cloudera, AWS, GCP) or other platforms. Being a commercial vendor and doing it for living for many of our customers, we maintain our tools and provide services around OMOP CDM conversions:
- support the latest version of the raw data
- target the latest OMOP CDM version (it is 5.3 today)
- implement the latest THEMIS business rules
- create custom mappings, if something is not covered by standardized vocabs
- tool scales well for the data size of this type (Truven Commercial Claims is a very large data set)
- ensure conversion / mapping results are high and acceptable, as per THEMIS business rules
- we help to install, configure, run and QA/validate the results
- and support you in production in case something is not right
So, if you believe you are happy with a tool/technology and also your organization have the skills and capability to adopt and maintain it - the open source tool above is a great choice. If you feel that you need another technology option or need skills to adopt/maintain the tool and have the service around OMOP CDM ETL - then you probably want to reach out to commercial vendors like Odysseus (Greg) or LTS Computing (Lee) .
Happy to chat through this, please feel free to reach out to this email - email@example.com
Thanks a lot for the explanation.
ETL-CDM Builder (Option number 1 mentioned by Greg and Lee) is indeed a tool based on .NET FW,
but recently moved to .NET Core which makes it possible to be ran by AWS Lambda (or any number or Linux or Windows VMs either in AWS or any other environment) and achieve great conversion performance and costs. That is not available in the repo yet, but should be there within the next 2-3 weeks. Supported CDM version would be 5.3.
New release will also be modular and provide modules which can be used separately in any ETL tool like Talend / Microsoft SSIS / etc.
The flow for conversion using AWS Lambda would be:
- Upload the binary to AWS Lambda
- Make data chunks from Redshift in S3 (handled by a tool supplied with ETL-CDM Builder)
- Wait until CDM Builder gets it converted (takes 1-2 hours to convert full CCAE).
- Upload data from S3 to Redshift (also as simple as one AWS copy command per CDM table)
ETL-CDM Builder supports conversion of the following databases into CDM out of the box:
- Truven CCAE / MDCR / MDCD
- Optum Extended SES / DOD
- Optum Panther
Please feel free to contact me at firstname.lastname@example.org if you have any further questions on ETL-CDMBuilder usage in any environment or any inquiries regarding your CDM needs.
From what I can see there are four databases needed for the process:
- The builder database (internally used by the builder)
- The source databse (I imagine this should contain the original raw Truven data)
- The OMOP Vocabulary database
- The destination database - this will contain eventually the Truven data in CDM format
I have a few questions:
A) The source database (#2) can this be on a Vertica repository or it must be in SQLServer or Redshift?
B) IS there some resource for getting and creating the OMOP Vocabulary database?
Your help is very much appreciated.
A) I’ve only used CDMBuilder with SQLServer and Redshift. In general, Vertica isn’t one of the databases currently supported by the OHDSI tools.
B) Download the latest OMOP Vocabulary data here:
The OMOP CDM DDL (which includes the vocabulary tables) for all the OHDSI tool supported DBMSs is available here:
A) I never tried but it should works with any sources that have ODBC drivers.
You can try following option: Vertica as source and MS SQL as destination. If you will have an issue with Vertica as source, just add it here https://github.com/OHDSI/ETL-CDMBuilder/issues and I will try to help you.
It looks like the files (all except the CPT that I need to build after getting the UMLS license) are csv and they are to be bulk loaded into the schemas. If my assumption is wrong let me know.
Otherwise - thank you very much!
We have the Truven Marketscan already in Vertica - I will try it from Vertica frist and let you know.
@gregk @lee_evans @bradanton @e.alex @Kangc3333
I question regrading the OMOP Vocabulary loading - I just finished loading 9 files into the CDM schema (Redshift). There is a 10th file called CONCEPT_CPT4.csv (I guess it is the result if downloading the CPT codes from UMLS).
For this file there is no equivalent table in the CDM schema.
Any advice regrading what I should do with this file is appreciated.
Hi all, quick question for the community that has ETL’ed Marketscan into OMOP. If we push Marketscan in OMOP CDM is there a loss of data or is there a 1:1 match between marketscan fields and OMOP table/fields (Cost table, etc). I think there are fields like “Sales Tax” in Marketscan that cannot be pushed to OMOP simply because i dont see anyequivalent in OMOP…but wondering if anybody had a view in terms of the “loss of data” rate when storing Marketscan claims data into OMOP. Are we talking an insignificant 1% or more? Thanks all! Damien
We have been working on our MarketScan conversion and there is not a 1:1 conversion between fields. I do not know how familiar you are with the CDM but there are a prescribed set of destination Tables and Fields and only those get populated.
Have you reviewed the CDM specifications?
ok that is what i suspected. Thanks a lot!
PS: yes, we have looked at some of the specs and tables available in OMOP.
Great, let us know if you have questions. The community has been very helpful.