ETL for a SAS-based registry study

Christian_Reich · April 14, 2017, 5:20am

SAS is not supported by the OHDSI stack out of the box. So, you have two choices: You either write the data into a relational database and use the Rabbits, or you do everything by hand: You take the CRF/data dictionary, and go table by table mapping and converting your data to the [OMOP CDM][1]. You will also need to convert your codes using the [OMOP Vocabularies.][2], which you need to pick up in [Athena][3].

Probably time to write a step-by-step cookbook for somebody.

Good luck.
[1]: http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm#common_data_model
[2]: http://www.ohdsi.org/web/wiki/doku.php?id=documentation:vocabulary:sidebar
[3]: http://athena.ohdsi.org/

nitishkjha · April 14, 2017, 5:55am

Hi Christian,

Thank you so much for replying.

Kindly suggest me on the following points:

Can I use SAS datasets in Rabbits ? if Yes then how?
I have downloaded some vocab from athena However i am still not very clear that which are the dictionary I should download.
Some CDM variables like _type_concept_id etc. , which dictionary need to be downloaded and from where?

Kindly help me.

regards,
Nitish Jha

Christian_Reich · April 14, 2017, 6:51am

@nitishkjha:

No. SAS won’t work. You either don’t use the Rabbits, or you get a database and store the data there.
Depends on what vocabularies are used in your source data. If there aren’t any standard vocabularies, and instead they used their own controlled vocabularies (which is typical for registry studies) then you have to map by hand using Usagi.
All the ones you really need are clicked. Just download the way it opens by default, and you should be alright.

Christophe_Lambert · April 14, 2017, 1:24pm

Note that it is relatively straightforward with a few lines of python or R code to export a SAS file to text, see: https://pypi.python.org/pypi/sas7bdat. From there you can write an ETL to convert the text files. My experience of writing ETLs is that it is a big project to get it right – can take several man-months, particularly if you are doing it the first time. That being said, it is a great exercise to really understand the vocabularies in depth as you convert from your existing data model to the OHDSI one. You might take a look at the https://github.com/OHDSI/ETL-CMS project that starts from Medicare text files, and see if you can adapt that code to meet your needs.

Christophe

nitishkjha · April 17, 2017, 4:19am

Hi everybody,

In my raw dataset I have a domain which is about employement status of subjects. It has field for example :

Are you employed?
2.Working status (Full time/part time)
Not working status
4.Financial coverage(Provincial/Private)

Now I would like to know that should i place it in observation table ? If yes then how?

Vojtech_Huser · April 18, 2017, 9:35pm

Start with a person table. Get year of birth of all patients and gender and populate the person table.

If you have data on diagnoses - do the condition_occurrence table next.

Christian_Reich · April 18, 2017, 11:56pm

@nitishkjha:

1, 2 and 3 are all Observations. There are a ton of good concepts you could write into the OBSERVATION.observation_concept_id field:

select * from concept where concept_name like '%employed%' and domain_id='Observation';

4: Is this insurance coverage?

nitishkjha · April 19, 2017, 4:00am

Dear Christian,

I will try to implement it as per your input.

Thanks a lot…

nitishkjha · April 20, 2017, 5:46am

Hi Everyone,

I am generating DEATH Table and I am having One subject more than once in my raw data.

Should I mention all three instance of the same subject in table.
cause of death is given like (Meddra Coding Preferred Term) in raw data or only one?.

1.ATRIAL FIBRILLATION (10003658)
2.HYPOTENSION (10021097)
3.INTERSTITIAL PULMONARY FIBROSIS (10022611)

start date and end date both are same for each observation.

Also, Kindly guide me that should i put Meddra PT Code (10003658) in CAUSE_SOURCE_VALUE and CAUSE_SOURCE_CONCEPT_ID would be the corresponding Concept ID of MedDra?
What should be the CAUSE_CONCEPT_ID in this case ? Would it be the SNOMED Concept ID From Vocab?

Please guide me as I am not sure of the same…

Regards,
Nitish