ETL working group (formerly "Medicare ETL working group")

(Mark Danese) #1

This year’s focus is on oncology data. Right now I have @blm14, @dckc, @amatcho, @clairblacketer, @ericaVoss, @jenniferduryea, @mlgleeson interested in participating. My apologies if I missed anyone. Please let me know if you would like to participate.

The goal is to clarify how to load the data into tables, decide what data goes in what tables, decide whether to do vocabulary mapping or create an oncology vocabulary, and generate ETL code to help populate the tables. The focus is on SEER and NAACCR data.

Keep in mind that there are 153 different tumor sites, each with up to 30 different “site specific” variables. So, it is not a small task to get this sorted out.

We have some mapping done for about 50 terms, and @dckc has done a complete load for i2b2. So, we do have a head start.

Please reply to this message if you are not included, but would like to be. Or if you are included and would like to be removed. No need to reply if you are mentioned and would like to participate.

First meeting will be in about a week or so, at a (hopefully) agreeable time.

(ben may) #2

Another challenge is that NAACCR seems to change the record layout just about every year. I think they are on version 15 or 16 already.

(Dan Connolly) #3

I’m missing some context. How is Medicare relevant?

I’m still pretty new to OHDSI, so I don’t know how working groups work around here. Did I miss some orientation materials?

Another form of orientation I could really use is some story telling (use cases).

(Lee Evans) #4

I would also be interested to participate @Mark_Danese.

Now that the focus is on oncology data, I suggest we drop the word “Medicare” and just call it the “ETL working group”.

(Mark Danese) #5

Sorry @lee_evans – I should have assumed you were interested. Will include you. We can change the name.

(Mark Danese) #6

Sorry – this “workgroup” was originally just a project to ETL the Medicare Synthetic Public Use (SynPUF) files into CDM v5. But it is living beyond its original purpose, and some of the members are working with the SEER Medicare linked data. So, we want to work on the SEER part this year. Hence, Lee quite appropriately suggested changing the name. Not sure I can edit the title but will try.

(Dan Connolly) #7

I’m afraid I’m more disoriented after that name change. “ETL working group” suggests a huge scope.

I guess I should elaborate on what I’m after:

The main project I work on is HERON, an installation of i2b2 at KU Med Center. When we started development, we sketched out things like our PatientCountStory. i2b2 has a user interface and I more or less know who the HERON user community is.

In contrast the goal stated above…

to clarify how to load the data into tables, decide what
data goes in what tables, whether to do vocabulary mapping or create an
oncology vocabulary, and to generate ETL code to help populate tables.
The focus is on SEER and NAACCR data.

… is hard for me to get my head around. The way you put data into tables all depends on how you want to query it, in my experience.

When we started adding NAACCR data to HERON, our use case was something like “count patients with grade 1 tumors.” We do better when we capture use cases that are actually interesting to researchers. For example, when our trauma registry folks wanted us to integrate their data with HERON, we met with them and came up with:

Use case: labs for people who came in with liver ulcerations.

In GPC, we’re working on integrating geocoding data (#140). There you’ll see a March 22 comment that captures use cases such as “How many diabetic patients (ICD9: 250) reside in rented property vs. own home.”

When I started working on site-specific factors, the use case was basically “count ER+ breast cancer patients.” It was implied by the fact that ER+ was in there as “site specific factor #1” but you had to have the NAACCR and ICD-O manuals at your side and wield some level-7 i2b2 query magic to actually use it. Now it’s straightforward to navigate into **Site-specific factors / Breast / 01: Estrogen Receptor (ER) Assay** and drag 010: Positive/elevated in as a query term.
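To make the “count ER+ breast cancer patients” use case concrete, here is a minimal sketch in Python. It is not HERON or i2b2 code. The record shape, the `SSF_LABELS` lookup, and the function names are all assumptions made up for illustration; the only entry taken from the description above is Breast / factor 01 / code 010 (Positive/elevated).

```python
from typing import NamedTuple


class SsfKey(NamedTuple):
    """Identifies one coded value of a site-specific factor (hypothetical shape)."""
    site: str
    factor: int
    code: str


# Hypothetical lookup table. Only this single entry comes from the post above;
# a real table would need one entry per site, factor, and code.
SSF_LABELS = {
    SsfKey("Breast", 1, "010"): "Estrogen Receptor (ER) Assay: Positive/elevated",
}


def label_for(site: str, factor: int, code: str) -> str:
    """Return a readable label for a coded value, or 'unmapped' if unknown."""
    return SSF_LABELS.get(SsfKey(site, factor, code), "unmapped")


def count_patients(records, site: str, factor: int, code: str) -> int:
    """Count distinct patients with at least one tumor record matching the value."""
    return len({
        r["patient_id"]
        for r in records
        if r["site"] == site and r["ssf"].get(factor) == code
    })


# Made-up tumor records: one row per tumor, so a patient can appear twice.
# Code "020" is just a placeholder for "some other value".
records = [
    {"patient_id": 1, "site": "Breast", "ssf": {1: "010"}},
    {"patient_id": 1, "site": "Breast", "ssf": {1: "010"}},
    {"patient_id": 2, "site": "Breast", "ssf": {1: "020"}},
    {"patient_id": 3, "site": "Lung", "ssf": {1: "010"}},
]

er_positive = count_patients(records, "Breast", 1, "010")
print(label_for("Breast", 1, "010"), "->", er_positive, "patient(s)")
```

The lookup key includes the site because the same factor number and code can mean something different for another tumor site, which is why the Lung record above is not counted.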

The “FORDS/NAACR tumor registry data” thread started with “We have interest at our institution on getting our ACOS tumor registry data into OMOP.” That could take any number of forms… anything from a simple flag that says “this patient is in the ACOS tumor registry” up to some sophisticated integration with imaging data. I’d like to hear some stories of how somebody would use the results of this working group once it has achieved its goals.

(Mark Danese) #8

Don’t worry about the name of the group. Nobody is obligated to do any more than they have time to contribute. It is just easier to keep it generic so we don’t have to change it every year! :smile:

We can certainly discuss use cases. There are probably 90 fields we need to populate, so I don’t expect to generate use cases for all of them. Particularly things like histology, location, grade, behavior, etc. In fact, we have gone through this exercise already with the NCI, and have a nice list of variables we want to include, and details about how to load them in the OMOP tables. But for the site specific factors, the idea of a use case and details is very welcome.

The i2b2 model is a bit different conceptually from OMOP so we will get the opportunity to learn about both approaches.

(Dan Connolly) #9

If you have a pointer to that exercise with the NCI, I’d appreciate it.

(Mark Danese) #10

There is no online version of the resulting mappings from the SEER data to the OHDSI concept IDs. But when we get the spreadsheet cleaned up, I will definitely send it around (and probably put it in a more machine readable form too).

(Anto Thomas) #11


I’ll be interested in participating.
I just joined the OHDSI community recently. We’ve set up a demo/dev instance and have started getting familiar with the model and the tools.
I was just starting to investigate how we can store more granular cancer phenotype data in the CDM. Another objective I have is to be able to capture more granular cancer biomarker data (genotype) from annotated whole genome sequencing.

Thanks and I look forward to working with you and the rest of the group on this.

Anto Thomas

(Michael Gurley) #12

I am for sure interested.

(Chad Smathers) #13

I would be interested in participating as well.

(Eric Schneider) #14

I too am interested in participating if it isn’t too late. Thank you!

(Mark Danese) #15

I apologize – despite the best of intentions, I don’t have the time right now to lead a workgroup this year, so it is on hold until someone else can do it, or until later this year when my schedule frees up some. However, I am happy to share what we have done with oncology data. As soon as I get it cleaned up, I will upload it to an appropriate location.

(Patrick Ryan) #16

Thanks @Mark_Danese for your leadership over the last year in this space.
It seems there is a lot of enthusiasm from the community around having a
regular discussion around ETL best practices. All we need is someone to
volunteer to coordinate those efforts. If someone is interested,
@MauraBeaton can help fill you in on the logistics of running a workgroup.

(Michael Gurley) #17

I would love to see the spreadsheet containing the mappings from the SEER data to the OHDSI concept IDs.

(Mark Danese) #18

I should be able to post it by the end of next week. I just have some cleaning up to do, but a few other projects are in the way first.

(Mark Danese) #19

I posted this elsewhere but our current version of the mapping is here: https://www.dropbox.com/s/2gl5az5duq7z72q/Final%20SEER%20variable%20with%20codes.txt?dl=0

(Selva Muthu Kumaran Sathappan) #20

Hello Everyone,

I would like to be part of the ETL working group. We are currently transforming our EHR data to the OMOP CDM format and are encountering a few issues, so I felt that joining this group would help me learn from all of you and clear up any doubts I have. I hope I have posted this message in the right thread.