FORDS/NAACCR tumor registry data

blm14 · April 11, 2016, 6:51pm

Our institution is interested in getting our ACOS tumor registry data into OMOP. The registry follows the FORDS/NAACCR code book, and the source data is very complicated (approx. 600 columns, dozens of which are disease-specific). Has anyone out there had any experience attempting to map this data set into OMOP? Thanks!

Mark_Danese · April 11, 2016, 8:09pm

Hi Ben:

The CMS workgroup is going to focus on oncology data this year. Specifically, we are looking at SEER data. This is based on NAACCR with quite a few SEER specific pieces. We can include you as part of the workgroup if you like. We have quite a few things mapped already based on some work we (my company) did for the National Cancer Institute. So we do have a head start.

Mark

blm14 · April 11, 2016, 8:20pm

That would be great! I was looking at the V14 coding book and was getting worried about all of the site-specific factor columns. Thanks!

Mark_Danese · April 11, 2016, 8:31pm

Unfortunately, that is where we need to do some work. There are ~150 sites, each with site-specific factors. It isn’t going to be pretty. But it will be easier with more hands on deck.

blm14 · April 11, 2016, 8:36pm

One other question: I am less familiar with the SEER data than the NAACCR database - will the work going into SEER be useful for files from an NAACCR data set? I ask because we will have an identified set and will want to be able to link patients’ tumor registry data with the rest of their clinical record in our OMOP instance. My understanding is that the SEER dataset is deidentified, and you cannot filter based on a specific site or institution. Please correct me if that’s not correct!

Mark_Danese · April 11, 2016, 8:43pm

You are correct. We do not deal with geographical location. But much of that should be in different tables in the CDM that are focused on locations and addresses.

SEER data has cancer site and most of the NAACCR data, as well as additional variables that SEER uses to conduct analyses. I have not looked closely at the NAACCR data to figure out the overlap. Take a look at this link, which maps NAACCR fields to SEER fields. Note that the link might change next week when the new data is uploaded (the URL will say nov15 instead of nov14).

Mark_Danese · April 11, 2016, 10:58pm

I should also mention that virtually all cancer codes are in LOINC. I talked to Daniel Vreeman at LOINC and this is what he provided to me:

We’ve worked closely with NAACCR so that all of the variables in their NAACCR Data Standards and Data Dictionary are represented in LOINC. As you probably know, these are the variables sent to central cancer registries so I would presume there is a lot in common with the SEER set.

Here is a query that will show the top level panel codes (underneath which all of the individual data elements are linked) for the NAACCR collections:

http://search.loinc.org/search.zul?query=naaccr+panel

dckc · April 18, 2016, 8:04pm

HERON ETL has supported import of the NAACCR file format into i2b2 for some years. I just added support for site-specific factors recently. The output of our ETL process is obviously different, but the code we maintain might be helpful for getting a handle on the input. I would be interested in adapting the code for interoperability with OMOP. Does anyone have a design sketch of what the data should look like in OMOP tables?

The main reference is https://informatics.kumc.edu/work/wiki/TumorRegistry , which has pointers to much of the code. Unfortunately, we haven’t managed to share all of the code publicly yet, but we can add people to a password-protected github repository on request.

Mark_Danese · April 18, 2016, 8:41pm

Hi @dckc
This looks really helpful. I think 90% (or more) of what we are interested in will go in the observation table. So, what we are looking to do is to take each data item and:

- identify a relevant concept id for it (e.g., for grade, find the concept id that points to cancer grade in LOINC)
- identify relevant concept ids for each possible response (assuming it is not a simple numeric quantity)
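
As a rough illustration of those two steps, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the item names, the lookup tables, and the concept ids are placeholders rather than real OMOP or LOINC concept ids. The output keys follow the OMOP observation table's column names.

```python
# Hypothetical lookup tables. The ids are placeholders, not real OMOP concept ids.
ITEM_CONCEPT_IDS = {"grade": 9000001, "tumor_size_mm": 9000002}
VALUE_CONCEPT_IDS = {("grade", "1"): 9000101, ("grade", "2"): 9000102}
NUMERIC_ITEMS = {"tumor_size_mm"}


def to_observation(person_id: int, item: str, raw_value: str) -> dict:
    """Build one observation-style row from a single registry data item."""
    row = {
        "person_id": person_id,
        # 0 is the OMOP convention for "no matching concept".
        "observation_concept_id": ITEM_CONCEPT_IDS.get(item, 0),
        "observation_source_value": item,
        "value_as_number": None,
        "value_as_concept_id": None,
        "value_as_string": raw_value,
    }
    if item in NUMERIC_ITEMS:
        # Simple numeric quantity: no response concept needed.
        row["value_as_number"] = float(raw_value)
    else:
        # Coded response: look up a concept id for this (item, value) pair.
        row["value_as_concept_id"] = VALUE_CONCEPT_IDS.get((item, raw_value), 0)
    return row


print(to_observation(42, "grade", "2"))
```

Keeping the raw value in `value_as_string` means unmapped responses are still recoverable later, once a concept exists for them.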

But our first challenge before doing the above is to identify all of the unique site-specific data items for the 153 different tumor sites, and all of the unique values for each of those data items. It sounds like maybe you have this in some form already?

I looked at the NAACCR site based on your link above, and I am playing with some of the data that is downloadable from NAACCR.

Thanks very much for posting this information. I need to get the working group started and will invite you to participate along with others.

dckc · April 18, 2016, 8:57pm

Yes… I just remembered where I do have many of these details shared: GPC ticket #150. It has a screenshot of the unique values for the heart site. I just attached csterms.py, which details how it works:

The American Joint Committee on Cancer (AJCC) maintains overall
management of the Collaborative Staging System and publishes a schema
for browsing as well as a zip file with data in XML.

Mark_Danese · April 18, 2016, 10:28pm

So you parsed that XML file already? We didn’t take that on. That is really a useful thing.

Now we just have to map these things to concept-ids in OMOP. Or, as I am increasingly thinking, it is probably better to create a NAACCR vocabulary and not map these things to another vocabulary since they are used so often in their native vocabulary, and are so specific to cancer. Probably something to discuss with @Christian_Reich.
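
To make the "native NAACCR vocabulary" idea concrete, here is a hedged Python sketch of what generating rows for the OMOP concept table might look like. The vocabulary id, concept class names, concept code format, and the sample item are all hypothetical choices for illustration, not an agreed design. The one convention it leans on is that OMOP reserves concept ids above 2 billion for locally defined concepts.

```python
from itertools import count

# OMOP convention: concept ids above 2,000,000,000 are reserved for local concepts.
LOCAL_CONCEPT_START = 2_000_000_001


def build_concepts(items, vocabulary_id="NAACCR_LOCAL"):
    """Create concept-table rows for each data item and each of its coded values.

    `items` is a list of (item_number, item_name, [(code, label), ...]).
    The vocabulary id and concept classes are hypothetical names.
    """
    ids = count(LOCAL_CONCEPT_START)
    rows = []
    for item_number, item_name, values in items:
        rows.append({
            "concept_id": next(ids),
            "concept_name": item_name,
            "domain_id": "Observation",
            "vocabulary_id": vocabulary_id,
            "concept_class_id": "NAACCR Item",
            "concept_code": str(item_number),
        })
        for code, label in values:
            rows.append({
                "concept_id": next(ids),
                "concept_name": label,
                "domain_id": "Observation",
                "vocabulary_id": vocabulary_id,
                "concept_class_id": "NAACCR Value",
                # Hypothetical code format: "<item number>:<value code>".
                "concept_code": f"{item_number}:{code}",
            })
    return rows


# Illustrative input only; check item numbers and labels against the data dictionary.
sample = [(440, "Grade", [("1", "Grade I"), ("2", "Grade II")])]
for row in build_concepts(sample):
    print(row["concept_id"], row["concept_code"], row["concept_name"])
```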

dckc · April 19, 2016, 2:42pm

Yes, csterms.py parses the XML and (a) produces i2b2 metadata and (b) produces a big SQL case expression that determines which “schema” applies, given primary site and morphology.
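
For readers who have not seen csterms.py, the general idea of generating a SQL case expression from schema rules can be sketched as below. This is not the actual csterms.py code. The rule format, the column names `primary_site` and `histology`, and the schema names and code ranges are assumptions made up for the example.

```python
def schema_case_expression(rules, site_col="primary_site", hist_col="histology"):
    """Render a SQL CASE expression that picks a schema name per tumor.

    `rules` is a list of (schema_name, [site codes], (hist_low, hist_high)).
    Inputs are assumed to be trusted values generated from the XML,
    so only single quotes are escaped.
    """
    def quote(text):
        return "'" + str(text).replace("'", "''") + "'"

    lines = ["CASE"]
    for schema, sites, (hist_low, hist_high) in rules:
        site_list = ", ".join(quote(s) for s in sites)
        lines.append(
            f"  WHEN {site_col} IN ({site_list})"
            f" AND {hist_col} BETWEEN {int(hist_low)} AND {int(hist_high)}"
            f" THEN {quote(schema)}"
        )
    lines.append("  ELSE NULL")
    lines.append("END")
    return "\n".join(lines)


# Illustrative rules only; real site and histology ranges come from the CS XML.
example_rules = [
    ("ExampleSchemaA", ["C380"], (8000, 9999)),
    ("ExampleSchemaB", ["C500", "C501"], (8000, 9699)),
]
print(schema_case_expression(example_rules))
```

Generating the expression from data rather than writing it by hand keeps the SQL in step with the published schema definitions whenever they are revised.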

blm14 · April 19, 2016, 5:53pm

I am happy to help with this effort in any way that I can (since I requested it originally!)

It’s likely that some (most? all?) of the columns should have concepts in the existing ontologies. No?

dckc · April 19, 2016, 6:45pm

If you’re aiming to put stuff in the observation table, the HERON ETL code should be quite directly relevant, as the OMOP observation table and the i2b2 observation_fact table have essentially the same structure.
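
A hedged sketch of how that correspondence might look in Python follows. The column names on both sides come from the i2b2 observation_fact and OMOP observation tables, but the mapping choices are assumptions: it treats patient and encounter numbers as directly reusable ids, which a real ETL would need to verify, and the `concept_lookup` dictionary and the sample concept code format are hypothetical.

```python
import datetime


def fact_to_observation(fact: dict, concept_lookup: dict) -> dict:
    """Translate one i2b2 observation_fact row into an OMOP observation-style row.

    Assumes patient_num and encounter_num can be reused as person_id and
    visit_occurrence_id. `concept_lookup` maps i2b2 concept_cd strings to
    OMOP concept ids and is a hypothetical helper structure.
    """
    is_numeric = fact.get("valtype_cd") == "N"
    return {
        "person_id": fact["patient_num"],
        "visit_occurrence_id": fact["encounter_num"],
        # 0 is the OMOP convention for "no matching concept".
        "observation_concept_id": concept_lookup.get(fact["concept_cd"], 0),
        "observation_date": fact["start_date"],
        "value_as_number": fact.get("nval_num") if is_numeric else None,
        "value_as_string": None if is_numeric else fact.get("tval_char"),
        "unit_source_value": fact.get("units_cd"),
        "observation_source_value": fact["concept_cd"],
    }


# Illustrative row; the concept_cd format and the concept id are made up.
example_fact = {
    "patient_num": 1001,
    "encounter_num": 555,
    "concept_cd": "EXAMPLE|item:value",
    "start_date": datetime.date(2016, 4, 19),
    "valtype_cd": "T",
    "tval_char": "value",
    "nval_num": None,
    "units_cd": None,
}
print(fact_to_observation(example_fact, {"EXAMPLE|item:value": 9000101}))
```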

Mark_Danese · April 19, 2016, 7:35pm

@blm14
My concern is that a lot of pieces won’t have an existing vocabulary. I think maybe we need to make an oncology vocab.

To be fair, we were able to map a lot of things to LOINC. But the site specific factors are probably less likely to exist. But once we have them in a nice neat format, it will be easier to look.

Andrew · May 19, 2016, 2:54pm

@blm14 We are doing something very similar to what it sounds like you are doing: adding data from a FORDS/NAACCR Tumor Reg to the same OMOP instance that contains EHR data. Our ETL is just getting started. We will do the EHR first before the tumor registry data.

@Mark_Danese and @dckc if it’s OK I would like to learn what you are all learning about OMOPing this data and then help out when we get to the Tumor Reg part of our work.

blm14 · June 21, 2017, 5:44pm

Hey guys - I just wanted to see if anyone had made progress on this!