OHDSI Home | Forums | Wiki | Github

Medicare ETL development

As discussed on the call today, we will kick off a working group meeting for anyone interested in contributing to the ETL process for Medicare data. We will focus on the synthetic data, but we will also discuss how to ETL some of the data that is NOT in the synthetic data so that the process can be used for a broader range of Medicare datasets and a broader audience.

We expect to put up our thoughts on the ETL process in about a week (as a Word document), and to then have a meeting to discuss it in detail. Following that meeting, we will divide and conquer ā€“ giving each participant a small part of the task so that nobody has to do it all.

After the specification is done, we will then discuss the coding and similarly divide up the tasks as much as possible.

If we can get something ready for AMIA, we will do it. We can discuss that at the working group meeting.

If anyone has any questions and wants to contact me directly, you can do so at mark@outins.com ā€“ otherwise, just respond to this topic.

Per Patrickā€™s notes, the following people expressed interest in the process:

@donohara, @sblyman, @hripcsa, @aperotte, @ericaVoss, @Christian_Reich, @dsontag, @amatcho,
@bailey (interested once it gets to SEER), Rich Boyce (interested from nursing home perspective, Lee Evans (use data for test tools) (apparently you can list a max of 10 users in a post)

Hi Mark - got you message and am interested. If you would like, I could ask my admin to set up a doodle poll so the WG can get a couple calls in during February.

-Rich Boyce

That would be awesome. Thanks very much.

Would anyone who is interested in this working group and willing to participate in a couple of phone meetings please fill out this doodle poll?

http://doodle.com/usbme46i4i6ucwgv

Please ignore the title of the Doodle. I will asked admin to change that to ā€œMedicare ETL development WGā€ and she should get to that tomorrow.

Iā€™m interested too, Mark. Iā€™ll add my availability to the poll

Bill

It looks like these 3 dates work for everyone:
Wed Feb 4 from 12-1 pacific (3-4 eastern)
Tues Feb 17 from 12-1 pacific (3-4 eastern)
Mon Feb 23 from 12-1 pacific (3-4 eastern)

Suggested agenda for the first meeting:

  • Introductions (experiences with ETL and Medicare): 10 min
  • Discussion about available software for the ETL process: 15 min
  • Divide into working teams: 10 min
  • Discussion of known Medicare idiosyncrasies: 25 min

I am going to suggest that we divide into several small working groups of 1-2 people to write the specifications for subsets of the CDM tables. Individuals can contribute by helping to write the specification or by helping to code the ETL, or both. If Frank is available, a 5 min discussion of CDM Builder would be helpful (or we can table the discussion until a future meeting). The plan will be to use whatever software works for most of the people who can do the coding.

The plan is to put a spec together for the synthetic data first. Then to expand this to the SEER Medicare data, and then to the 5% sample (in order of complexity).

Our organizationā€™s expertise is NOT in ETL, so hopefully we can rely on others who have done this before. If anyone has suggestions on how to do this better, I am happy to get feedback publicly or privately.

We can use this dial-in information (below). If you would like me to send you an invitation directly to your email, please email me at mark@outins.com

Join the meeting: https://join.me/markdanese

On a computer, use any browser with Flash. Nothing to download.
On a phone or tablet, launch the join.me app and enter meeting code: markdanese

Dial in numbers:
United States - Camden, DE +1.302.202.5900
United States - Detroit, MI +1.734.746.0035
United States - Hartford, CT +1.860.970.0010
United States - Los Angeles, CA +1.213.226.1066
United States - New York, NY +1.646.307.1990
United States - San Francisco, CA +1.415.655.0381
United States - Saugus, MA +1.781.666.2350

Access Code 104-159-812#

I can best contribute with experience working with a Medicare ETL (ETL DOCUMENT) as well 2 other ETLs weā€™ve done here at Janssen. I also have experience working with CDM v5 as we are preparing to do so here at Janssen and Iā€™ve helped the Erasmus MC team convert their data to CDM v5.

I also have experience in White Rabbit and Rabbit in a Hat (WIKI) which is a tool to help document and drive the ETL process.

Iā€™m probably not the fastest resource to help build an ETL as my languages of choice are SAS/SQL and I know from personal experience while it is possible to write your ETL in SAS/SQL it is probably not the best/most performant option. I did not develop CDM_BUILDER however I work closely with the team that has implemented here at Janssen.

Looking forward to our chat tomorrow.

Just wanted to retag this post with the meeting date and contact information (above). I can only include 10 peopleā€™s tags so I will skip Erica since she already responded. Others are certainly welcome to participate.
@donohara, @sblyman, @hripcsa, @aperotte, @Christian_Reich, @dsontag, @amatcho, @lee_evans, @rkboyce, @wstephens

Also, our proposal for subgroups to handle sections of the ETL are below. We can certainly re-arrange this, but we are trying to see if we can all work in parallel. There will need to be a harmonization process (Phase 2). And some table will be populated on an ad hoc basis.

Group 1: Condition, Procedure
Group 2: Device exposure, drug exposure
Group 3: Location, person, observation period, payer plan period, death

Phase 2: Cost
Phase 2: Visit, Care Site, Provider

Ad hoc: Fact Relationship, Observation

Not needed(?): Measurements, Note, Specimen, Standardized Derived Elements (Cohorts, Eras)

Discussion items as time permits:

  • Line vs. claim diagnoses and dates (also applies to some procedures
    that are reported on intervals)
  • Facility file vs. physician file ā€“ potential duplication of information (procedures and diagnoses and
    visits)

Iā€™d recommend we take a look at some dependencies on the tables before we
split out. For example, we need to define visits before we can do
conditions or procedures if we want to have the visit occurrence id
populated in those tables. Also, itā€™s useful to get the observation period
table populated upfront, so you can ensure all subsequent tables are
populated with only data that falls within a valid observation period.
While the data structure is different, I think the sequence of tables that
erica details in the truven mdcr etl specification document in the etl -
cdm builder package may provide a helpful framework to consider.

I do think it may be useful to review the whiterabbit summary as a group,
so everyone is grounded with the same context. @aperotte has a good start
on the rabbitinahat document, but thatā€™s definitely an area for
parallelization in developing more detailed documentation once the
highlevel decisions are made as a group.

My two cents on the outstanding items: I donā€™t think itā€™s an issue to have
diagnosis information that seems redundantā€¦the era table will take care
of consolidating records. But we need to make sure we get the visit
definition right, and donā€™t just use every claim encounter as a visit,
because thatā€™ll dramatically inflate the count of visits. As erica has
shown in her bmc paper, standardizing visit definition can have a material
impact on disease prevalence and health service utilization markers, so
would be good to reach consensus on this and apply throughout all the cms
databases.

Excited to get this process started with all of you!

Cheers,

Patrick

Truven doesnā€™t have a separate file that contains ā€˜physician claimsā€™,
instead all claims are provided in inpatient and outpatient summary and
detail table, plus thereā€™s a facility header table.

In general, Iā€™d think an approach that could work for us is to combine all
claims for all sources, and then apply logic across all claims to determine
the inpatient episodes and outpatient/er visits that fall outside those
episodes. Clearly in cms, some of these claims records will contain
redundant information, but that shouldnā€™t cause any problems when we roll
things up.

My concern is the ā€œrolling upā€ step. Condition eras are sometimes useful, but I like to have more granularity around when diagnoses, procedures, and visits are recorded and how they relate to each other. And I like to keep inpatient facility, outpatient facility, and provider (physician) claim records separate (or at least identifiable). I donā€™t mind rolling things up when appropriate, but sometimes we are doing costing and we want to roll up costs in these groupings.

Also, we like to implement existing algorithms that consider the source of the diagnosis (e.g., inpatient or physician). That approach may be becoming antiquated, but I donā€™t want to prevent people from implementing these kinds of algorithms.

We can certainly discuss this tomorrow. It may be that my concerns reflect a lack of understanding and that no useful information is lost in the process.

In the ETL we implemented we keep as much granularity as possible in CONDITION_OCCURRENCE using the CONDITION_TYPE_CONCEPT_ID to help us navigate where information came from. We do some collapsing, for example if a diagnosis happens multiple times on a claim we take its lowest position it occurred. But for the most part we try to preserve information in the raw data.

However, we have done some work on reassigning mis-classified claims. For example, sometimes youā€™ll see outpatients claims that have room and board charge on them. Another example we see is if a patient was admitted to the hospital for 5 days and you see outpatient claims during those times, the patient most likely did not leave the hospital to visit another physician. We did some work (waiting for it to be published) that showed that if you apply this standard inpatient methodology across claims datasets they become more consistent in their inpatient condition prevelances.

This isnā€™t an ā€˜either/orā€™. You preserve all the granularity in the
CONDITION_OCCURRENCE table, and from there you can decide to restrict your
focus to inpatient/outpatient, primary/secondary. CONDITION_ERA is an
additional construct (not replacement) for when you just want to know what
conditions a person has. A rare time when you can have your cake and eat
it too:)

I wonder if it wouldnā€™t be wise to collect a compilation of the transmittal file layouts that the OHDSI community is most likely to receive usable (so not any and all files) claims data in, and then work on ETL decisions that accommodate these most common data inputs. Iā€™m not sure if this would be valuable for Medicare or not- I was surprised when we received very different data transmittals from Medicaid entities in different states- not sure if this holds as true for Medicare.
If I looked at my Medicaid (not Medicare) data I get the following files types

  • Client data
  • Eligibility
  • Enrollment
  • Provider Data
  • Provider Address
  • Provider Specialty
  • Pharmacy Claims
  • Professional Claims
  • Institutional Claims
  • Institutional-other procedure
  • Institutional- other diagnosis

Hi Lisa:

That is what I was thinking in terms of dividing into groups and focusing on sections of the ETL process. Then we can get to the implementation part after all of the thinking and documentation is done. It will allow the groups to discuss other Medicare formats (5%, SEER Medicare) and any other idiosyncrasies that exist in those. Then, as Patrick says, the implementation should be done in a very specific manner. But again, we can all discuss this at the meeting. My goal is to have a well-documented process, and at least one implementation (for the synthetic data). But I would like to try to get to implementations for the other common Medicare data if we can.

Mark

@ericaVoss For CDM v4, the CONDITION_TYPE_CONCEPT_ID field is limited to Outpatient header, Outpatient detail, Inpatient Header and Inpatient Detail. If we are to specify whether data came from a physician claim or an outpatient facility claim, how would you use this field? And this is just specific to conditions. What about specifying this distinction between procedures (i.e. does this hcpcs come from an outpatient facility or does it come from a physician claim)?

Could we as a team follow the best practice process highlighted by OHDSI, maybe something we could discuss today of how our team could work through this?
http://www.ohdsi.org/web/wiki/doku.php?id=documentation:etl_best_practices

@aperotte has started this process with Patrick, but I think we should complete it as a group.


These are the CONDITION_TYPE_CONCEPT_IDs that made sense for our data. You donā€™t have to use them and you arenā€™t restricted to them. Here are all the CONDITION_TYPE_CONCEPT_IDs currently in the VOCAB:

SELECT *
FROM CONCEPT
WHERE VOCABULARY_ID = 'Condition Type'
ORDER BY CONCEPT_NAME

If there isnā€™t a type that suits your needs we could request more from the VOCAB team. . . but I would like to understand the research question that requires you to be able to strip apart physician claims from outpatient claims.

@ericaVoss
Thanks for the link to best practices. I am not sure how to do this with a large group, but we can work that out today. Also good to know there are more types that we can access beyond the 4 Jen listed.

Physician claims can be for an office visit, for a visit to an outpatient facility (and be part of a record from the outpatient facility file) or for an inpatient stay (and part of the record from the inpatient facility file). That is what we are trying avoid losing. One use case is costing studies where you want to attribute costs to an inpatient stay, for example. There you would need information from both sources ā€“ inpatient facility and physician file.

Why wouldnā€™t you just list them as outpatient and inpatient claims as appropriate? Are you doing costing studies the focus on Physician claims vs Inpatient/Outpatient? Iā€™m only asking to understand the use case.

@ericaVoss: The synthetic Medicare data comes in three files: Inpatient facility, Outpatient facility and Physician files. Please see link to where the data is available for download: http://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html

Medicare claim sets for analysis are available in a form that is closer to itā€™s native source. In the real world, physicians send claims, so we get a physician file. Facilities send claims, so we get facility claim files (generally split by inpatient=acute hospitals and outpatient=every other medical facility). I have only seen Truven roll up both outpatient facility claims and physician claims into one file.

I have to get familiar with what is available for this field in CDM v5. We may need to make a request to the VOCAB team. But we still have solve this problem for the Procedure Occurrence table as well. We want to preserve as much granularity as possible.

I figure we can get into this more during the meeting. But I wanted to point you to the actual files so you can see how these files compare with Truven data.

t