A Fireside Chat with Patrick Ryan about Managing
Observational Data and Benefits of the Common Data Model
as noted by Shawn Dolley
(OK, there was no fireplace.)
Shawn (S): Patrick, how does an organization with an active CDM handle duplicate patients or lives in claims data? Is there some matching that can say ‘hey this person from Truven with this birthday and blood type and provider location and diseases looks to be the same person in this Optum data set, so let’s not count them twice’?
Patrick (P): Each instance of a database, each CDM, is kept separate. Each CDM instance ties to one source. Some people ask, ‘Why don’t you put all the data into one big CDM?’ This is one of the reasons not to do that: you’d have double counting.
P: We would hope that within a single data source or database, an administrator can de-duplicate patients, but that is contingent on the source data. Some well-known data sets cannot link patients at all. HCUP from AHRQ is hospital encounters, and they have no concept of linked patients, so each encounter is one record, and each encounter could be a unique patient or a previous patient. That’s just within one data source. If I have two data sources, there is no coordination of person ID across databases, and usually you can’t find the same patient across databases after the fact. Now, if one had sufficient information to create a deterministic or probabilistic mapping, you could do such a thing.
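To make that last point concrete, here is a minimal sketch of what deterministic matching on shared attributes could look like, assuming two hypothetical patient extracts; the field names and the toy data are illustrative only, not OMOP CDM fields or an OHDSI tool.

```python
# Minimal sketch of deterministic record linkage between two hypothetical
# patient extracts. Field names (birth_date, sex, zip3) are illustrative,
# not the OMOP CDM person table specification.

def match_key(patient):
    """Build a deterministic key from identifying attributes."""
    return (patient["birth_date"], patient["sex"], patient["zip3"])

def find_duplicates(source_a, source_b):
    """Flag records in source_b that appear to duplicate a record in source_a."""
    seen = {match_key(p) for p in source_a}
    return [p for p in source_b if match_key(p) in seen]

# Example: two tiny extracts; in practice, far richer attributes (or a
# probabilistic model such as Fellegi-Sunter) are needed to link reliably.
truven_like = [{"birth_date": "1980-02-14", "sex": "F", "zip3": "100"}]
optum_like  = [{"birth_date": "1980-02-14", "sex": "F", "zip3": "100"},
               {"birth_date": "1975-07-01", "sex": "M", "zip3": "941"}]
print(find_duplicates(truven_like, optum_like))  # one candidate duplicate
```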
S: So if you are not linking data sources into one large database with unique patient records, then why have multiple sources, and so many?
P: It depends on how you want to answer a given research question. Yes, any question could be answered by any single data source, but your confidence level goes up if you can corroborate against other data sources. Since each data source has its own unique attributes and pedigree, the validation is more like putting slightly different lenses against the research question, which in itself can provide additional insights.
S: Is there a list of claims or other data sources ranked by popularity among organizations with an active CDM, so our group (or other stakeholders) can know what to work on first if we want to build data tools by source?
P: On the OHDSI Wiki, we have a sample list, but there’s no single authoritative global list:
http://www.ohdsi.org/web/wiki/doku.php?id=resources:data_network . Some of the more popular licensable datasets include Truven, Optum, CPRD, Premier, and multiple IMS datasets (e.g. Disease Analyzer, THIN, Pharmetrics), as well as other non-licensable data sets (e.g. CMS via ResDAC). Some vendors have EHR data, beyond just claims, including IMS, Cerner, GE Centricity (which became Quintiles), and Humedica (Optum). Often, if an organization is doing research directly with health systems, the OMOP CDM user will have access to EHR data.
S: How often are data sources updated? It would be great to have a matrix of this.
P: It depends. In general, data vendors deliver updates quarterly, semi-annually, or annually, depending on the contract.
S: When syndicated claims data sets are updated, are new records appended, or are old records audited and changed, so that someone who can’t just overwrite their whole database needs to go back and replace what has changed? Do I get a new batch of data each month if I subscribe? Or one big database once, ever, with no updates?
P: Don’t think about it as a closed population where with each update you are only adding new people or visits. Some data vendors will, with an update, overwrite the whole previous dataset. Or you may have a situation where you do get new records, but old records can change, and sometimes whole sets of people can disappear or show up from nowhere. For example, if a payer’s contract with an employer starts or ends, a big group of people suddenly comes in when the contract starts, and when it ends the payer has to delete those patients.
P: One approach is to use the current, up-to-date version of that data source, store (but don’t actively use) the n minus 1 version of that source, and delete the older ones. Users don’t keep these versions forever. Some organizations, customers of the data, definitely will apply the updates incrementally to an existing database. Some data vendors might want customers to delete the old data and do a whole new bulk load. Due to this variation, if you want both the incremental and the drop-and-replace approach, then you need two extraction, transformation and load (ETL) routines: one for incrementals and one for bulk loads. At OHDSI we have advised new organizations in the past to build a really good bulk load and try to find errors in the data and clean them. And sometimes the size of the data just precludes a drop and reload. For example, the Department of Veterans Affairs uses the OMOP CDM, but they do incremental loading; because of their data size and other factors, their bulk load takes a long time to run.
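As a rough illustration of the two refresh strategies described here, the following sketch contrasts a drop-and-reload with an incremental merge over a simple keyed record store; the data structures and function names are hypothetical and are not OHDSI ETL tooling.

```python
# Minimal sketch of the two refresh strategies: a full drop-and-reload
# versus an incremental merge keyed on a record ID. A plain dict stands
# in for a CDM table; everything here is illustrative.

def bulk_reload(new_extract):
    """Drop-and-replace: rebuild the table entirely from the vendor's full refresh."""
    return {rec["id"]: rec for rec in new_extract}

def incremental_merge(current, delta, withdrawn_ids):
    """Incremental update: upsert new or changed records, drop withdrawn ones."""
    merged = dict(current)
    for rec in delta:              # new or changed records overwrite old versions
        merged[rec["id"]] = rec
    for rid in withdrawn_ids:      # e.g. patients removed when an employer contract ends
        merged.pop(rid, None)
    return merged

table = {1: {"id": 1, "status": "old"}}
table = incremental_merge(
    table,
    delta=[{"id": 1, "status": "revised"}, {"id": 2, "status": "new"}],
    withdrawn_ids=[],
)
print(table)  # record 1 updated in place, record 2 appended
```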
S: When a study is proposed in OHDSI, and some number of institutions decide to run that analysis and cohort against their data, and their results come back and are sent to the study originator or principal investigator (PI), isn’t that again a duplication issue? To wit, if site 1 has the Humana/Truven/whatever data set, and site 2 also bought that same data, then when they return results, won’t the databases that are most shared across sites overrepresent certain patients versus the patients in the datasets owned by only one site?
P: A result is a set of aggregate summary statistics telling you the answer to the research question from the perspective of that source. This is not patient-level data you are getting back from a site participating in the study; it is rather a summarization of what was learned. It is often an estimated incidence of the event. These are, in part, rates of an outcome. My Truven data and another person’s Truven data may have different results, because each of us will get a different cut of the data, because how we designed our ETLs differs, and for other reasons.
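To illustrate the kind of aggregate a participating site might return, here is a small sketch that reduces a cohort to counts and an incidence proportion; the cohort structure and function are hypothetical, not the output of an actual OHDSI study package.

```python
# Sketch of the kind of aggregate a site might return: counts and an
# incidence proportion rather than patient-level rows. Illustrative only.

def incidence_summary(cohort_person_ids, outcome_person_ids):
    """Summarize a cohort as counts plus an incidence proportion."""
    n_persons = len(cohort_person_ids)
    n_outcomes = sum(1 for pid in cohort_person_ids if pid in outcome_person_ids)
    rate = n_outcomes / n_persons if n_persons else 0.0
    return {"persons": n_persons, "outcomes": n_outcomes, "incidence": rate}

site_cohort = {101, 102, 103, 104, 105}   # person IDs in the cohort at this site
site_outcomes = {102, 105}                # person IDs with the outcome
print(incidence_summary(site_cohort, site_outcomes))
# -> {'persons': 5, 'outcomes': 2, 'incidence': 0.4}; no patient rows leave the site
```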
S: Is there an active data set under the OHDSI tools on the web? That is, if I go into Achilles or ATLAS now, is it just to teach me how to use those tools if I were to have an active CDM, or is there a fake active CDM with dummy data (or real data) that I can play with and actually find health insights (like how many people in 2013 in France had acne)?
P: The active data set under ATLAS uses SynPUF, simulated data from the Centers for Medicare and Medicaid Services (CMS), which has made a dummy claims data set available for software development. Under Achilles, we have a community contribution based on SynPUF.
S: What is the usual time lag in data – if I buy data today, is it from 2012? 2015?
P: This depends on the source. For commercial vendors, there is usually a 3 to 6 month lag from when the patient visits occurred, and then another 3 to 6 month lag beyond that for the data to be processed as complete and delivered to an OMOP CDM user. Today, Truven’s end-of-year release, delivered early next year, may be good through 2016, but any data after that might change in the future; the newer data might be faulty or incomplete. Some of the government data has a multi-year lag; for example, for the state Medicaid data, the latest data might be 2014.
S: What do you guys think of the Sage Synapse repository for shared data and shared studies?
P: This is an opportunity area within OHDSI. There are multiple publicly available data sets, each in its own structure. If someone converted these to the common model, that would be a value to those who built it and to the community. HCUP is hospital data. NHANES is cross-sectional data. MIMIC is out there. Having these, as examples, standardized and coordinated in one place would be very helpful. And if you wanted to contribute that, I would point you to the recorded tutorial on how to do ETLs that we made at the 2016 Annual OHDSI Symposium.
S: What are the top 5 objections that someone who isn’t a fan of OHDSI, such as a director of epidemiology or pharmaco-epi or a CER researcher, is likely to throw at me, and how do you respond?
P: One objection is ‘Standardization of data produces information loss.’ People are scared of information loss. However, what we’ve found is that standardization imposes data quality steps that improve the quality of the data. We’ve never found that we lost good data when we moved it through a standardization process. Every single time, the data we found and removed was data we learned needed to be removed; standardization points that data out quickly.
P: Another objection is ‘Epidemiology is an art, not a science, so you can’t take an engineering approach to solving clinical problems.’ We would respond, for multiple reasons, that it needs to be a science. It needs to be reproducible and empirical.
P: A third objection we hear is ‘Every database is unique and has to be handled uniquely, so applying a common approach is not a good idea.’ Our response would be that for every single data source we come across, when we do discovery to build the ETL, it always starts with ‘yeah, but this database is going to be unique and it won’t fit’; yet we now have 50 data sets that have been converted to the OMOP CDM, and none of them actually had a unique feature that made the CDM not work. Yes, that is in part because the model is flexible. Just because a model is standard doesn’t mean you impose restrictions that keep data out; rather, it means that you are applying flexibility, or conventions, in a standard way.
P: A fourth objection is ‘Gee, that’s a lot of work, and standardization sounds like a lot of effort and I don’t see the return.’ We would respond that it may seem like a lot of work; however, the OMOP CDM applies a strict information model, which puts the hard work up front in order to optimize results later. Do you impose the effort early to reap benefits later, or impose no structure and then have the hard part, cleaning and organizing the data, at the end of the process? Yes, it might be easier to get it into i2b2 since you don’t have to impose a standard set of vocabularies, but you are kicking the can down the road, and the analysis becomes hard because the vocabularies weren’t standardized early.
P: A fifth objection we might get is ‘I’ve gotten familiar with how I do it now, so why change? I have my data in SAS data sets and I have a SAS programmer, so don’t make me change those.’ When it’s an organization that only has access to one data set, why would they standardize? That’s when we make the argument: wouldn’t it be cool if you could access other data sets?
S: Thank you Patrick for your thorough answers and your time.