
How much data is sufficient

As OMOP has become a de facto data analytics standard across pharma and healthcare, the data grow larger every year, which puts a lot of strain on infrastructure, processing time, and complexity, as well as on the overall cost of data storage and processing.

For example, one of the most popular claims data sets is Truven MarketScan, which can go back as far as 1995. Personally, I do not believe that going back to 1995 makes sense, for several reasons including data quality. At the same time, what should the parameters be for determining a reasonable cutoff date? How far back is enough: 2010, 2015?

Of course, one of the main input variables here is the type of research (condition, drug of interest, etc.), but I would really appreciate it if folks would comment with a more detailed and logical approach to how they tackle this problem in their organizations.

@Patrick_Ryan @Christian_Reich @mvanzandt


This is a useful topic, with the likely unfortunate answer of ‘it depends’.

It is quite often the case that the studies we are undertaking have imposed calendar date restrictions (start and end dates). For network studies, if there is a desire to look at effects in relation to time, then one needs to consider the overlapping time across databases, and there the lowest common denominator is unlikely to span back into the ’80s, ’90s, or even the 2000s.

Further, in the context of CohortMethod, the default value of one parameter restricts calendar time to the periods where both target and comparator are observable. That makes good sense from a counterfactual reasoning perspective, but it means you are always limiting to the overlap period, so some time is cut off. This is particularly pronounced if you are comparing a newer medical intervention with an established standard of care, where you will find yourself limited to the period after the new product's introduction.
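To make that overlap restriction concrete, here is a minimal Python sketch (not the actual CohortMethod implementation; all windows and dates are hypothetical) of limiting index dates to the calendar period where both cohorts are observable:

```python
# Minimal sketch, not CohortMethod itself: restrict to the calendar window where
# target AND comparator are both observable, dropping index dates outside the overlap.
from datetime import date

def common_period(target_window, comparator_window):
    """Return the (start, end) overlap of two (start, end) calendar windows, or None."""
    start = max(target_window[0], comparator_window[0])
    end = min(target_window[1], comparator_window[1])
    return (start, end) if start <= end else None

def restrict_to_common_period(index_dates, window):
    """Keep only index dates that fall inside the shared observable window."""
    start, end = window
    return [d for d in index_dates if start <= d <= end]

# Hypothetical example: comparator (established drug) observable from 2005, target
# (newer drug) only from 2018 -- everything before the launch gets cut off.
window = common_period((date(2018, 1, 1), date(2023, 12, 31)),
                       (date(2005, 1, 1), date(2023, 12, 31)))
print(window)  # (datetime.date(2018, 1, 1), datetime.date(2023, 12, 31))
```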

That said, it can be useful to examine trends in healthcare behavior and utilization over time, and having a reasonably long historical lens can be useful for observing macro-level changes. For example, treatment pathways can identify guideline concordance, and it's often expected for there to be a multi-year delay as diffusion of knowledge/recommendations makes its way through the healthcare system. At a high level, it could be reasonable to want to make statements about the temporal change in the incidence or prevalence of a disease over the last 30 years.

But as you mention, data quality can have a big impact here, as it is reasonable to expect that data capture processes have changed substantially over time. In the context of US claims like MarketScan, you can have changes in the billing forms, changes in the data summarization (e.g. how many diagnoses are covered), changes in the vocabularies (ICD9CM → ICD10CM in October 2015), changes in insurance/formulary coverage, changes in the treatments available (and in the case of drugs, which drugs are on formulary, branded vs. generic, over-the-counter), etc. These process changes need to be considered when interpreting temporal trends, to make sure they aren't biasing the health-related inferences you are aiming to make.
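One way to see such capture-process changes in your own data is to profile the source vocabulary mix by calendar year; the ICD9CM → ICD10CM cutover shows up as an abrupt shift around October 2015. A minimal Python sketch, with hypothetical records standing in for rows from condition_occurrence joined to the source concept's vocabulary:

```python
# Sketch only: profile how the source vocabulary mix shifts by year, to spot
# capture-process changes such as the ICD9CM -> ICD10CM cutover in Oct 2015.
from collections import Counter, defaultdict
from datetime import date

records = [  # (condition_start_date, source vocabulary_id) -- hypothetical rows
    (date(2014, 5, 2), "ICD9CM"),
    (date(2015, 9, 30), "ICD9CM"),
    (date(2015, 10, 15), "ICD10CM"),
    (date(2016, 3, 1), "ICD10CM"),
]

by_year = defaultdict(Counter)
for start_date, vocabulary_id in records:
    by_year[start_date.year][vocabulary_id] += 1

for year in sorted(by_year):
    counts = by_year[year]
    total = sum(counts.values())
    shares = {vocab: f"{n / total:.0%}" for vocab, n in counts.items()}
    print(year, shares)
```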

When we start thinking about discarding the 20th century as too far in the distance to be relevant, it’s making me feel very old so I have to push back on principle alone. Particularly given that all the good music came from back then :slight_smile:


How do we handle it? Lots and lots of branching logic and loops to clean up the data… or to ‘find’ LOINC codes. You know, as recently as 2017 (pulling this out of the back of my head), some labs were not even including LOINC codes with the results they sent back to us.

EDIT: and jump tables. Having done this for several years, we see patterns: when a certain combination of data shows up, we know it equates to a certain code.
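For what it's worth, a stripped-down Python sketch of that jump-table idea (the table entries and field names are illustrative, not our production mapping):

```python
from typing import Optional

# "Jump table": a known combination of result fields maps to the LOINC code that
# combination has consistently meant in past data. Entries are illustrative only.
LOINC_JUMP_TABLE = {
    # (test name, specimen, units) -> LOINC code
    ("glucose", "serum", "mg/dL"): "2345-7",
    ("hemoglobin a1c", "blood", "%"): "4548-4",
}

def resolve_loinc(test_name: str, specimen: str, units: str) -> Optional[str]:
    """Return a LOINC code for a known field combination, or None if unmapped."""
    key = (test_name.strip().lower(), specimen.strip().lower(), units.strip())
    return LOINC_JUMP_TABLE.get(key)

print(resolve_loinc("Glucose", "Serum", "mg/dL"))  # 2345-7
print(resolve_loinc("Sodium", "Serum", "mmol/L"))  # None -> falls through to manual review
```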

Yeah, Monday morning, I realize how bad my grammar is above.
