Prevalence by index year vs incidence

Hello all,

I have some theoretical questions regarding prevalence analysis with claims data in Atlas. Although claims data are not ideal for estimating disease prevalence, I am trying to see the proportion of the US population diagnosed with the disease within a certain time period (2016-2019). When I calculate prevalence estimates by earliest event of the condition and then use cohort characterizations to split by index year, does that correspond to the number of patients newly diagnosed with that disease in that year? Would that not theoretically correspond to the numerator of the incidence estimate (especially if I pick the first-time-in-patient-history criterion)? With respect to that, how does the Incidence Rates tab in Atlas calculate the numerator differently within a particular year?

In addition, while looking into prevalence in the general population, I would like to get your opinion on how to implement different lookback periods. Prior continuous observation is not applicable when using the observation period as the cohort entry. Also, when I pick 2019 as the observation period start date, does that only include those with their initial activation in the database in 2019, or does it also include people who were already active in the database beforehand? If the former, I believe that is not very clear in the UI.

Please let me know if there is something I fundamentally misinterpret about performing incidence and prevalence analyses in Atlas. I am very new to Atlas and trying to implement the analysis in the best way possible, so any input pointing me in the right direction would be appreciated.

You are trying to calculate a period prevalence, i.e. the total time people had the disease divided by the time at risk of the population in that period (one year). You need to figure out the numerator and the denominator from the data. Let’s dissect:

Numerator. You need to know all cases (patients falling ill with the disease) and the duration after which the condition has been resolved. Both depend on what kind of condition you are talking about:

(i) acute onsets (e.g. a trauma),
(ii) diseases that come and go (the flu),
(iii) chronic conditions that stay with you once you have them (diabetes).

For the (i) cases, prevalence makes no sense, because those are state changes without a defined duration, and incidence rates are all you need. For (iii) it is trivial, because it’s just the number of patients with the disease (earliest event any time before the end of your index year) divided by the size of the population. So, the only tricky situations are type (ii) diseases. The nomenclature is often very ambiguous. For example, the act of breaking a leg is a type (i) onset, but having a broken leg that takes 4 weeks in a cast to heal is a type (ii). “Asthma” is used interchangeably for the type (i) attack (patient can’t breathe and gets rushed to the ER), for a status asthmaticus (an ongoing inability to breathe for days), which would be considered a type (ii) case, and for the type (iii) susceptibility to asthma attacks in general.
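To make the type (iii) calculation concrete, here is a minimal Python sketch. The function name, the dictionary of earliest diagnosis dates, and the example numbers are all made up for illustration; this is not how Atlas computes anything internally.

```python
from datetime import date

def chronic_prevalence(earliest_diagnosis, population, index_year_end):
    """Share of `population` whose earliest diagnosis is on or before `index_year_end`.

    earliest_diagnosis: dict mapping person_id -> date of the first disease record
    population: set of person_ids that make up the denominator
    """
    cases = {
        person
        for person, first_date in earliest_diagnosis.items()
        if person in population and first_date <= index_year_end
    }
    return len(cases) / len(population)

# Hypothetical data: four people in the denominator, three ever diagnosed.
population = {1, 2, 3, 4}
earliest_diagnosis = {
    1: date(2015, 3, 10),  # diagnosed before the index year, still counts
    2: date(2019, 6, 1),   # diagnosed during the index year
    3: date(2020, 1, 15),  # diagnosed after the index year, does not count
}
prevalence_2019 = chronic_prevalence(earliest_diagnosis, population, date(2019, 12, 31))
print(prevalence_2019)
```

Note that person 1 counts for 2019 even though the first record is from 2015, which is exactly why this number is not the same as the incidence numerator for that year.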

For type (ii) diseases, to calculate the prevalence, you would need to sum up the duration of each case in your index year, and divide that by the size of your population times the time at risk (Observation Period in that year). Claims data don’t make that easy, because they don’t say when the disease starts or ends. Instead, each time there is money to be claimed you get a record. But such a record will not tell you if it is a brand new case or still belongs to the previous record (disease still ongoing). Conversely, the absence of a record does not necessarily mean the disease is no longer present. You therefore have to do some work:

  • The simplest solution is to consider each record as a new case (index criterion for your cohort) and assume a standard duration. This only works if the records are rare and the duration can be reasonably assumed. If that’s not the case, you need one of the following approaches.
  • You create some heuristic for the cohort start date (e.g. diagnosis in timely context with some diagnostic procedure), or end date (e.g. curative procedure).
  • You start at the first record and make the assumption that after a certain amount of time without any record the disease is over. Atlas lets you chain records till you get a large enough gap.

Denominator. If the population is defined by you (e.g. if you want to calculate the prevalence of complications in your diabetes patients) it is straightforward. If you want to just use all patients in the database you can use the Observation Period as a cohort definition. But if you need the true prevalence in the general population you need to “project” or extrapolate your cases. Atlas does not have that functionality today. In order to do that, you need some mechanism to estimate your sampling rate. Often, that is done by using the providers, whose total number is known at the national level. A better estimate can be achieved by stratifying the extrapolations by provider specialty, the national counts for which can also be obtained independently.
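As a rough sketch of what such a projection could look like outside of Atlas: assume the sampling rate in each specialty is the number of providers in the database divided by the national number of providers, and scale the observed cases accordingly. The function name, the specialties and all counts below are hypothetical, and a real projection methodology would need much more care.

```python
def project_cases(strata):
    """Extrapolate observed cases to a national total, one specialty at a time.

    strata: dict mapping specialty -> (observed_cases, providers_in_db, providers_national)
    Assumed sampling rate per specialty: providers_in_db / providers_national.
    """
    total = 0.0
    for observed_cases, providers_in_db, providers_national in strata.values():
        # Dividing by the sampling rate is the same as multiplying by its inverse.
        total += observed_cases * providers_national / providers_in_db
    return total

# Hypothetical counts, for illustration only.
strata = {
    "pediatrics": (50, 10, 100),     # 10% of providers captured
    "otolaryngology": (30, 5, 25),   # 20% of providers captured
}
projected = project_cases(strata)
print(projected)
```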

You have to be careful with that, because prevalent cases may start in the year before, or because of other artifacts. The first-time criterion does not make sense for prevalence calculations. It is used for first-occurrence incidence rates of a disease.

You’d only worry about that in the denominator for type (iii) cases, like diabetes, where patients freshly added to the database have not had enough Observation Period in the past for the chronic disease to be captured. As a result, the patient would show up as a false negative in the denominator, rather than in the numerator. I would not worry too much about that misclassification. Chronic diseases bother the patients and tend to lead to records.

Thanks a lot @Christian_Reich, this was very helpful. If you do not mind, I have a few follow-up questions.

In fact, I am also struggling to place the disease in one of those three categories. I am interested in sleep apnea, however, since I am looking in the pediatric population, the duration of disease varies quite a lot depending on surgery outcome or patients outgrowing disease. At this point, because I am only looking for a rough estimate, I would go for the simple solution that you are offering here:

However, I am not really clear on how to perform that. Let’s say that I assume the disease duration to be around 3 years. Does it mean that I would look at the earliest event any time before the end of the specified 3-year period? Is this essentially the same as the trivial calculation you have offered for type (iii) diseases, except for a more defined time window? Is there anything else that I would need to do in Atlas?

At the moment, my aim is to capture all people in the database within that specified period in the denominator by using Observation Period. When I look into the number of patients with a disease record within the period of interest, that number is not a true subpopulation of the cohort I create with Observation Period criteria. It captures far more people than when I try to create subgroup analyses in Characterizations (which I know is not the right way to go about the analysis, but I wanted to check whether there were major discrepancies). Hence, it would seem that either not everyone I define as diseased is part of the denominator, or people are counted more than once. Do you believe that is an artifact of the approach, or is there something that I am fundamentally wrong about here?

On another point, how do you feel about getting prevalence estimates via incidence rates in a defined period of suspected disease duration? Do you think reliable estimates could be obtained by multiplying the incidence rate with the duration of the disease? I am asking this question as the incidence numbers seem more robust for my analysis using claims data, but I also understand that this might be a stretch.

Thank you very much again, I find Atlas to be a useful tool but I am still trying to get to understand how to best utilize it for different research questions.

I agree, it is a (ii) case (disease comes and goes), but a fixed duration seems a bad assumption. Children grow out of it or have surgery. This should be the criterion for the end_date of the cohort. You would end the cohort when there was no record for, say, a year, or you would censor it using a procedure concept.

Not sure I understand. You create a cohort of people with an Observation Period for the denominator. And you create a cohort of people with the disease, which automatically assumes an active Observation Period for the numerator. The latter will be a subset of the former.

Which is fine. The numerator contains cases (not patients) with their duration. Two cases could be contributed by two different patients, or by one patient who has the disease twice in a row.

You can, but it is dirty, because patients could have developed the disease before the index year, and then you would miss them. Also, if the condition is not rare, another incidence of the disease could fall within the duration of the previous one, at which point you should not count it. So, for a back of the envelope calculation it would be ok, but the right thing to do is to get that end date figured out. No easy way out. :slight_smile:
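For reference, the back-of-the-envelope version rests on the textbook steady-state relationship between prevalence, incidence rate and mean duration. The symbols $P$, $I$ and $\bar{D}$ are introduced here for illustration, and the relationship only holds if incidence and duration are roughly stable over time:

```latex
\frac{P}{1 - P} = I \times \bar{D}
\quad\Longrightarrow\quad
P \approx I \times \bar{D} \quad \text{(when } P \text{ is small)}
```

With purely hypothetical numbers, say $I = 0.01$ per person-year and $\bar{D} = 3$ years, the approximation gives $P \approx 0.03$, while the exact form gives $P = 0.03 / 1.03 \approx 0.029$.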

Thank you again @Christian_Reich for this great explanation and for bearing with my questions (not all of which probably make sense).

I agree that the fixed duration is a bad assumption. However, I would like to capture a wide enough period during which I can still account for patients who might have residual disease. Those who did not recover after surgery, or those who did or did not grow out of the disease, are still of importance. While I understand how to censor the events using a surgery procedure in the Atlas UI, I am not aware of how to censor on the condition of no record for 1 year. It would be great if you could direct me to the right OHDSI source to explore that further.

What I was referring to here is that when I use the Characterizations tab and enter the disease diagnosis as a subgroup for the cohort created with Observation Period, fewer cases are captured than when I directly run the disease cohort for the same timeframe.

Is this also fine in the context of prevalence calculation? If the denominator does not count people more than once, then the numerator cannot be a subset of the denominator. When there are new cases due to a new claim being made not particularly due to a re-diagnosis, would it be accurate to count that as a new case? Is it not better to use the number of patients for the numerator? I was also confused by Atlas UI which labels case counts as People in the Generation tab of Cohort Definition.

My interest would be to retrieve the number of people in a certain period who are possibly living with that disease. By picking a predetermined period like 3 years, I was hoping to eliminate some of the variability due to changes in the disease state. I am aware that this is not a sophisticated way of going about it, however I still try to get an understanding of the disease burden within a defined timeframe.

You may just treat it as a type (iii) disease: Apnea patients stay with the disease indefinitely. That would make your life easier.

I believe the same way. You add a censoring event of having no Condition Occurrence of apnea.

That’s a question for @Chris_Knoll, I’d say.

Sure it can. Because remember: It’s people-time. In the numerator, each case contributes its duration; in the denominator, each patient contributes an entire year. So, you can have two patients with 3 months’ worth of sleep apnea each, or one patient with 6 months, and the prevalence would be the same.
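Spelled out as arithmetic, under the assumption that the denominator consists of $N$ patients who are each observed for the full 12 months of the index year ($N$ is a symbol introduced here):

```latex
\underbrace{\frac{3 + 3}{12\,N}}_{\text{two patients, 3 months each}}
\;=\;
\underbrace{\frac{6}{12\,N}}_{\text{one patient, 6 months}}
\;=\;
\frac{1}{2N}
```

The numerator is measured in months of disease and the denominator in months of observation, so it does not matter how the disease months are distributed across patients.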

As discussed before. You need to define the end date. The very first claim after the end date of the previous case will be a new case.

It does both.

Yes, that’s case (iii) chronic disease. If you do that it becomes simple, because both the times in the numerator and denominator are one year, and the prevalence becomes people with disease over all people.

I should correct myself. In the denominator, there could also be multiple occurrences of the same patient if the Observation Period is split into pieces. So, in essence you are summing up the disease case durations overlapping with the index year and dividing that by the sum of the Observation Periods overlapping with the index year.
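A minimal Python sketch of that overlap calculation, assuming you have already exported case intervals and Observation Period intervals as (start, end) date pairs. The function names and the example dates are hypothetical, and end dates are treated as inclusive.

```python
from datetime import date

def overlap_days(start, end, window_start, window_end):
    """Number of days (inclusive) that [start, end] shares with the window."""
    first = max(start, window_start)
    last = min(end, window_end)
    return max((last - first).days + 1, 0)

def period_prevalence(case_intervals, observation_periods, window_start, window_end):
    """Disease days within the window divided by observed days within the window."""
    disease_days = sum(
        overlap_days(s, e, window_start, window_end) for s, e in case_intervals
    )
    observed_days = sum(
        overlap_days(s, e, window_start, window_end) for s, e in observation_periods
    )
    return disease_days / observed_days

year_start, year_end = date(2018, 1, 1), date(2018, 12, 31)

# One hypothetical patient whose Observation Period is split into two pieces.
observation_periods = [
    (date(2017, 6, 1), date(2018, 3, 31)),   # 90 days fall inside 2018
    (date(2018, 7, 1), date(2018, 12, 31)),  # 184 days fall inside 2018
]
# Two hypothetical cases; the first one started in the previous year.
case_intervals = [
    (date(2017, 11, 1), date(2018, 1, 31)),  # only January 2018 counts: 31 days
    (date(2018, 8, 1), date(2018, 8, 31)),   # 31 days
]
prevalence = period_prevalence(case_intervals, observation_periods, year_start, year_end)
print(prevalence)  # 62 disease days over 274 observed days
```

The case that started in 2017 still contributes its 2018 portion to the numerator, and the split Observation Period contributes both pieces to the denominator.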

Thank you very much for your help - I really appreciate it and hope that I did not take too much of your time asking for additional questions. I believe for now I will stick to the assumption that it is a type (iii) disease.

I would be glad to get @Chris_Knoll’s input on the discrepancy in the Characterizations tab in Atlas. Additionally, when I implement censoring criteria for occurrence of apnea and surgery, I seem to obtain the same numbers as without censoring, which does not seem right.

I must admit that I am slightly confused by the approach of calculating prevalence with people-time. My aim was to retrieve data based on the number of patients within a certain period of time with respect to the number of people enrolled in the database, which is also why I have implemented the ‘earliest event’ restriction in Atlas. Would the people-time approach be relevant for type (ii) diseases? If so, would that not be better obtained from the Incidence Rates tab, with a cohort definition that ensures people with a prior diagnosis are also counted?

As you have also mentioned the possibility of multiple Observation Periods: is there a way to tackle that in the CDM and ensure that patients are counted only once? I understand that this is a naive question and I know of the limitations inherent in claims data. However, I still wanted to get your opinion on what could be a more reliable way to calculate prevalence in the general population.

Sure, what is the discrepancy?

The discrepancy I refer to is probably related to something that I conceptually get wrong about Atlas, and I believe this might be a naive question, so apologies if it does not make sense.

I realized that when I construct a cohort using the observation period to capture the general population, enter that in the Characterizations tab in Atlas under Cohort definition, and enter the diagnosis as a Subgroup, I get a considerably lower number of patients with the disease captured within that cohort in comparison to the number I get when I construct the disease cohort under Cohort Definitions. I would expect those who were diagnosed with the disease to have an active observation period in that timeframe. (I use the earliest event criterion for the diagnosis, to obtain the number of persons and not events.) I would really appreciate it if you could help me understand why this disease cohort is not a subset of the cohort created by observation period. This is relevant for me as I would like to find a way to address prevalence within the population in a certain timeframe.

Another discrepancy I often face in the Characterizations tab arises when I define an additional condition following the prior disease diagnosis by using subgroups vs. when I define the same condition as an inclusion criterion within Cohort Definitions. Which method do you think is the best way to capture the population with a diagnosis that requires another prior event: using Characterizations or Cohort Definitions? I can see that the answer might depend on the complexity of the study design and the research question; however, I am simply trying to assess the best method for a prevalence estimate.

I believe these questions are related to my conceptual misunderstanding, so I would greatly benefit from your input on how to utilize Atlas in the best way possible. All of my questions concern a US claims dataset.

Thank you and Happy New Year!
