OHDSI Home | Forums | Wiki | Github

How to use Atlas to select all patients in a table - Primary Events validation

Hello Everyone,

After watching YouTube tutorial on Atlas, I am trying to select all unique patients in my condition_occurrence table through Atlas like as shown below. So, I didn’t impose any observation period criteria but still I don’t get the same count of records as in my database.

  1. Earliest event - one record per patient - I expect to see 3712 but it’s 3236

  1. All events - All patient records - I expect to see 3712 records but its again 3236.

As you can see, I have changed the restrict initial events section. I also verified the SQL generated manually. and the only difference is where the filter is applied for earliest event WHERE P.ordinal = 1. Query does do select * from condition_occurrence.

My cohort exit is end of continuous observation, so there should be atmost one record for each person. Right?

My observation periods are based on visit start and end dates. So, I don’t see any problem due to this.

When my objective is only to select all patients from a specific table, do I still have to be concerned about concept_ids?

Am I missing something here?

Here is the cohort created in public atlas click here

@Chris_Knoll - Can you help?

Hi @Akshay,

There are some recent posts with the answer to your question.



@gyeol99 - Thanks for the links but unfortunately I am still not clear.I am not confused between records and person/people.

When I issue a select count(distinct person_id) from cdm.condition_occurrence, I get 3712 unique persons. So here I am trying to reproduce the same thing. So I have chosen earliest event per person.

And regarding the cohort exit, I have chosen end of continuous observation. I guess it doesn’t make any difference . Because I have chosen only earliest event per person.

Now regarding all events per person, same as above, only difference is I have chosen all events in the drop down and cohort exit is end of continuous observation. So in this case, based on Atlas logic, if it collapses multiple episodes of a person into one episode, even then I should see 3712 persons and 3712 records. Right?

Why is that I don’t see that?

Am I still missing something here?

Only condition occurrence events that fall within an observation period will be identified.

Change your query to:

select count(distinct co.person_id) from cdm.condition_occurrence co
join cdm.observation_period op on co.person_id = op.person_id
  and co.condition_start_date >= op.observation_period_start_date
  and co.condition_start_date <= op.observation_period_end_date

See if that gives you 3236 people.

Second, you may get records > people if you have multiple observation periods per person, as each cohort episode will be created for each observation period (if multiple OPs exist with the condition).

1 Like

@Chris_Knoll - Ahh I see. yes, it gives the same count. Wow…But I have a quick question. In our case, we chose minimum of visit start date as observation_period_start_date and maximum of visit_end_date (assuming the patient had multiple visits to hospital) asobservation_period_end_date`. In this case, I am just thinking how can a record be beyond these limits? Is it even possible?

Am I doing anything incorrect?

I did see the observation period clause in generated SQL but wasn’t expecting this based on our observation period table definition

Can you provide an example to help me understand this if possible?

This could be a problem with your ETL of the data. You could write a quick query to find out if there are any conditions that occur outside of a visit? There must be something given your query of the Condition Occurrence table…without knowing your ETL, I can’t explain how that would happen. Or, it’s possible that your diagnosis are found independently of a visit…maybe something patient reported? I don’t know.

Cohort’s aren’t about ‘how many people are there?’…cohorts are about ‘when do patients satisfy criteria’. So, when we identify a person that should appear in a cohort, and we say ‘stay in until the end of observation’, this requires those events that contribute time to cohort episodes to be contained within an observation period.

@Chris_Knoll - Thanks for your help. But May I please check with you on which of the below is more accurate/correct?

Query 1

select distinct person_id from cdm.condition_occurrence where person_id not in (select distinct co.person_id from 
cdm.condition_occurrence co
join cdm.observation_period op on co.person_id = op.person_id
and **co.condition_start_date** >= op.observation_period_start_date    # cond 1
and **co.condition_end_date** <= op.observation_period_end_date)     # cond 2

Query 2

select distinct person_id from cdm.condition_occurrence where person_id not in (select distinct co.person_id from 
cdm.condition_occurrence co
join cdm.observation_period op on co.person_id = op.person_id
and **co.condition_start_date** >= op.observation_period_start_date    # cond 1
and **co.condition_start_date** <= op.observation_period_end_date)    # cond 2

Because both doesn’t yield the same results. You chose the 2nd query because you think it is possible that patients’ drug exposure or condition end dates may or may not be recorded in hospital EHR system beyond observation periods?

Let me just give an example. So that I can verify and correct our ETL.

Patient A comes to hospital on 1st Jan 2018 9:00 AM and was diagnosed for T2DM (for the first time) on 1st Jan 10:45 AM. Later he visits hospital multiple times on 1st Feb 9 AM and leaves on 1st Feb 4 PM, 1st Mar 9AM and leaves on 2nd Mar 4 PM ,1st Apr 9AM and leaves on 3rd Apr 4 PM. and every time he visits hospital for the same condition, I don’t think there will not be multiple records from him condition_occurrence table. Now his observation period start date would be 1st Jan 2018 and observation period end date would be 3rd Apr 2018.

So now, we have the patient A’s full hospital visit journey. So your reason for choosing domain_table_start_date in where clause (as shown in 2nd query) for comparing with observation period dates is because you think that patient condition might go on beyond the hospital visit data that we might have. Because he may not turn up again to the same hospital.

But based on cdm logic, even if a patient doesn’t have condition_end_date we make it as condition_start_date + 1 day. So why can’t we use it as co.condition_end_date <= op.observation_period_end_date in 2nd query as well?

Another question on this, Might be I am wrong. Would you mind to help me understand this better? I felt that whenever a condition is recorded in EHR system, it is within a specific visit. For example, A patient entering hospital is recorded as in time and diagnosis is recorded in dignosis/patient records table and exit is recorded again as out time . So I felt patient’s all interaction with hospital is always accompanied by a visit or recorded within visit duration. But like you mentioned above, I see few scenarios here

  1. a patient has a condition but it is not within his visit duration/observation period. (person is present in both condition and visit tables but dates mismatch) Is it possible to have a condition without a corresponding visit? Am I right to understand that this is data issue or scenarios like this is possible? Any simple example like above if you have time?
  2. a patient has a condition but he doesn’t have any visit record at all. (person is present in condition but absent (o records) in visit table). same as above. without entering hospital, can he have a condition record alone? When can this be possible? Any simple example like above if you have time?

So, when you mean diagnosis are independent of visit like we see in our data, this can only be checked with hospital institution (Data provider) on how is this possible and we can’t do anything about it.Right?

But when you mean patient reported diagnosis , does it mean the hospital has just captured the what the patient said without any tests (because the patient had undergone the tests already in some other hospital and knows that he has been diagnosed of T2DM and the new hospital just records it in their system)? Might be a silly question. But trying to learn and understand.

The end date of the condition isn’t really important for the cohort definition query: we only use events that start within an observation period. When we determine the duration of the cohort episode, the episode will be censored at the end of the containing observation period (if the event persistence exceeds the observation period boundary). In CDM terms, the observation period asserts when you believe the person is under direct observation / has presence in the dataset. if the person’s observation period ends, then the duration in the cohort also ends, even if the condition observation/ drug exposure is recorded as persisting past the observation period end date. This is just a fundamental principle of the observation period in the CDM, and the cohort definitions honor the observation period boundaries when building the cohort episodes.

It’s simply a matter of finding only events that start inside the observation period. Conversely, if an event starts before the observation period, but ends during an observation period, we won’t find it either. We only pick events where we can establish the starting point within an observation period.

The remainder of your questions, I can’t answer for you, you have to refer to the ETL of your data source to understand how diagnoses and drug exposures are captured, their relation to visits, and how visits become observation periods.Your statement that:

Why not? Every time a person comes into the hospital, are they not going to write down their chief complaint if it was recorded sometime in a prior visit? I don’t think you should assume perfect capture of the data.

I think you have a good QC query where you can identify those people that do not have at least one domain event contained within an observation period (select person_id where not exists (...)). the question is, why does this data condition exist, and do you need to rethink your OBSERVATION_PERIOD construction logic?

I’m not an expert on EHR systems, or transforming those into the CDM model, but I’m sure there’s people in the community that have had to face this challenge, and may give you some thoughts and strategies for solving this puzzle. There’s probably not a ‘best solution’, you just have to be clear about the consequences of your design, and how you should consider it when performing your observational study.

1 Like

Yes. I understand now. If we would like to include all records, only way is modify observation period logic for patients to something like 01-01-1899 to 01-01-2999 kind of thing. But are there any downsides to it?

Very much appreciate your time and inputs

Yes: if you say the OP end is 01-01-2999 you will find that people in your cohort will have centuries of person-time in your cohort.

Edit: by ‘people having centuries of cohort time’ I mean one person diagnosed in 2018 will have centuries of cohort time if their observation period ends in 2999.


This problem is similar to cohorts based on e.g. a genomic marker. It has always existed since birth. What do we tell people there, @Chris_Knoll?

I tell them that you can define inclusion criteria that ignores observation period boundaries, and put the event (the observation of the genomic marker) in the month/year of their birth. I tell them that if they are confident that they are truly observing this person from birth, then define the Observation Period as such. Regardless, you will need to pick the events that were during observation from which the cohort episodes are built.

That works. We may want to explain that in the Book chapter. @krfeeney?

1 Like

Added to the backlog for V2 revisions. :wink:

1 Like

Hi @Chris_Knoll,

I was trying to inspect the generated SQL in Atlas. Even though I have set the limit initial events to “All Events per person”, I see the where clause with ordinal = 1.

For example - The below query just creates an empty table which is fine

CREATE TEMP TABLE inclusion_events  (inclusion_rule_id bigint,
person_id bigint,
event_id bigint); 

But in the below query, May I know why do we have ordinal = 1 in the where clause? For All events per person?

CREATE TEMP TABLE included_events

WITH cteIncludedEvents(event_id, person_id, start_date, end_date, op_start_date, op_end_date, ordinal)  
AS (
SELECT event_id, person_id, start_date, end_date, op_start_date, op_end_date, row_number() over 
  (partition by person_id order by start_date ASC) as ordinal
 select Q.event_id, Q.person_id, Q.start_date, Q.end_date, Q.op_start_date, Q.op_end_date, 
 SUM(coalesce(POWER(cast(2 as bigint), I.inclusion_rule_id), 0)) as inclusion_rule_mask
  from qualified_events Q
  LEFT JOIN inclusion_events I on I.person_id = Q.person_id and I.event_id = Q.event_id
  GROUP BY Q.event_id, Q.person_id, Q.start_date, Q.end_date, Q.op_start_date, Q.op_end_date
   ) MG -- matching groups

   event_id, person_id, start_date, end_date, op_start_date, op_end_date

  cteIncludedEvents Results
  **WHERE Results.ordinal = 1**;

q1) Why does the generated query pick only one event when our criteria is to use All events per person?

Please note that I don’t have any inclusion criteria. Am just dealing with Initial events/Cohort Entry events SECTION only in Atlas and not gone to inclusion criteria section yet.

q2 ) Just confirming that the naming of temp table as included_events has got nothing to do with Inclusion criteria section. Am I right?

q3) This is how my screen looks like and I guess it doesn’t matter what’s there inside the Restrict Initial Events as I haven’t enlarged/expanded it?


Can you help?

There’s 3 places you can limit events:

  1. The cohort entry events
  2. after applying the entery event restriction
  3. after applying inclusion rules

You probably have another option set to ‘earliest’ elsewhere (lower) in the cohort definition.

1 Like


As shown in the below screenshot,


Eve though I leave section 2 and 3 untouched, it is still going to take default values for limit events?

I mean in this screenshot, I have just expanded only to show it to you but earlier, I didn’t expand Restrict Initial Events or set any inclusion criteria. So even then these fields are still considered with whatever values they have. Guess that’s how it works. got it. Thanks for your help

If you have the initial event restriction collapsed, then it won’t apply the ‘limit events’ function, so you don’t have to expand the ‘initial event restriction’ to turn the limit off.

Basiaclly, if you see it, it is applying it. If you don’t have the ‘initial event restriction’ expanded, you don’t see it, so it is not applying it.