OHDSI Home | Forums | Wiki | Github

[Atlas] Feature analysis - sql codes for existing features


I would like to modify sql codes for an existing feature - “Condition Occurrence Any Time Prior” by adding a time restriction, i.e. 6 month prior, but I get lost in the Github page (attached below)… seems like only sql codes for some of the features are available. Am I missing something? Or, is it possible to get access to the sql codes for the default options in the feature analysis?

Thanks in advance for any advice and help.


My default, the Condition Occurrence Medium Term will use the medium term window, which is 180d (6 months). All the xxx Medium Term analysis will use that, so you can use the included ‘presets’ in Atlas to look for features in that time window.

Thanks Chris!

I thought that the feature (condition occurrence medium term) is designed to search for conditions that occurred six months after the index date. Am I misunderstanding it?

The time window we are interested in is 6 months prior and 6 month after the index date. So I would like to modify the sql codes for this feature and change the time window only. Do you have any suggestion on how to make this happen?


Go into your Characterization nav in Atlas, and you will see that there are two tabs: characerizations and features:

Go into features, and create a new feature. You’re going to select custom sql, and you will paste in this SQL:

-- Custom analysis producing same results as Feature Extraction's "One covariate per condition_concept_id in the condition_occurrence table during the time period prior to index."
select cast(condition_concept_id as bigint)*1000 + @analysis_id as covariate_id,
       c.concept_name as covariate_name,
       condition_concept_id as concept_id,
       count(*) as sum_value,
       count(*)*1.0 / stat.total_cnt*1.0 as average_value
from (select distinct condition_concept_id,
      from @cohort_table cohort
        inner join @cdm_database_schema.condition_occurrence on cohort.subject_id = condition_occurrence.person_id
      where condition_start_date >= dateadd(d,-180,cohort.cohort_start_date)
      and   condition_start_date <= dateadd(d,180,cohort.cohort_start_date)
      and   condition_concept_id != 0
      and   cohort.cohort_definition_id = @cohort_id) condition_entries
  join @cdm_database_schema.concept c on condition_entries.condition_concept_id = c.concept_id
  cross join (select count(*) total_cnt
              from @cohort_table
              where cohort_definition_id = @cohort_id) stat
group by condition_concept_id,

I’ve coded this query to return Condition Occurence records between 180d prior and 180d after index (the cohort start date)

1 Like

This is super helpful!! Much appreciated!

I have a follow up question - every feature I created by importing the sql codes (w/o modification) failed the characterization function and I suspect that it might be related to the problem you mentioned in the other post:

where I need to replace the place holders of @cdm_database_schema and @cohort_table with the cohort and schemas in the local CDM. But I’m not quite sure how to find these information and found a post probably helpful but I can’t understand Korean…

My understanding is that, I need to specify the following four variables in the sql codes, right? Should I replace the @cdm_databse_schema with the name of the data I use in the local CDM? I am also confused about what to do with the other three variables.

Thanks for your time and answers!

You need to leave the place-holder variables in there, they are handled by either the Source that you run the analysis on or are provided a run time by the analysis execution:

  • cdm_database_schema: Comes from the target Source you are executing on, and is the name of the CDM schema. This is a parameter because you may have multiple CDMs configured, each with their own CDM schema, so this needs to be paramaterized.
  • cohort_table: The underlying library is FeatureExtraction, which makes no assumption about what the name of your cohort table is, but WebAPI knows what the cohort table name should be, so during analysis execution, WebAPI will set this paramater (but it still needs to be defined in the analysis sql)
  • cohort_id: This will be the id of the Target Cohort you specify in your Characterization design, and WebAPI will specify the target cohort as the cohort_id.
  • analysis_id: this will be an id representing the specific characterization analysis you are performing: all the FeatureExtraction presets have their own IDs, but any custom features (either Criteria-based or Sql-based) will have their own unique ID, and this paramater is passed in by WebAPI.

The short answer is you don’t change them, but you need to have them there so that the software can do the proper replacements before executing the query.

As far as the failures, you can check your WebAPI log, and look for things like ‘table does not exist’ or 'permission denied messages.

Hi Chris -

I tested features created by sql codes separately and it turns out that the code you provided work fine! Thank you! But I checked the other two, they all have ‘@aggregated’ and ‘@temperal’ and failed the running (as attached).

They are the caresite (which has been removed today I guess) and the Charlson index(we updated the concept ids for health condition) sql codes I copied from Github.


The aggregated and temporal is not supported parameters. Meaning: when WebAPI executes, it only has paramters for the previous list, but aggregated and temporal is a different type of functionality from FeatureExtraction that isn’t used when using custom sql.

1 Like

Interesting… So to be able to use the custom sql for these two purposes (which are to obtain the care_site and compute the updated Charlson comorbidity index), I will have to either 1) use the custom sql code in R by using FeatureExtration function or 2) rewrite sql that avoids using these two parameters, right?

yes, but I think option 2 is the way to go. Let me give more detail:

FeatureExtraction supports a few modes of execution: patient-level and aggregate, and within those two, you can get ‘features’ within a time window (indicated by a time_id), or you can get a result without a time_id (null) where the analysis_id would inform the time window (short/medium/long). The primary usecase of FeatureExtraction (surprisingly) is to extract features, or patient-level characteristics that can be used to build models for prediction, propensity scores, etc. A secondary use of these features is to get population-level summary statistics.

Cohort Characterization is all about population-level summaries (as in, what’s the proportion of condition X within a time window relative to index (cohort start)). A particular FeatureExtraction analysis can summarize if it has the ‘aggregated’ param, which controls how to group by - sum to summarize. Atlas lets you use those ‘preset’ analyses that support aggregation to get the summary statistics. The custom SQL and criteria type of features ONLY work in aggregate, so your sql needs to produce the following columns:
covariate_id : usually computed as : cast(condition_concept_id as bigint)*1000 + @analysis_id
covariate_name: usually the concept name
concept_id: the concept ID that represents this observational data you are summarizing
sum_value, the number of people you counted
average_value: the proportion of people with the concept calcualted as: count(*)*1.0 / stat.total_cnt*1.0

Fun fact: it may not be obvious calling this column ‘average value’ but the average is the sum of values / number of values, but when dealing with binary (0 and 1) covariates, then the average is the proportion of people with the covariate (sum of 1’s = people with covariate / total people (number of values) = people with cases / total people = proportion of people with case. We use these types of binary features to produce the characerization reports in atlas to show 'proportion of condition/drug/procedure within a time window (short medum long). We would not use something like average age in this way (because the values sumed would be ages between 0 and 99) but instead we can put people into ‘age groups’ and have yes/no flags for if a person is in an age group, and create proportions by age group (ie: 35% of peole are between 30 and 40).

Circling back to the sql you want: if you want to provide the sql to yield the staristics you want to summarize, just make sure you yield those specific columns, and it will show up in the characterization report. Start with the same I gave you (which just counts people who have any condition concept within 30d before and 30d after index) and then play around with expanding that to locate (and count) the people you would like to summarize in the report.

1 Like

Thank you so much, Chris - for the detailed clarification and prompt response!

Appreciate it,