Discrepancy in subject count between ATLAS and Cohort Diagnostics tool

mainguyen · April 28, 2022, 3:53pm

Dear all,

I’ve just discovered that the number of subjects for a given concept differs between the Cohort Diagnostics tool (under the Concepts in Data Source tab) and ATLAS (Data Sources tab → Condition Occurrence). For instance, concept ID 197320 has n=1,277,159 subjects in Cohort Diagnostics but n=2,016,790 in ATLAS (see attached photos).

I wonder if I am missing anything?

Best wishes,

@schuemie @Gowtham_Rao @Patrick_Ryan do you know what could be the reason for this?

Chris_Knoll · April 28, 2022, 4:47pm

To be more precise: I think what you are showing is a discrepancy between Achilles and CohortDiagnostics because Atlas is just presenting the results from the Achilles package. I’m not sure if CohortDiagnostics uses its own implementation to calculate concept counts.

For your reference, the Achilles analysis specification for the person count by concept_id for conditions is defined here, and the implementation of the analysis is here.

Since it is simple and small, I’ll paste it here:

-- 400	Number of persons with at least one condition occurrence, by condition_concept_id

--HINT DISTRIBUTE_ON_KEY(stratum_1)
SELECT 
	400 AS analysis_id,
	CAST(co.condition_concept_id AS VARCHAR(255)) AS stratum_1,
	CAST(NULL AS VARCHAR(255)) AS stratum_2,
	CAST(NULL AS VARCHAR(255)) AS stratum_3,
	CAST(NULL AS VARCHAR(255)) AS stratum_4,
	CAST(NULL AS VARCHAR(255)) AS stratum_5,
	COUNT_BIG(DISTINCT co.person_id) AS count_value
INTO @scratchDatabaseSchema@schemaDelim@tempAchillesPrefix_400
FROM @cdmDatabaseSchema.condition_occurrence co
JOIN @cdmDatabaseSchema.observation_period op  ON co.person_id = op.person_id
  AND co.condition_start_date >= op.observation_period_start_date
  AND co.condition_start_date <= op.observation_period_end_date
GROUP BY co.condition_concept_id;

One thing that immediately jumps out is that Achilles is only using records that are found within an observation period (see join on observation_period). This makes sense because Achilles is trying to report on events that could be used for cohort entry events. I’m not sure if CohortDiagnostics is applying the same rule.
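To make the effect of that join concrete, here is a minimal sketch in plain Python with made-up data. The table contents, person IDs, dates, and helper function names are all hypothetical; it is not how Achilles or CohortDiagnostics is implemented, only an illustration of how restricting to records inside an observation period can lower a distinct-person count.

```python
from datetime import date

# Toy rows; all values are invented for illustration.
# condition_occurrence: (person_id, condition_concept_id, condition_start_date)
condition_occurrence = [
    (1, 197320, date(2020, 3, 1)),
    (2, 197320, date(2019, 6, 15)),  # before person 2's observation period starts
    (3, 197320, date(2021, 1, 10)),
]

# observation_period: (person_id, start_date, end_date)
observation_period = [
    (1, date(2019, 1, 1), date(2021, 12, 31)),
    (2, date(2020, 1, 1), date(2021, 12, 31)),
    (3, date(2020, 1, 1), date(2021, 12, 31)),
]


def persons_all_records(concept_id):
    """Distinct persons with the concept, ignoring observation periods."""
    return {p for p, c, _ in condition_occurrence if c == concept_id}


def persons_in_observation(concept_id):
    """Distinct persons whose record starts inside one of their observation periods."""
    return {
        p
        for p, c, start in condition_occurrence
        if c == concept_id
        and any(
            op_p == p and op_start <= start <= op_end
            for op_p, op_start, op_end in observation_period
        )
    }


n_all = len(persons_all_records(197320))
n_observed = len(persons_in_observation(197320))
print(n_all, n_observed)
```

In this toy data, person 2 drops out once the observation-period condition is applied, so the restricted count can only be equal to or lower than the unrestricted one.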

Edit:

Looking at this further, my explanation above would lead to a lower count in Atlas than in Cohort Diagnostics, whereas you are seeing a higher count in Atlas. The only thing I can think of is that Atlas shows the entire database (via Achilles), while the cohort you are viewing in Cohort Diagnostics may be a subset of the database population, in which case CD would show a lower count than Achilles.

Gowtham_Rao · April 28, 2022, 5:55pm

(I had to look at the code to be sure.) Tracing back:

The Shiny app uses this query to populate the menu “Concepts in Data Source”.

It’s populated using this logic here

It’s different from the Achilles logic because we group by both source_concept_id and concept_id, like so

	SELECT observation_concept_id AS concept_id,
		observation_source_concept_id AS source_concept_id,

So we don’t have a true unique person count by conceptId. We only have it for the combination of conceptId + sourceConceptId.

To make this work, we then summarize it here

github.com/OHDSI/CohortDiagnostics/blob/3836704a674e68964dfa6ef13f8777bcea60780d/inst/shiny/DiagnosticsExplorer/server.R#L1748-L1790

          if (input$includedType == "Source fields") {
            data <- data %>%
              dplyr::filter(.data$sourceConceptId > 0) %>%
              dplyr::select(
                .data$databaseId,
                .data$sourceConceptId,
                .data$sourceConceptName,
                .data$sourceVocabularyId,
                .data$sourceConceptCode,
                .data$conceptSubjects,
                .data$conceptCount
              ) %>% 
              dplyr::rename("conceptId" = .data$sourceConceptId,
                            "conceptName" = .data$sourceConceptName,
                            "vocabularyId" = .data$sourceVocabularyId,
                            "conceptCode" = .data$sourceConceptCode) %>%
              dplyr::group_by(.data$databaseId,.data$conceptId, .data$conceptName, .data$vocabularyId, .data$conceptCode) %>% 
              dplyr::summarise(conceptSubjects = sum(.data$conceptSubjects),
                               conceptCount = sum(.data$conceptCount), 
                               .groups = 'keep') %>%

This file has been truncated.

That summarization is the reason why you see the difference.
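A minimal sketch of why summing per-(conceptId, sourceConceptId) subject counts is not the same as a distinct person count per conceptId. The data and the source concept IDs (9001, 9002) are made up, and the variable names are hypothetical; this does not reproduce the numbers in this thread, it only shows that the two calculations can disagree.

```python
from collections import defaultdict

# Toy rows: (person_id, concept_id, source_concept_id); all values invented.
records = [
    (1, 197320, 9001),
    (1, 197320, 9002),  # same person, second source concept for the same concept
    (2, 197320, 9001),
    (3, 197320, 9002),
]

# Step 1: distinct persons per (concept_id, source_concept_id) pair.
persons_by_pair = defaultdict(set)
for person_id, concept_id, source_concept_id in records:
    persons_by_pair[(concept_id, source_concept_id)].add(person_id)

# Step 2: sum those per-pair subject counts up to concept_id,
# analogous to summarising with sum(conceptSubjects).
summed_subjects = defaultdict(int)
for (concept_id, _source), persons in persons_by_pair.items():
    summed_subjects[concept_id] += len(persons)

# For comparison: distinct persons per concept_id computed directly.
persons_by_concept = defaultdict(set)
for person_id, concept_id, _source in records:
    persons_by_concept[concept_id].add(person_id)
distinct_subjects = {c: len(p) for c, p in persons_by_concept.items()}

print(summed_subjects[197320], distinct_subjects[197320])
```

Here person 1 appears under both source concepts, so the summed value counts that person twice while the direct distinct count does not. Distinct counts are not additive across groups unless each person falls into exactly one group.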

Tagging @jpegilbert for awareness as this may be something we want to take up in future versions.

mainguyen · April 29, 2022, 11:27am

Thanks for the clear explanations, @Gowtham_Rao @Chris_Knoll. It would be great to see future developments take this into account.