CohortDiagnostics Adding Additional Concept Codes Unexpectedly To Phenotype Definition?

Hi folks and Happy New Year! :partying_face:

Summary: I am seeing unexpected behavior from CohortDiagnostics when it is run on a phenotype definition I wrote: I receive results for concepts I did not specify in my definition. For discussion reference, here is the phenotype definition: Lung Cancer.

R sessionInfo output:
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
        LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8          LC_NUMERIC=C
 [3] LC_TIME=C.UTF-8           LC_COLLATE=C.UTF-8
 [5] LC_MONETARY=C.UTF-8       LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8          LC_NAME=C.UTF-8
 [9] LC_ADDRESS=C.UTF-8        LC_TELEPHONE=C.UTF-8
[11] LC_MEASUREMENT=C.UTF-8    LC_IDENTIFICATION=C.UTF-8

time zone: localtime
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] CohortGenerator_0.12.2   R6_2.6.1
[3] CohortDiagnostics_3.4.2 FeatureExtraction_3.11.0
[5] Andromeda_1.1.1         dplyr_1.1.4
[7] DatabaseConnector_6.4.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5        cli_3.6.5        rlang_1.1.6
 [4] DBI_1.2.3          purrr_1.1.0      renv_1.1.5
 [7] generics_0.1.4    rJava_1.0-11     glue_1.8.0
[10] bit_4.6.0         tibble_3.3.0     fastmap_1.2.0
[13] lifecycle_1.0.4  memoise_2.0.1    duckdb_1.4.0
[16] compiler_4.4.3   SqlRender_1.19.3 RSQLite_2.4.3
[19] blob_1.2.4       pkgconfig_2.0.3 tidyr_1.3.1
[22] tidyselect_1.2.1 pillar_1.11.1    magrittr_2.0.4
[25] tools_4.4.3      bit64_4.6.0-1   cachem_1.1.0

Steps: Here is what I do:

  1. Run exportCohortDefinitionSet and generateCohortSet from CohortGenerator – this works perfectly and the cohort gets generated
  2. Generate cohort statistics via executeDiagnostics from CohortDiagnostics – this works as expected on the cohort with no errors (for details on execution settings, see below)
  3. Prepare statistics for Shiny viewer using createMergedResultsFile from CohortDiagnostics – works as expected and a SQLite DB is made
  4. Review results file and find additional concepts included that are not originally seen within my ATLAS instance – very confused here

For 2, here is the configuration I had made:

executeDiagnostics settings:
executeDiagnostics(cohortDefinitionSet,
  connectionDetails = connectionDetails,
  cohortTable = cohortTable,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  exportFolder = exportFolder,
  databaseId = "Pharmetrics",
  databaseDescription = "Lab Database",
  minCellCount = 11,
  runInclusionStatistics = FALSE,
  runIncludedSourceConcepts = TRUE,
  runOrphanConcepts = FALSE,
  runTimeSeries = FALSE,
  runVisitContext = FALSE,
  runBreakdownIndexEvents = TRUE,
  runIncidenceRate = FALSE,
  runCohortRelationship = FALSE,
  runTemporalCohortCharacterization = FALSE,
  runFeatureExtractionOnSample = FALSE
)

Problem: To give some additional context to this, here is a screenshot of what I mean:

On the left of the image is my phenotype definition and on the right is the CohortDiagnostics results viewer. The concept ID 35206086 shows up in my results file on the right, but it is not present in my ATLAS instance: on the left, the Included Concepts tab for my definition shows “No Matching Records Found”. The ATLAS and CohortDiagnostics results are produced from the exact same database.

The principal reason I am worried is that I do not understand why additional concepts are seemingly being added in CohortDiagnostics. It makes me question whether there is something amiss with my definition in general. My thoughts about what is going on here are as follows:

  • Is there some “strict” setting I am not thinking of in executeDiagnostics to toggle?
  • Could it be an underlying vocabulary issue where, although I don’t see this code in ATLAS, it gets added into analyses by executeDiagnostics?
  • Is there some interaction I am not accounting for between the output of CohortGenerator::generateCohortSet and its downstream use in CohortDiagnostics?
  • Some other cause?

I am happy to provide additional information but I am quite stumped by this. Any thoughts?

Thanks!

~ tcp :deciduous_tree:

Hi Jacob

The concept you are looking at is non-standard and is mapped to a standard one. In CohortDiagnostics you see it as condition_source_concept_id, meaning that it was in the source data. You can find the code in another tab (Included source codes) or look for the standard concept in included concepts. It somehow got into your definition of Lung Cancer.
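
If you want to check the mapping yourself, something like the following might help. It is only a minimal sketch: the function name lookupConceptMapping is made up for illustration, and it assumes the vocabulary tables (concept, concept_relationship) live in the same schema as your CDM and that the DatabaseConnector package is installed. It returns the concept along with any standard concept it maps to via a 'Maps to' relationship.

```r
# Hypothetical helper (not part of CohortDiagnostics): look up a concept and
# the standard concept(s) it maps to in the OMOP vocabulary tables.
# Assumes the vocabulary tables are in `vocabularyDatabaseSchema`.
lookupConceptMapping <- function(connection, vocabularyDatabaseSchema, conceptId) {
  sql <- "
    SELECT c.concept_id,
           c.concept_name,
           c.vocabulary_id,
           c.standard_concept,
           cr.concept_id_2 AS standard_concept_id,
           c2.concept_name AS standard_concept_name
    FROM @vocab.concept c
    LEFT JOIN @vocab.concept_relationship cr
      ON cr.concept_id_1 = c.concept_id
     AND cr.relationship_id = 'Maps to'
     AND cr.invalid_reason IS NULL
    LEFT JOIN @vocab.concept c2
      ON c2.concept_id = cr.concept_id_2
    WHERE c.concept_id = @concept_id;"

  # Renders the @parameters, translates to the connection's SQL dialect,
  # and returns the result as a data frame.
  DatabaseConnector::renderTranslateQuerySql(
    connection = connection,
    sql = sql,
    vocab = vocabularyDatabaseSchema,
    concept_id = conceptId,
    snakeCaseToCamelCase = TRUE
  )
}

# Usage, reusing connectionDetails and cdmDatabaseSchema from the settings above:
# connection <- DatabaseConnector::connect(connectionDetails)
# lookupConceptMapping(connection, cdmDatabaseSchema, 35206086)
# DatabaseConnector::disconnect(connection)
```

A standard_concept value that is empty (NA) in the result would indicate a non-standard concept, and standard_concept_id would then be the concept to look for in the Included Concepts tab in ATLAS.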

The methods are fine.

Ah! This is a classic case of the XY problem on my part! I am so glad you found that the root of the problem lies within my definition as opposed to CohortDiagnostics! So instead of this being a problem within CohortDiagnostics, this is actually a resounding success of CohortDiagnostics finding a problem with my definitions behaving badly!

Thanks so much @zhuk, I’ll go back to looking at my definitions – this resolves my concerns completely as I now know exactly where the problem lies.

Cheers!

~ tcp :deciduous_tree: