
Logging the full set of SQL that Atlas plans to execute before it runs (to help debugging)

Verbose logging of the SQL that Atlas plans to execute (ideally before it runs the sequence of queries) would really help debugging.

We’ve been using Spark/Databricks for a while and are enjoying the performance, but we sometimes find Atlas features that aren’t working. For most user interactions in Atlas, multiple SQL commands are generated (from a template). I’d like to log the full set of SQL that Atlas plans to run before it starts executing any of it, so that I can more easily detect when and where queries fail. Ideally those would be logged complete with parameter substitution, but I’d be OK with just logging the set of SQL templates if that is more feasible.
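For concreteness, here is a minimal sketch of the kind of pre-execution logging I have in mind, using SqlRender’s Java API (the template and schema name below are made-up placeholders, not actual Atlas templates):

    // Render a parameterized template, translate it to the target dialect,
    // then split and log every statement before any of it is executed.
    import org.ohdsi.sql.SqlRender;
    import org.ohdsi.sql.SqlSplit;
    import org.ohdsi.sql.SqlTranslate;

    public class SqlPreviewLogger {
      public static void main(String[] args) {
        String template = "SELECT * FROM @cdm_schema.person; SELECT * FROM @cdm_schema.observation_period;";
        // Parameter substitution (@cdm_schema -> my_cdm)
        String rendered = SqlRender.renderSql(template,
            new String[] {"cdm_schema"}, new String[] {"my_cdm"});
        // Translate OHDSI SQL to the target dialect, e.g. spark
        String translated = SqlTranslate.translateSql(rendered, "spark");
        // Split into individual statements and log each one up front
        String[] statements = SqlSplit.splitSql(translated);
        for (int i = 0; i < statements.length; i++) {
          System.out.println("-- statement " + (i + 1) + " of " + statements.length);
          System.out.println(statements[i]);
        }
      }
    }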

This would have a side benefit of letting me archive certain queries so that I can re-run them on a schedule.

For example, issues we’re currently trying to debug with Spark include:

  • Navigating the concept hierarchy and concept relationships is slower on Spark than on SQL Server
  • Running Heracles reports fails near the last step (inserting data into the heracles_results tables)
  • Running Cohort Pathways fails partway through, but the Tomcat/Catalina logs flag the wrong query as the culprit (i.e., one that actually runs properly), so we’re having trouble tracking down which downstream query is really causing the issue.

Alternatively, or in addition: I presume that all Atlas interactions ultimately orchestrate a set of WebAPI calls (which perform parameter substitution and SqlRender translation to produce the final SQL before submitting it for execution). Is there documentation that shows what sequence of API calls is generated for each type of interaction in Atlas?
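In the meantime, each interaction’s WebAPI calls can be observed in the browser’s developer-tools Network tab and replayed directly. Below is a minimal sketch in Java, assuming a local WebAPI deployment at http://localhost:8080; the /WebAPI/info endpoint simply returns the deployed version and serves as a sanity check:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebApiProbe {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Base URL is an assumption; substitute your own WebAPI deployment
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/WebAPI/info"))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"version":"2.x.x", ...}
      }
    }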

Hi Thomas,

Perhaps the unit tests of WebAPI could help show the various SQL patterns. Some queries are also readily available in the QueryLibrary (https://github.com/ohdsi/querylibrary), especially the concept-based ones.

On the Spark side:

  1. We observed the same issue: Databricks doesn’t really have proper indexing to help with the concept relationship/hierarchy queries, so we chose to use Postgres for this.
  2. We’ll need to raise a WebAPI GitHub issue. My guess is that the tasklet that executes the Heracles SQL commands (https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/CohortAnalysisTasklet.java#L90) needs to be wrapped in SqlRender’s sparkHandleInsert function to ensure the insert command is reconstructed with all table fields present (see the sketch after this list).
  3. Perhaps you can share a cohort pathway JSON with my colleague @Brad_Rechkemmer from Amgen and @alex-odysseus from Odysseus, so they can try it in their Databricks environment.
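A rough sketch of the wrapping I have in mind, assuming SqlRender’s BigQuerySparkTranslate.sparkHandleInsert(String, Connection), which reads the target table’s metadata over the live connection to rewrite INSERT statements with the full column list (the helper name and dialect check here are hypothetical):

    import java.sql.Connection;
    import java.sql.SQLException;
    import org.ohdsi.sql.BigQuerySparkTranslate;

    public class SparkInsertFix {
      // Hypothetical helper: only rewrite when targeting the spark dialect
      static String prepareForSpark(String sql, Connection connection, String dialect)
          throws SQLException {
        if ("spark".equalsIgnoreCase(dialect)) {
          // Reconstructs e.g. "INSERT INTO t SELECT ..." so all table fields are present
          sql = BigQuerySparkTranslate.sparkHandleInsert(sql, connection);
        }
        return sql;
      }
    }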

Thanks,
Ajit

@Ajit_Londhe - Thanks.

@Brad_Rechkemmer , @alex-odysseus
ohdsi_treatment_pathway.json.xml (9.3 KB)
Attached is a JSON file (with an XML extension so that it would upload) for a standard, simple diabetes treatment pathway. It fails partway through when run on Databricks, but runs fine on SQL Server.
