Atlas Characterizations error -org.ohdsi.webapi.exception

gongliangz · March 29, 2023, 1:39pm

Hi team,
I got an error message when run Characterizations in Atlas (see below). Need help to fix it. Thanks
org.ohdsi.webapi.exception.AtlasException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.springframework.jdbc.Uncategorized SQLException: StatementCallback; uncategorized

SQLException for SQL [<START_WITH>t1 AS (
SELECT cohort_definition_id,
COUNT() AS cnt
FROM tmp_v0224.temp_cohort_b4890s3e
WHERE cohort_definition_id IN (422)
GROUP BY cohort_definition_id
),
t2 AS (
SELECT cohort_definition_id,
COUNT() AS cnt,
MIN(concept_count) AS min_concept_count,
MAX(concept_count) AS max_concept_count,
SUM(CAST(concept_count AS BIGINT)) AS sum_concept_count,
SUM(CAST(concept_count AS BIGINT) * CAST(concept_count AS BIGINT)) AS squared_concept_count
FROM tmp_v0224.b4890s3econcept_count_data
GROUP BY cohort_definition_id
)
<END_WITH>
CREATE TABLE tmp_v0224.b4890s3econcept_count_stats
USING DELTA
AS
(SELECT
t1.cohort_definition_id,
CASE WHEN t2.cnt = t1.cnt THEN t2.min_concept_count ELSE 0 END AS min_value,
t2.max_concept_count AS max_value,
CAST(t2.sum_concept_count / (1.0 * t1.cnt) AS FLOAT) AS average_value,
CAST(CASE
WHEN t2.cnt = 1 THEN 0
ELSE SQRT((1.0 * t2.cntt2.squared_concept_count - 1.0 * t2.sum_concept_countt2.sum_concept_count) / (1.0 * t2.cnt*(1.0 * t2.cnt - 1)))
END AS FLOAT) AS standard_deviation,
t2.cnt AS count_value,
t1.cnt - t2.cnt AS count_no_value,
t1.cnt AS population_size
FROM
t1
INNER JOIN t2
ON t1.cohort_definition_id = t2.cohort_definition_id)]; SQL state [HY000]; error code [500051]; [Simba]SparkJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: [PARSE_SYNTAX_ERROR] org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near ‘<’.(line 1, pos 0)

== SQL ==
<START_WITH>t1 AS (
^^^
SELECT cohort_definition_id,
COUNT() AS cnt
FROM tmp_v0224.temp_cohort_b4890s3e
WHERE cohort_definition_id IN (422)
GROUP BY cohort_definition_id
),
t2 AS (
SELECT cohort_definition_id,
COUNT() AS cnt,
MIN(concept_count) AS min_concept_count,
MAX(concept_count) AS max_concept_count,
SUM(CAST(concept_count AS BIGINT)) AS sum_concept_count,
SUM(CAST(concept_count AS BIGINT) * CAST(concept_count AS BIGINT)) AS squared_concept_count
FROM tmp_v0224.b4890s3econcept_count_data
GROUP BY cohort_definition_id
)
<END_WITH>
CREATE TABLE tmp_v0224.b4890s3econcept_count_stats
USING DELTA
AS
(SELECT
t1.cohort_definition_id,
CASE WHEN t2.cnt = t1.cnt THEN t2.min_co

Chris_Knoll · March 29, 2023, 2:39pm

Is <START_WITN> valid syntax for Spark? I don’t know anything about the dialect of spark but it looks like this is a sqlRender issue.

Thomas_White · March 29, 2023, 3:58pm

<START_WITH> is not valid syntax in Spark. I see it referenced in sqlRender commits 5 and 7 months ago, but not how sqlRender uses that to generate Spark-valid code.

One of the commits was by @schuemie, so perhaps he has insights into this.

Thomas_White · March 30, 2023, 2:43am

@Chris_Knoll and @schuemie, we have a characterization that we ran in October using Spark (Databricks) that ran fine. But today, the same characterization fails with the <START_WITH> message. Some features work and others don’t, but I haven’t been able to track down which. Here are the list of features we used when this worked in October '22:

ID	Name
1	Measurement Range Group Short Term
2	Condition Group Era Start Long Term
3	Drug Group Era Start Medium Term
4	Condition Era Short Term
7	Drug Group Era Short Term
10	Condition Era Any Time Prior
18	Condition Era Start Short Term
25	Drug Group Era Overlapping
31	Drug Exposure Short Term
34	Observation Short Term
36	Dcsi
37	Drug Era Start Short Term
43	Drug Era Short Term
44	Drug Era Overlapping
49	Charlson Index
50	Measurement Short Term
56	Observation Medium Term
58	Measurement Value Short Term
70	Demographics Ethnicity
71	Demographics Age Group
72	Demographics Race
73	Demographics Prior Observation Time
74	Demographics Gender
78	Chads 2
82	Chads 2 Vasc
88	Drug Era Medium Term
91	Distinct Measurement Count Short Term
100	Visit Count Medium Term
101	Visit Count Short Term

Chris_Knoll · March 30, 2023, 2:53am

Ok, sounds like it is a problem with SqlRender, but want to wait for @schuemie to chime in.

gongliangz · March 30, 2023, 1:03pm

Thank all!

Chris_Knoll · March 30, 2023, 1:16pm

So this was done by: Ali Bellamine (github: alibell) (is this you, @alib ?) If so, could you provide some information as to the use of <START_WITH>? It seems to be some sort of tokenization to find the start and ends of CTEs.

Thomas_White · April 5, 2023, 4:58pm

@schuemie , any insights on this? It impacts Characterizations (which used to work), Cohort Pathways, and Heracles reports for cohorts.

In Databricks, the following syntax is not allowed:

WITH cte AS (SELECT ...) CREATE TABLE tmp.table AS (SELECT * FROM cte);

Instead, the syntax needs to be:

CREATE TABLE tmp.table AS WITH cte AS (SELECT ...) SELECT * FROM cte;

I think that is what the <START_WITH> and <END_WITH> syntax is trying to process; but the code isn’t doing the final substitutions. You can see the <START_WITH> clauses in replacementsPatterns.csv, but I don’t see how and where the <START_WITH> and <END_WITH> are replaced.