I have been trying to implement cohort diagnostics for a BgQuery dataset. I got multiple errors but one of them was mainly about not allowing an aggregation function in Group By. Anyone had a similar problem or error?
Error: java.sql.SQLException: [Simba]BigQueryJDBCDriver Error executing query job. Message: Column 6 contains an aggregation function, which is not allowed in GROUP BY at [53:19]
SQL: select cohort_definition_id cohort_id
,period_begin*
,calendar_interval*
,count() records
,count(distinct subject_id) subjects*
,sum(date_diff(if(safe_cast(case *
when cohort_end_date >= period_end*
then period_end*
else cohort_end_date*
end as date) is null,parse_date('%Y%m%d', cast(case *
when cohort_end_date >= period_end*
then period_end*
else cohort_end_date*
end as string)),safe_cast(case *
when cohort_end_date >= period_end*
then period_end*
else cohort_end_date*
end as date)), if(safe_cast(case *
when cohort_start_date >= period_begin*
then cohort_start_date*
else period_begin*
end as date) is null,parse_date('%Y%m%d', cast(case *
when cohort_start_date >= period_begin*
then cohort_start_date*
else period_begin*
end as string)),safe_cast(case *
when cohort_start_date >= period_begin*
then cohort_start_date*
else period_begin*
end as date)), day) + 1) person_days*
,count(case *
when cohort_start_date >= period_begin*
and cohort_start_date <= period_end*
then subject_id*
else null*
end) records_incidence*
,count(distinct case *
when cohort_start_date >= period_begin*
and cohort_start_date <= period_end*
then subject_id*
else null*
end) subjects_incidence*
from test_writable_sulieman.cohorttabletest* inner join ofcjbawzcalendar_periods cp on (
Thanks @jposada . I believe something get switched in the render and translate process. I have been going through the files to identify the source of the problem. The original query is suppose to group on cohort_definition_id cohort_id,period_begin and ,calendar_interval. For some reason, during rendering and translation, the group by is switched from “GROUP BY period_begin,calendar_interval,cohort_definition_id;”
to
“GROUP BY 2,3,6”
So I changed the original query from “GROUP BY period_begin,calendar_interval,cohort_definition_id;” to “GROUP BY 1,2,3” to eliminate the chances of change during translation and I am still getting the same error.
@Gowtham_Rao , Apparently I just hit reply when you did. Yes, I believe it is a translation issue. check the post that I just wrote. Something is off about the translation of GROUP BY
hi @Gowtham_Rao if you could help us identify which function is launching that SQL it is easier to get to the root cause and possibly push a fix in SQlrender of Dbconnector with the wonderful help of @Konstantin_Yaroshove and @gregk
@jposada (Also for some of tagged people @Gowtham_Rao@Konstantin_Yaroshove@gregk ) : I figured it out. The error happened in ComputeTimeSeries.sql
Apparently the translator will mess up the order of column names when we rename the column. For instance, this query renamed the first column, so the translator always switch column 1 (the renamed one) to 6 which is an aggregate function.
So my fix as rewriting the query in that file, to rename all the columns in inner query and use an outer query for aggregation. I believe this error will happen for similar queries. I will test the other parameters by turning them on. Below, I included my rewritten query for anyone who might have similar issue or for future releases:
,subject_id
,(datediff(dd, CASE
WHEN cohort_start_date >= period_begin
THEN cohort_start_date
ELSE period_begin
END, CASE
WHEN cohort_end_date >= period_end
THEN period_end
ELSE cohort_end_date
END) + 1) days
,(CASE
WHEN cohort_start_date >= period_begin
AND cohort_start_date <= period_end
THEN subject_id
ELSE NULL
END) r_incidence
,( CASE
WHEN cohort_start_date >= period_begin
AND cohort_start_date <= period_end
THEN subject_id
ELSE NULL
END) s_incidence
FROM @cohort_database_schema.@cohort_table
INNER JOIN #calendar_periods cp ON (
cohort_start_date >= period_begin
AND cohort_start_date <= period_end
)
OR (
cohort_end_date >= period_begin
AND cohort_end_date <= period_end
)
WHERE cohort_definition_id IN (@cohort_ids)
)
GROUP BY cohort_id,period_begin,calendar_interval;
@Gowtham_Rao : I summarized the issue and submitted it here.
Let me know what is the next step or if there is anything else I can do.
I would be happy to do a PR but it will be great if you tell me what does the PR entitle