Error when running CohortDiagnostics with BigQuery

LinaSulieman · July 27, 2021, 4:47am

I have been trying to implement cohort diagnostics for a BgQuery dataset. I got multiple errors but one of them was mainly about not allowing an aggregation function in Group By. Anyone had a similar problem or error?

For instance running the simplest Cohort Diagnostic function:
runCohortDiagnostics(baseUrl = baseUrl,
cohortSetReference = cohortsToCreate,
connectionDetails = connectionDetails,
cdmDatabaseSchema = “SCHEMA”,
tempEmulationSchema = " SCHEMA ",
cohortDatabaseSchema = " SCHEMA ",
cohortTable = “cohorttable”,
inclusionStatisticsFolder = inclusionStatisticsFolder,
exportFolder = exportFolder,
databaseId = 'DatasetID,
runInclusionStatistics = FALSE,
runIncludedSourceConcepts = TRUE,
runOrphanConcepts = TRUE,
runTimeDistributions = TRUE,
runBreakdownIndexEvents = TRUE,
runIncidenceRate = FALSE,
runCohortOverlap = FALSE,
runCohortCharacterization = FALSE,
minCellCount = 0)

gave this error

Error:
java.sql.SQLException: [Simba]BigQueryJDBCDriver Error executing query job. Message: Column 6 contains an aggregation function, which is not allowed in GROUP BY at [53:19]

SQL:
select cohort_definition_id cohort_id

,period_begin*
,calendar_interval*
,count() records
,count(distinct subject_id) subjects*
,sum(date_diff(if(safe_cast(case *

  		when cohort_end_date >= period_end*

```
  			then period_end*
```
```
  		else cohort_end_date*
```

  		end  as date) is null,parse_date('%Y%m%d', cast(case *

  		when cohort_end_date >= period_end*

```
  			then period_end*
```
```
  		else cohort_end_date*
```
```
  		end  as string)),safe_cast(case *
```

  		when cohort_end_date >= period_end*

```
  			then period_end*
```
```
  		else cohort_end_date*
```

  		end  as date)), if(safe_cast(case *

  		when cohort_start_date >= period_begin*

```
  			then cohort_start_date*
```
```
  		else period_begin*
```

  		end  as date) is null,parse_date('%Y%m%d', cast(case *

  		when cohort_start_date >= period_begin*

```
  			then cohort_start_date*
```
```
  		else period_begin*
```
```
  		end  as string)),safe_cast(case *
```

  		when cohort_start_date >= period_begin*

```
  			then cohort_start_date*
```
```
  		else period_begin*
```

  		end  as date)), day) + 1) person_days*

,count(case *

  	when cohort_start_date >= period_begin*

  		and cohort_start_date <= period_end*

```
  		then subject_id*
```
```
  	else null*
```
```
  	end) records_incidence*
```
,count(distinct case *

  	when cohort_start_date >= period_begin*

  		and cohort_start_date <= period_end*

```
  		then subject_id*
```
```
  	else null*
```
```
  	end) subjects_incidence*
```
from test_writable_sulieman.cohorttabletest*
inner join ofcjbawzcalendar_periods cp on (
```
  cohort_start_date >= period_begin*
```
```
  and cohort_start_date <= period_end*
```
```
  )*
```
or (*
```
  cohort_end_date >= period_begin*
```
```
  and cohort_end_date <= period_end*
```
```
  )*
```

where cohort_definition_id in (101431)

group by 2, 3, 6 ;*

jposada · July 27, 2021, 8:32pm

hi @Gowtham_Rao,

Any help here is much appreciated.

Gowtham_Rao · July 27, 2021, 9:50pm

I am not really sure. I think this would be a SQL translation issue?

LinaSulieman · July 27, 2021, 9:50pm

Thanks @jposada . I believe something get switched in the render and translate process. I have been going through the files to identify the source of the problem. The original query is suppose to group on cohort_definition_id cohort_id,period_begin and ,calendar_interval. For some reason, during rendering and translation, the group by is switched from “GROUP BY period_begin,calendar_interval,cohort_definition_id;”
to
“GROUP BY 2,3,6”
So I changed the original query from “GROUP BY period_begin,calendar_interval,cohort_definition_id;” to “GROUP BY 1,2,3” to eliminate the chances of change during translation and I am still getting the same error.

@Gowtham_Rao , @gregk , any thoughts?

LinaSulieman · July 27, 2021, 9:51pm

@Gowtham_Rao , Apparently I just hit reply when you did. Yes, I believe it is a translation issue. check the post that I just wrote. Something is off about the translation of GROUP BY

jposada · July 27, 2021, 11:50pm

hi @Gowtham_Rao if you could help us identify which function is launching that SQL it is easier to get to the root cause and possibly push a fix in SQlrender of Dbconnector with the wonderful help of @Konstantin_Yaroshove and @gregk

LinaSulieman · July 28, 2021, 9:28pm

@jposada (Also for some of tagged people @Gowtham_Rao @Konstantin_Yaroshove @gregk ) : I figured it out. The error happened in ComputeTimeSeries.sql
Apparently the translator will mess up the order of column names when we rename the column. For instance, this query renamed the first column, so the translator always switch column 1 (the renamed one) to 6 which is an aggregate function.
So my fix as rewriting the query in that file, to rename all the columns in inner query and use an outer query for aggregation. I believe this error will happen for similar queries. I will test the other parameters by turning them on. Below, I included my rewritten query for anyone who might have similar issue or for future releases:

SELECT cohort_id,period_begin,calendar_interval,
COUNT_BIG(*) records,
COUNT(DISTINCT subject_id) subjects,
SUM(days) person_days,COUNT_BIG(r_incidence) records_incidence,COUNT_BIG(DISTINCT s_incidence) subjects_incidence
FROM
(SELECT cohort_definition_id cohort_id
,period_begin
,calendar_interval

,subject_id
,(datediff(dd, CASE 
			WHEN cohort_start_date >= period_begin
				THEN cohort_start_date
			ELSE period_begin
			END, CASE 
			WHEN cohort_end_date >= period_end
				THEN period_end
			ELSE cohort_end_date
			END) + 1) days
,(CASE 
		WHEN cohort_start_date >= period_begin
			AND cohort_start_date <= period_end
			THEN subject_id
		ELSE NULL
		END) r_incidence
,( CASE 
		WHEN cohort_start_date >= period_begin
			AND cohort_start_date <= period_end
			THEN subject_id
		ELSE NULL
		END) s_incidence

FROM @cohort_database_schema.@cohort_table
INNER JOIN #calendar_periods cp ON (
cohort_start_date >= period_begin
AND cohort_start_date <= period_end
)
OR (
cohort_end_date >= period_begin
AND cohort_end_date <= period_end
)
WHERE cohort_definition_id IN (@cohort_ids)
)
GROUP BY cohort_id,period_begin,calendar_interval;

Gowtham_Rao · July 28, 2021, 11:37pm

Interesting - and thank you!!

Could you please summarize it here Issues · OHDSI/CohortDiagnostics · GitHub
and i will issue a hot fix.

If you are interested and able, please feel free to do a PR too.

LinaSulieman · July 29, 2021, 4:36am

@Gowtham_Rao : I will summarize it and post to on Github issue.
Do you mind telling me what is PR? have been using the tool for three weeks only

LinaSulieman · July 29, 2021, 10:44pm

@Gowtham_Rao : I summarized the issue and submitted it here.

github.com/OHDSI/CohortDiagnostics

Cohort diagnostic - Translation error for query in ComputeTimeSeries.sql

opened 10:42PM - 29 Jul 21 UTC

closed 06:07PM - 13 Aug 21 UTC

Linasulieman

Version 2.1.1

I am trying to run the CohortDiagnostic function on a database that is saved in …a BigQuery schema. When I ran the the runCohortDiagnostics function I got this error that was generated for the query in ComputeTimeSeries.sql **The error that I got was:** Error: java.sql.SQLException: [Simba]BigQueryJDBCDriver Error executing query job. Message: Column 6 contains an aggregation function, which is not allowed in GROUP BY at [53:19] SQL: select cohort_definition_id cohort_id ,period_begin ,calendar_interval ,count() records ,count(distinct subject_id) subjects ,sum(date_diff(if(safe_cast(case when cohort_end_date >= period_end then period_end else cohort_end_date end as date) is null,parse_date('%Y%m%d', cast(case when cohort_end_date >= period_end* then period_end else cohort_end_date end as string)),safe_cast(case when cohort_end_date >= period_end then period_end else cohort_end_date end as date)), if(safe_cast(case when cohort_start_date >= period_begin then cohort_start_date else period_begin end as date) is null,parse_date('%Y%m%d', cast(case when cohort_start_date >= period_begin then cohort_start_date else period_begin end as string)),safe_cast(case when cohort_start_date >= period_begin then cohort_start_date else period_begin end as date)), day) + 1) person_days ,count(case when cohort_start_date >= period_begin and cohort_start_date <= period_end then subject_id else null end) records_incidence ,count(distinct case when cohort_start_date >= period_begin and cohort_start_date <= period_end then subject_id else null end) subjects_incidence from test_writable_sulieman.cohorttabletest inner join ofcjbawzcalendar_periods cp on ( cohort_start_date >= period_begin and cohort_start_date <= period_end ) or (cohort_end_date >= period_begin and cohort_end_date <= period_end ) where cohort_definition_id in (101431) group by 2, 3, 6 ; **Reason** After investigating, I realized that the query is suppose to group on column 1,2, and 3. The query rename column 1 as shown earlier in the query. Apparently the renaming of column 1 misses up the order or numbering in GROUP BY and always replaces 1 with 6. So renaming a column that will be used in GROUP BY will mess up the order during translation. **Fix** To fix this, I rewrote the query to have an outer and an inner query. All the renaming processes for columns that will be used in GROUP BY should happen in the inner query. I moved the GROUP BY to the outer query to avoid renaming and grouping on the same column in the same query. After rewriting the query, the function worked. the I included below the query that I wrote to solve this error: SELECT cohort_id,period_begin,calendar_interval, COUNT_BIG(*) records, COUNT(DISTINCT subject_id) subjects, SUM(days) person_days,COUNT_BIG(r_incidence) records_incidence,COUNT_BIG(DISTINCT s_incidence) subjects_incidence FROM (SELECT cohort_definition_id cohort_id ,period_begin ,calendar_interval ,subject_id ,(datediff(dd, CASE WHEN cohort_start_date >= period_begin THEN cohort_start_date ELSE period_begin END, CASE WHEN cohort_end_date >= period_end THEN period_end ELSE cohort_end_date END) + 1) days ,(CASE WHEN cohort_start_date >= period_begin AND cohort_start_date <= period_end THEN subject_id ELSE NULL END) r_incidence ,( CASE WHEN cohort_start_date >= period_begin AND cohort_start_date <= period_end THEN subject_id ELSE NULL END) s_incidence FROM @cohort_database_schema.@cohort_table INNER JOIN #calendar_periods cp ON ( cohort_start_date >= period_begin AND cohort_start_date <= period_end ) OR ( cohort_end_date >= period_begin AND cohort_end_date <= period_end ) WHERE cohort_definition_id IN (@cohort_ids) ) GROUP BY cohort_id,period_begin,calendar_interval;

Let me know what is the next step or if there is anything else I can do.
I would be happy to do a PR but it will be great if you tell me what does the PR entitle

jposada · July 30, 2021, 2:00am

Hi @LinaSulieman ,

A PR is a Pull Request. You can find one example of one that is currently open here

You can learn more about what they are here

https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests

LinaSulieman · July 30, 2021, 2:39am

Thanks @jposada . I have done of those in the past but I will definitely check the posts to refresh my memory