OHDSI Home | Forums | Wiki | Github

Error when running CohortDiagnostics with BigQuery

I have been trying to implement cohort diagnostics for a BgQuery dataset. I got multiple errors but one of them was mainly about not allowing an aggregation function in Group By. Anyone had a similar problem or error?

For instance running the simplest Cohort Diagnostic function:
runCohortDiagnostics(baseUrl = baseUrl,
cohortSetReference = cohortsToCreate,
connectionDetails = connectionDetails,
cdmDatabaseSchema = “SCHEMA”,
tempEmulationSchema = " SCHEMA ",
cohortDatabaseSchema = " SCHEMA ",
cohortTable = “cohorttable”,
inclusionStatisticsFolder = inclusionStatisticsFolder,
exportFolder = exportFolder,
databaseId = 'DatasetID,
runInclusionStatistics = FALSE,
runIncludedSourceConcepts = TRUE,
runOrphanConcepts = TRUE,
runTimeDistributions = TRUE,
runBreakdownIndexEvents = TRUE,
runIncidenceRate = FALSE,
runCohortOverlap = FALSE,
runCohortCharacterization = FALSE,
minCellCount = 0)

gave this error

Error:
java.sql.SQLException: [Simba]BigQueryJDBCDriver Error executing query job. Message: Column 6 contains an aggregation function, which is not allowed in GROUP BY at [53:19]

SQL:
select cohort_definition_id cohort_id

  • ,period_begin*
  • ,calendar_interval*
  • ,count() records
  • ,count(distinct subject_id) subjects*
  • ,sum(date_diff(if(safe_cast(case *
  •   		when cohort_end_date >= period_end*
    
  •   			then period_end*
    
  •   		else cohort_end_date*
    
  •   		end  as date) is null,parse_date('%Y%m%d', cast(case *
    
  •   		when cohort_end_date >= period_end*
    
  •   			then period_end*
    
  •   		else cohort_end_date*
    
  •   		end  as string)),safe_cast(case *
    
  •   		when cohort_end_date >= period_end*
    
  •   			then period_end*
    
  •   		else cohort_end_date*
    
  •   		end  as date)), if(safe_cast(case *
    
  •   		when cohort_start_date >= period_begin*
    
  •   			then cohort_start_date*
    
  •   		else period_begin*
    
  •   		end  as date) is null,parse_date('%Y%m%d', cast(case *
    
  •   		when cohort_start_date >= period_begin*
    
  •   			then cohort_start_date*
    
  •   		else period_begin*
    
  •   		end  as string)),safe_cast(case *
    
  •   		when cohort_start_date >= period_begin*
    
  •   			then cohort_start_date*
    
  •   		else period_begin*
    
  •   		end  as date)), day) + 1) person_days*
    
  • ,count(case *
  •   	when cohort_start_date >= period_begin*
    
  •   		and cohort_start_date <= period_end*
    
  •   		then subject_id*
    
  •   	else null*
    
  •   	end) records_incidence*
    
  • ,count(distinct case *
  •   	when cohort_start_date >= period_begin*
    
  •   		and cohort_start_date <= period_end*
    
  •   		then subject_id*
    
  •   	else null*
    
  •   	end) subjects_incidence*
    
  • from test_writable_sulieman.cohorttabletest*
    inner join ofcjbawzcalendar_periods cp on (
  •   cohort_start_date >= period_begin*
    
  •   and cohort_start_date <= period_end*
    
  •   )*
    
  • or (*
  •   cohort_end_date >= period_begin*
    
  •   and cohort_end_date <= period_end*
    
  •   )*
    

where cohort_definition_id in (101431)

  • group by 2, 3, 6 ;*

hi @Gowtham_Rao,

Any help here is much appreciated.

I am not really sure. I think this would be a SQL translation issue?

Thanks @jposada . I believe something get switched in the render and translate process. I have been going through the files to identify the source of the problem. The original query is suppose to group on cohort_definition_id cohort_id,period_begin and ,calendar_interval. For some reason, during rendering and translation, the group by is switched from “GROUP BY period_begin,calendar_interval,cohort_definition_id;”
to
“GROUP BY 2,3,6”
So I changed the original query from “GROUP BY period_begin,calendar_interval,cohort_definition_id;” to “GROUP BY 1,2,3” to eliminate the chances of change during translation and I am still getting the same error.

@Gowtham_Rao , @gregk , any thoughts?

@Gowtham_Rao , Apparently I just hit reply when you did. Yes, I believe it is a translation issue. check the post that I just wrote. Something is off about the translation of GROUP BY

hi @Gowtham_Rao if you could help us identify which function is launching that SQL it is easier to get to the root cause and possibly push a fix in SQlrender of Dbconnector with the wonderful help of @Konstantin_Yaroshove and @gregk

@jposada (Also for some of tagged people @Gowtham_Rao @Konstantin_Yaroshove @gregk ) : I figured it out. The error happened in ComputeTimeSeries.sql
Apparently the translator will mess up the order of column names when we rename the column. For instance, this query renamed the first column, so the translator always switch column 1 (the renamed one) to 6 which is an aggregate function.
So my fix as rewriting the query in that file, to rename all the columns in inner query and use an outer query for aggregation. I believe this error will happen for similar queries. I will test the other parameters by turning them on. Below, I included my rewritten query for anyone who might have similar issue or for future releases:

SELECT cohort_id,period_begin,calendar_interval,
COUNT_BIG(*) records,
COUNT(DISTINCT subject_id) subjects,
SUM(days) person_days,COUNT_BIG(r_incidence) records_incidence,COUNT_BIG(DISTINCT s_incidence) subjects_incidence
FROM
(SELECT cohort_definition_id cohort_id
,period_begin
,calendar_interval

,subject_id
,(datediff(dd, CASE 
			WHEN cohort_start_date >= period_begin
				THEN cohort_start_date
			ELSE period_begin
			END, CASE 
			WHEN cohort_end_date >= period_end
				THEN period_end
			ELSE cohort_end_date
			END) + 1) days
,(CASE 
		WHEN cohort_start_date >= period_begin
			AND cohort_start_date <= period_end
			THEN subject_id
		ELSE NULL
		END) r_incidence
,( CASE 
		WHEN cohort_start_date >= period_begin
			AND cohort_start_date <= period_end
			THEN subject_id
		ELSE NULL
		END) s_incidence

FROM @cohort_database_schema.@cohort_table
INNER JOIN #calendar_periods cp ON (
cohort_start_date >= period_begin
AND cohort_start_date <= period_end
)
OR (
cohort_end_date >= period_begin
AND cohort_end_date <= period_end
)
WHERE cohort_definition_id IN (@cohort_ids)
)
GROUP BY cohort_id,period_begin,calendar_interval;

Interesting - and thank you!!

Could you please summarize it here Issues · OHDSI/CohortDiagnostics · GitHub
and i will issue a hot fix.

If you are interested and able, please feel free to do a PR too.

@Gowtham_Rao : I will summarize it and post to on Github issue.
Do you mind telling me what is PR? have been using the tool for three weeks only :slight_smile:

@Gowtham_Rao : I summarized the issue and submitted it here.

Let me know what is the next step or if there is anything else I can do.
I would be happy to do a PR but it will be great if you tell me what does the PR entitle

Hi @LinaSulieman ,

A PR is a Pull Request. You can find one example of one that is currently open here

You can learn more about what they are here

Thanks @jposada . I have done of those in the past but I will definitely check the posts to refresh my memory

t