Grouping by complex expressions

MatthewYoungLai · May 19, 2017, 9:01pm

Continuing the discussion from Supporting DBMS platforms other than SQL Server, PostgreSQL, Oracle and RedShift :

I’m taking another look at BigQuery support. An issue that I’m running into with the queries in Achilles_v5.sql is that they sometimes do things of this form:

select complex_expression( col ) from t group by complex_expression( col )

BigQuery can’t handle this properly and complains that col needs to be in the group by clause. Adding BigQuery-specific translation rules to SqlRender to give the following could work:

select complex_expression( col ) from t group by col

However, so far it seems like doing that in SqlRender isn’t going to be straightforward. As an alternative, I’m wondering if it might be permissible to change the queries to this form in Achilles_v5.sql:

select complex_expression( col ) from t group by 1

Are there any currently supported DBMS systems that this would break?

Chris_Knoll · May 20, 2017, 5:01pm

I don’t think you’ll find all databases interpreting ‘group by 1’ as to mean group by a column index. Some DBs will treat that as a scalar value. If you can show that group by 1 means group by column index in all cases, I don’t see why we can’t adopt that format for grouping columns.

Tho I’d also suggest you cover all the necessary features required to run the achilles script (which pretty much leverages all the sql features that woudl be needed in any of the other tools,s o getting achilles to work should get you working across the board):

Temp tables
Window Functions
common table expressions (CTEs) (not recursive)
DDL actions such as CREATE, INSERT, DELETE

After you tick all those boxes, then i think you could work on re-witing queries. What I’m hoping to avoid (unless you have the time) is rewriting scripts and then finding out one of the above doesn’t work and then all the efforts doesn’t get you a complete solution.

-Chris

MatthewYoungLai · May 21, 2017, 3:27pm

Based on some testing it looks like you’re correct: sql server and postgres support the “group by 1” syntax but oracle doesn’t. For reference, I used these sites to test that: livesql.oracle.com, rextester.com/l/sql_server_online_compiler, sqlfiddle.com

However, it looks like all three support grouping by the base column references in the complex expression. That is, they’ll accept the following:

select complex_expression( col1, col2, col3 ), count(*) from t group by col1, col2, col3

I think redshift and netezza are based on posgresql and should also work. I’m not sure about impala. Are there any other DBs that need to be supported? Is there some test environment where it’s easy to confirm that SQL statements work across all the DBs OHDSI uses?

To your other point, I’ve taken an approach of adding support for missing features in the starschema JDBC driver. The patched code is here: https://github.com/myl-google/starschema-bigquery-jdbc/tree/dml . Window functions and CTEs are already supported by the new standard sql dialect in BigQuery. I needed to add temp tables and DDL support but think I have all the required features covered now. Currently, I’m about halfway done working through the achilles script and adding rewrites to SqlRender. I’ve worked around the GROUP BY problem so far, but it would be really nice to find a simpler solution.

Matthew

Patrick_Ryan · May 21, 2017, 3:45pm

@MatthewYoungLai, thanks for digging into this, I’m sure @mgkahn, @karthik, @hripcsa and others are going to be quite excited that Google is working on developing this out. If BigQuery gets to the point that it supports SQL requirements for the webAPI, then it’d be great to have it onboard, much as @shawndolley and @tomwhite are pushing nicely to have Impala be established as an accepted RDBMS platform for OHDSI users. We are definitely happy to see the OHDSI stack be compatible with whatever RDBMS platforms, so long as the community puts the development effort into testing that the required functions are supported. @Frank leads the OHDSI Architecture workgroup and if there’s specific issues you run into where you need to chat with a team live, in addition to the async forum discussions, then that’d be the place to do it. @lee_evans created a test harness for Postgres, Oracle, and SqlServer, and I don’t see any reason why that can’t be expanded to Impala, Redshift, Netezza, and BigQuery, it just means someone needs to stand up an environment and make it publicly available for testing purposes.

The ACHILLES script is definitely a good place to start, but it does not contain all the SQL functions that we may use of throughout the webAPI or within the various R packages, like CohortMethod and PatientLevelPrediction. Once ACHILLES is working, another thing to look at are the functions that are contained with SqlRender across the dialects…if they are in there, it’s for a reason (that is actually used in some analytical function). @schuemie may have another recommendation of where else to look to test things. But in terms of the big-ticket items that were holding back BigQuery before, I think it was windowing functions, CTEs, temp tables, and efficient CREATE/INSERT/CTAS options.

Chris_Knoll · May 21, 2017, 7:03pm

Hey, @Patrick_Ryan, I’m very much interested in knowing what those are so that when I suggest a checklist for compatibility, I cover all the bases. Which functions are you referring to that used outside of Achilles?

Patrick_Ryan · May 21, 2017, 7:13pm

@schuemie can correct me if I misspeak, but I expect that the ‘master’ list
of functions we use is subsumed by the SqlRender replacement pattern
listing:

github.com

OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv

To,Pattern,Replacement
oracle,"INSERT INTO @table (@columns) VALUES (@values1),(@values2)",<start_of_insert> INSERT INTO @table (@columns) VALUES (INTO @table <par_open>@columns<par_close> VALUES <par_open>@values1<par_close>\n INTO @table <par_open>@columns<par_close> VALUES <par_open>@values2<par_close>\n)
oracle,INTO @table <par_open>@columns<par_close> VALUES <par_open>INTO @table <par_open>@columns<par_close> VALUES,INTO @table <par_open>@columns<par_close> VALUES
oracle,<start_of_insert> @a INSERT INTO @table (@columns) VALUES (@b),INSERT ALL\n@bSELECT * FROM dual
oracle,<par_open>,(
oracle,<par_close>,)
oracle,EXCEPT,MINUS
oracle,"CONCAT(@a, @b, @c)","CONCAT(@a, CONCAT(@b, @c))"
oracle,"DATEADD(second,@seconds,@datetime)","(@date + NUMTODSINTERVAL(@seconds, 'second'))"
oracle,"DATEADD(minute,@minutes,@datetime)","(@date + NUMTODSINTERVAL(@minutes, 'minute'))"
oracle,"DATEADD(hour,@hours,@datetime)","(@date + NUMTODSINTERVAL(@hours, 'hour'))"
oracle,"DATEADD(d,@days,@date)","(@date + NUMTODSINTERVAL(@days, 'day'))"
oracle,"DATEADD(dd,@days,@date)","(@date + NUMTODSINTERVAL(@days, 'day'))"
oracle,"DATEADD(day,@days,@date)","(@date + NUMTODSINTERVAL(@days, 'day'))"
oracle,"DATEADD(month,@months,@date)","ADD_MONTHS(@date, @months)"
oracle,"DATEADD(mm,@months,@date)","ADD_MONTHS(@date, @months)"
oracle,"DATEADD(m,@months,@date)","ADD_MONTHS(@date, @months)"
oracle,"DATEADD(year,@years,@date)","ADD_MONTHS(@date, 12 * @years)"
oracle,"DATEADD(yyyy,@years,@date)","ADD_MONTHS(@date, 12 * @years)"
oracle,"DATEADD(yy,@years,@date)","ADD_MONTHS(@date, 12 * @years)"

This file has been truncated. show original

Patrick_Ryan · May 21, 2017, 7:19pm

That was a dumb reply, sorry. Replacement patterns would only cover those
cases where the syntax is different between RDBMS systems. Of course,
there are a multitude of ANSI SQL functions that contain the same syntax
(so aren’t in replacementPatterns.csv) that we make heavy use of, like
SELECT, COUNT, MIN, MAX, FROM, WHERE, GROUP BY, HAVING, CASE WHEN THEN ELSE
END , plus mathematical operators on numeric columns (we already know dates
are different), etc… The more interesting/annoying case is when the
syntax is the same but the execution is different. We had the earlier
case of BETWEEN being in all dialects, but its interpretation between
different such that we need to be explicit in our OHDSql to say a <= b
AND b <= c instead of doing b BETWEEN a and c. I think this thread
with the example of how GROUP BY is processed is another good example where
I don’t know off the top of my head how to solve it.

schuemie · May 22, 2017, 4:10am

The SqlRender vignette contains a non-exhaustive list of things the translation should support. That might be a good place to start…

Chris_Knoll · May 22, 2017, 5:17am

Good one, @schuemie, I’ll point future requests at that link. There are a couple of things in that list that I don’t believe are used in Achilles (INTERSECT and EXCEPT) so there a few small gaps, but I think first step into getting a platform certified for OHDSI tools would be go get an Achilles implementation working without script modifications (or if modifications are necessary, that they are comparable with the other platforms). naturally, it’s not the only functions required, but it’s a good minimum test.

-Chris

MatthewYoungLai · May 24, 2017, 12:50am

Thanks for that link. I’ll take a closer look at it as a next step once I have the achilles script working.

I’d also like to get more details about the mentioned test harnesses, but will start a separate thread to follow up about that later.

For the GROUP BY problem, the cleanest workaround I’ve found so far is to use some of the SqlRender utility functions in a custom control loop outside of the one that usually iterates through replacementPatterns.csv. The idea is to flatten all the elements down to base column references. There’s a code snippet below in case anyone wants to comment on whether this sort of thing would ever be acceptable as a patch to SqlRender.

private static String flattenBigQueryGroupBy(String sql) {
	List<Block> parsedPattern = parseSearchPattern("group by @@a;");
	MatchedPattern matchedPattern = search(sql, parsedPattern, 0);
	while (matchedPattern.start != -1) {
		String last_group_by = null;
		String next_group_by = "," + matchedPattern.variableToValue.get("@@a") + ",";
		do {
			last_group_by = next_group_by;
			final String[][] replacement_patterns = {
					{"dateadd(@@a,@@b)", "@@b"},
					{"datediff(@@a,@@b)", "@@b"},
					{", @@([a-z]+)f(@@a)", ", @@a"},
					{", (@@a)", ", @@a"},
					{", @@a - @@b,", ", @@a, @@b,"},
					{", @@a / @@b,", ", @@a, @@b,"},
					{", @@([0-9]+)c", ""},
					{"+", ","},
					{"*", ","},
			};
			for (int i = 0; i < replacement_patterns.length; ++i) {
				next_group_by = searchAndReplace(
                                        next_group_by,
                                        parseSearchPattern(replacement_patterns[i][0]),
                                        replacement_patterns[i][1]);
			}
		} while (!next_group_by.equalsIgnoreCase(last_group_by));
		sql = sql.substring(0, matchedPattern.start) + " group by "
				+ next_group_by.substring(1, next_group_by.length() - 1) + ";"
				+ sql.substring(matchedPattern.end, sql.length());
		matchedPattern = search(sql, parsedPattern, matchedPattern.startToken + 1);
	}
	return sql;
}

schuemie · May 24, 2017, 9:08am

Hi @MatthewYoungLai, just to satisfy my curiosity, could you provide some real examples (so not pseudocode) of queries that are currently failing to translate to BigQuery because of the GROUP BY issue?

One thing that might be helpful is the SqlRenderWeb app that allows you to test your code against SQL Server, PostgreSQL, and Oracle.

MatthewYoungLai · May 24, 2017, 1:48pm

Here are a couple of examples:

insert into ohdsi.ACHILLES_results (analysis_id, stratum_1, count_value)
select 101 as analysis_id, CAST(year(op1.index_date) - p1.YEAR_OF_BIRTH AS VARCHAR(255)) as stratum_1, COUNT_BIG(p1.person_id) as count_value
from ohdsi.PERSON p1
inner join (select person_id, MIN(observation_period_start_date) as index_date from ohdsi.OBSERVATION_PERIOD group by PERSON_ID) op1
on p1.PERSON_ID = op1.PERSON_ID
group by year(op1.index_date) - p1.YEAR_OF_BIRTH;

insert into ohdsi.ACHILLES_results (analysis_id, stratum_1, count_value)
select 108 as analysis_id, CAST(floor(DATEDIFF(dd, op1.observation_period_start_date, op1.observation_period_end_date)/30) AS VARCHAR(255)) as stratum_1, COUNT_BIG(distinct p1.person_id) as count_val
ue
from ohdsi.PERSON p1
inner join
(select person_id,
OBSERVATION_PERIOD_START_DATE,
OBSERVATION_PERIOD_END_DATE,
ROW_NUMBER() over (PARTITION by person_id order by observation_period_start_date asc) as rn1
from ohdsi.OBSERVATION_PERIOD
) op1
on p1.PERSON_ID = op1.PERSON_ID
where op1.rn1 = 1
group by floor(DATEDIFF(dd, op1.observation_period_start_date, op1.observation_period_end_date)/30)
;

schuemie · May 24, 2017, 2:47pm

Sorry for being dense, but are you saying that this query (a simplified version of one of your examples)

SELECT FLOOR(DATEDIFF(dd, observation_period_start_date, observation_period_end_date) / 30) AS months_obs,
	COUNT(DISTINCT person_id) AS count_value
FROM ohdsi.observation_period
GROUP BY FLOOR(DATEDIFF(dd, observation_period_start_date, observation_period_end_date) / 30);

fails because the GROUP BY is complex, but this query

SELECT FLOOR(DATEDIFF(dd, observation_period_start_date, observation_period_end_date) / 30) AS months_obs,
	COUNT(DISTINCT person_id) AS count_value
FROM ohdsi.observation_period
GROUP BY observation_period_start_date, observation_period_end_date;

works? But the semantics of the second query is completely different from the first one, right?

MatthewYoungLai · May 24, 2017, 3:35pm

Yes, that’s a good point. Extracting base columns in the GROUP BY is often equivalent since the values are 1-1 with values of the complex expression. It’s not equivalent when the base column values are many to one with the expression values as in the example.

So I’ll need to change to use the second workaround mentioned earlier based on select list ordinal references. For the second example, that looks like this:

SELECT FLOOR(DATEDIFF(dd, observation_period_start_date, observation_period_end_date) / 30) AS months_obs,
COUNT(DISTINCT person_id) AS count_value
FROM ohdsi.observation_period
GROUP BY 1;

For the first example, there’s a potential problem. The expression in the SELECT list and GROUP BY are different:

Select list:

CAST(year(op1.index_date) - p1.YEAR_OF_BIRTH AS VARCHAR(255))

group by:

year(op1.index_date) - p1.YEAR_OF_BIRTH

In general, these values may not be 1-1. However, they always are for real year values which fit into a varchar(255) without loss of precision. Trying to automatically recognize when this is true isn’t really going to be feasible as part of rewriting. The alternative is to just assume that it’s true for any GROUP BY expression that matches a SELECT list expression except for a CAST. Does that sound reasonable?

MatthewYoungLai · May 24, 2017, 8:33pm

OK, at least in the achilles script, it seems that all GROUP BY expressions that don’t exactly match a select list expression differ only in a single cast to varchar(255). I’ve now managed to run the entire script successfully using synpuf data loaded into bigquery. Besides the starschema jdbc driver patches, the only things that went outside standard SqlRender usage were the GROUP BY issue and handling named column lists in common table expressions. The matchBigQueryGroupBy() and aliasBigQueryCommonTableExpressions() functions in the following show how those were handled:

github.com

myounglai/SqlRender/blob/04938f954476b779083cb19f532c9136ba2e0a4b/java/org/ohdsi/sql/SqlTranslate.java

/*******************************************************************************
 * Copyright 2017 Observational Health Data Sciences and Informatics
 * 
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * 
 *   http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 ******************************************************************************/
package org.ohdsi.sql;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;

This file has been truncated. show original

The need for most or all of these patches will hopefully go away gradually as open BigQuery feature requests get addressed. In the meantime, I’m going to continue looking into what other changes are needed to run things beyond the achilles scripts.

schuemie · May 25, 2017, 6:41am

Impressive work!

I’d be happy to include a patch in SqlRender, since the effect will clearly be isolated to BigQuery translations only. But before we do, I would like to see some unit tests added for BigQuery, including tests specifically for the GROUP BY fix. Unit tests can be found here,

We also highly recommend making a BigQuery instance available (through secure credentials) for testing as part of our continuous integration. For example, we test that Achilles at least doesn’t break for three platforms with each push to the repo. It would be great if we could add BigQuery to that.

MatthewYoungLai · May 25, 2017, 3:45pm

Sounds good! I’ll add unit tests and otherwise get the changes ready to submit pull requests.

A few questions related to setting up a test instance:

I’ll probably want to provide credentials as a userid and a p12 file. Will it be possible to store that file somewhere on the machine that runs the continuous integration so that the password field can reference the path of the file?
How is access to the testing credentials controlled and how many people have access?
How often does the continuous integration run every day? (This may be relevant for setting up quota for the instance.)
What data should be in the test instance? Would a complete schema with all empty tables be sufficient since the continuous integration isn’t checking results?

schuemie · May 26, 2017, 11:01am

To answer your questions:

In the past we’ve set credentials as environmental variables on Travis. I don’t have experience with p12 files, but imagine we could encode a p12 file as an environmental variable.
The credentials are set on Travis. The only two persons who have access to setting the variables are me and @msuchard. In theory, a malicious submitter could push code that reveals the environmental variables, but that would be (a) immediately apparent, (b) completely traceable to the villain, and © could still only be one of the OHDSI developers.
Continuous integration runs whenever someone pushes to the master branch of the repository. As you can see here, that happens once every few days for Achilles.
Yes, for now just having the empty tables of the CDM schema should be sufficient.If we think of more elaborate unit tests we will need to think of a data set that should be included in all testing environments, but we have not made any plans to do so.

MatthewYoungLai · May 26, 2017, 9:20pm

Thanks, I just created a pull request for the SqlRender changes: https://github.com/OHDSI/SqlRender/pull/70

I’ll set up the test environment and submit additional patches for Achilles and DatabaseConnector next week.

MatthewYoungLai · May 31, 2017, 10:33pm

Two more pull requests: https://github.com/OHDSI/SqlRender/pull/71 , https://github.com/OHDSI/DatabaseConnector/pull/39

The second one seems to have failed the continuous build, but I think it’s for unrelated reasons. Please let me know if it looks otherwise.

The two changes above are enough to get the heel script working. (I just had the achilles script working before, and will have more followup changes to get the export scripts working as well.)

I also have a test environment set up now. How would you like me to pass over the credentials? One option if you have a gmail account is for me to put them in a GCS bucket and give your account permission to download from there.