Ideas for SqlRender 2

Sigfried_Gold · January 5, 2017, 4:51pm

Thanks, Martijn.

I think that platform-specific feature is important, and should allow us the best of both worlds, but, right, only if people use it. Maybe they would be more likely to if there were examples in the code base? But also, does that feature only allow overriding of an OhdsiSql statement with a platform-specific statement, or can it also be used to add a platform-specific statement in such a way as to be used by people running that platform and to fail in some informative way for people who don’t?

Also, when talking about SqlRender, I have to admit that I don’t know the scope of how it fits into the OHDSI ecosystem. I know it’s invoked by every (I think) api call in WebAPI, and it’s used by a lot of the R code. But how easy would it be to apply it somehow to something like my proof-of-concept drugExposureExplorer code (https://github.com/OHDSI/DrugExposureExplorer/, http://www.ohdsi.org/web/drugexposureexplorer/) if someone who didn’t have Postgres wanted to run it, what could they do – or, rather – that’s a bad example because I do a lot of automated SQL construction in the API server, but if I wanted to make it available to others in the community without Postgres, what would I do? (The API server is written in Node.js.)

schuemie · January 6, 2017, 8:33am

Hi Sigfried,

Lets take a step back: All OHDSI tools that talk to the CDM need to solve the same two problems: connecting to different DBMSs (e.g. Oracle or Postgres), and talking the right SQL dialect. Two loosely coupled worlds have evolved so far with very similar solutions to these problems:

The world of R-packages, where the DatabaseConnector package (which uses JDBC drivers) solves the connection problem, and the SqlRender package solves the SQL dialect problem
The world of web tools based on the WebAPI. The WebApi is Java running on a server (Tomcat I guess). It solves the connection problem by directly using JDBC drivers, and uses the SqlRender library to solve the SQL dialect problem.

The SqlRender Java library is embedded in the SqlRender R package, so these two worlds are re-using the same code to solve the SQL dialect problem.

With your DrugExposureExplorer you are now introducing a third world, one based on Node.js. I can’t say whether it is a good idea to do so, but you should remember that you’re now asking users to install Node.js in their environment, next to Tomcat for the WebAPI and R for the R packages. I don’t know enough about Node.js to help you solve the two aforementioned problems. For the SQL dialect problem, perhaps you could use the SqlRender Java library through node-java. But if you want to use a different solution, like write platform-specific SQL for each target platform, you are certainly free to do so. I see SqlRender mostly to aid lazy people like me, who want to write code only once.

Alternatively, you could try and move your application to one of the two existing worlds. Maybe you could use a WebAPI back-end instead of the Node.js one? There would be many advantages, not in the least because you could use existing features like the security layer that is now being developed, and of course re-using WebAPI’s solutions to the connection and SQL dialect problem.

Christian_Reich · January 6, 2017, 12:11pm

@schuemie:

You solution is very strong in two ways:

you control it, which means OHDSI controls it to some degree
you built it, so we have it. Sounds trivial, but its the most important one.

It’s weak in other ways:

We shouldn’t bother with that stuff, it’s not an OHDSI problem
It’s linked to R, which is good for coding algorithms, but not for everything (like @Sigfried_Gold’s idea)
It’s very hard to maintain and expand

We should keep using the SQLRender bandaid while we still can. A real solution I would build with folks who have some headstart on that. In a prior life I built a parser based on context-free grammar (even some context at the end), and as much as I loved it intellectually, I would not touch it with a 10-foot pole today. It’s a time sink.

Sigfried_Gold · January 6, 2017, 1:34pm

@schuemie, your point about WebAPI’s new security layer is pretty hard to argue with. I have some reasons and justifications for doing my API development independently of WebAPI (I can implement new API features much more quickly and my client’s ultimate target platform, GBQ, is unlikely to be supported by WebAPI/SqlRender anytime soon), and I think, as a community, we need to discuss the possibility of providing alternatives to WebAPI (I’d love to see further discussion of @akshayubhat’s serialization ideas), but that might belong in a new thread. I will throw in, though, regarding the connection problem, that the API framework I’m using, loopback, solves it better than WebAPI does – and I think the primary justification for SqlRender is in the SQL dialect problem.

@JasonPoovey brought up the idea of a command line tool earlier in this thread, and that could be great for allowing SqlRender to be used beyond its current worlds of R packages and WebAPI.

You didn’t address my earlier question about SqlRender’s override features and whether they could allow a platform-specific query to exist without an OhdsiSql version. As @Christian_Reich (and you, to some degree) suggested, SqlRender’s life as a translation engine may be limited, but a focus on its override features (which should maybe be implemented in a separate utility?) could serve us in perpetuity. What we need, I think, are good conventions for storing SQL of various flavors, loading the appropriate flavor when it exists, assembling it to the appropriate flavor when its construction depends on variable parameters, and auto-translating it where possible. SqlRender’s override features haven’t been used so far, but I think if they were conceived more broadly, they could extend the use of its translation capacities indefinitely while also giving our community a way of solving some of its database diversity issues beyond R, WebAPI, and a small set of RDBMSs.

JasonPoovey · January 6, 2017, 2:52pm

A little late in the thread, but referring to alternatives to SQL translation engines is using something like an Object Relational Mapper (ORM). I’ve used these in python land for great success. However, the downside being it would not be language agnostic (e.g. we’d need one for R, Java, Python, node.js, etc.)

As far as templating goes, there’s also options like jinja or mustache we could use, rather than re-define a new DSL.

As far as ORM options go, here’s a small list I found for the languages in question:
node.js - http://docs.sequelizejs.com/en/v3/
python - sqlalchemy
R - dplyr
Java - Torque

A more complete list is here - https://en.wikipedia.org/wiki/List_of_object-relational_mapping_software

JasonPoovey · January 6, 2017, 3:03pm

Reading back through the rest of the thread. I agree with @Sigfried_Gold that having to start with MSSQL or OHDSI SQL is not good when you want all the bells and whistles. As it stands now you could put DBMS specific SQL as an option in addition to the MSSQL code. For SQLRender 2.0 the question would be, if I write a complex Postgres query, do I need to also write a simplified OHDSI SQL version as well before submitting it to the repository? I don’t really see a way around that without turning SQLRender into an all-to-all translator (which is definitely way outside the scope of OHDSI)

Sigfried_Gold · January 6, 2017, 3:27pm

Thanks, @JasonPoovey. I think the way around that problem is conventions and good error handling in the SqlRender code loader: 1) If I write a Postgres only query, I need to give people an easy way to opt in or out of using it (maybe it goes in a parallel GitHub branch…?), 2) If someone does end up with code only provided in SQL dialects other than the one they’re using, SqlRender can warn them and point them to it before runtime.

Chris_Knoll · January 6, 2017, 4:58pm

Hello everyone,
I was reading through the thread and thought I’d fill in some gaps:

When the WebAPI implementation was being formed, the question of which technology stack we’d use: java, C#, node, R. In the end, the community said that Java was the right platform to implement the services. This doesn’t exclude people from creating solutions in their favorite technology which can certainly be hosted in a repo under OHDSI, but for interop with WebAPI, we asserted that the technology stack going into it is Java. We wanted to avoid the problem of 'to use feature X, you must install platform Y (Y in ['python, ‘cold fusion’, ‘go’, ‘hack’, ‘haskell’, ‘node’…]). As long as the new feature can be packaged up in a java library and used without additional setup (R service bus being a notable exception, but we all agreed that the platforms we must support are R and Java) we can adopt it… So, moving forward with the decision to use Java, we incorporated Spring, which gives access to db connection tools (that sit on top of JDBC), ORM tools (which we are using JPA 2.1 provided by hibernate), a security framework provided by shiro (which provides connectors for different methods of authentication), etc.

So bearing this all in mind, talking about a technology solution that is provided by a non-Java platform (such as loopback’s solution for db connectivity but being an node-specific library) isn’t something we can adopt, at least from my understanding.

All the ideas above are very thoughtful and intense, but my worry is that the reason why SQLRender has lasted as long as it has is because there’s a fundamental principle: that we’re writing some form of SQL. That’s why my message to the Hadoop team was that ‘if you can get a SQL layer to sit on top of hadoop so that we can interact with the data using a SQL interface, then it’ll work. If you have to use a custom language construct to get to the data, it won’t (in the context of ohdsi tools)’. We do some pretty crazy (tho ansi-sql compliant) sql maneuvers to get the results from our data. When looking at some of the SQL abstraction frameworks, we were limited by what those abstractions exposed (temp tables being a notable exclusion). Now that we’re talking about hitting datasources in an agnostic way (maybe sql, maybe noSql, maybe some custom dataaccess construct), I don’t see us comming up with a custom language that completely abstracts us from the underlying datastore. SQL already does this: different db providers provide the access to their native database table binary implementations using a standardize query language (SQL). Are we here to invent another SQL? Hasn’t SQL been developed for decades? Haven’t countless others tried?

From my perspective, the main problem I’m seeing in SQL Render is writing queries with a SMP (Symmetric Multiprocessor) context and then expect it to work well in MPP (Massively Parallel Processor). For those of you who don’t know, SMP RDBMs are your typical MSSQL, oracle, postgresql databases, while MPP RDBMs are MS AWS, Amazon Redshift, Cloudera Impala, CitusDB). The main difference on the MPP side is how tables are distributed across nodes (using a partitioning strategy like hashing) and how queries should be structured to avoid data movement.

This second factor is what makes using a single query that gets translated very…interesting. Common performance problems in MMP are solved by CTASing a query into a re-partitioned form so that subsequent joins are ‘compatable’ (meaning that you don’t need to shuffled data between nodes to do the join). If you just write it like SMP, then MPP queries slow down a lot because the’re not efficient (tho they are both SQL). So I’ve spent a lot of time reworking queries that give adequate performance in both contexts and avoided writing platform specific code.

So, this is a long way of saying, if there’s something that I’d like to see in SQL Render 2.0, it’s a way to write a single SQL but give MPP hints so that certain optimazations can be made, but beyond the feeling of this need, I don’t really have a good solution to do this.

Long live sql render.

-Chris

JasonPoovey · January 6, 2017, 5:19pm

@Chris_Knoll I agree, let’s not re-write SQL.

I see the reasoning for sticking with Java and R to simplify setup. But a few points there:

For downstream analysis (CohortMethod and PLP) there is a more polyglot approach, and that is necessary to allow people to write in the best language and platform for the job (ref: PLP’s use of scikit-learn from python)
With the prevalence of containerization (e.g. Docker) the massive dependency problem can be somewhat mitigated. Just use Broadsea as the deployment platform to supply a docker container with all the dependencies pre-built, and then the “it’s too hard to install all of this” problem is much lower.

How this relates to SQLRender 2.0, is I think SQLRender has broader use beyond the WebAPI, and we do need to consider how it is used outside of Java and R.

hajarhomayouni · January 7, 2017, 8:03am

Scheumie,

With the supports for standard SQL by Google BigQuery, and the outstanding features it provides for Data Warehousing, it would be nice if you can add supports for GBQ to the new version of SqlRender. The new version of GBQ provides all types of DML and DDL queries and all functionalites required by OHDSI tools. I have added some translations to my fork for SqlRender for GBQ.

Moreover, Google has provided Simba JDBC driver to connect to GBQ which is needed for DatabaseConnector.

If you are interested in supporting GBQ, I am really interested to contribute.

schuemie · January 9, 2017, 12:44pm

Thanks everyone for contributing to the discussion!

I understand some (all) people don’t love SqlRender, but I also haven’t heard alternatives that would work for the OHDSI tools. So I am going ahead with a version 2.0.

I’ve tried to capture most ideas in this thread in the version 2.0 milestone. Please check it and let me know if you think something important is missing.

Also, let me know if you’re interested in helping develop version 2.0! We may organize a small task-force.

schuemie · January 9, 2017, 12:47pm

Hi @hajarhomayouni,

If I remember correctly, previous attempts at using GBQ for the OHDSI tools failed because there was no way to update data through SQL. So things like creating (temp) tables or uploading sets of IDs was not possible, even though (almost) all of the OHDSI tools require this.

Have you tried your version of SqlRender when running something like Achilles?

hajarhomayouni · January 26, 2017, 10:26pm

Hi schuemie,

The Standard SQL version of GBQ supports for all types of queries including multiple temp tables. We should use #standardSQL (in SqlRender) at the beginning of any queries we want to run on GBQ.
We can also Simba JDBC driver. I added code to [my fork][1] on DatabaseConnector to connect to GBQ using Simba.

schuemie · January 27, 2017, 8:35am

Hi @hajarhomayouni,

It looks like you’ve done an impressive amount of work! Two questions:

I would have a very strong preference for including GBQ in our continuous integration and testing. This would require a GBQ instance (possibly empty) that would be accessible from the Travis environment using a secure token. Would it be possible to set something like that up?
At the very least, add a large set of unit tests for GBQ translations here.

Finally, the de facto entry test we’ve being using for SQL dialects is whether you can get Achilles to run. Have you been able to do that?

hajarhomayouni · January 30, 2017, 5:04pm

Thank you schuemie for your reply, we are trying to set a GBQ instance accessible from Travis environment.
I have not tried running Achilles with GBQ as background, as soon as I get any progress on this, I will let you know.

schuemie · March 27, 2017, 7:35am

Hi all,

An update on SqlRender:

I played around with a full-parsing approach, but decided that SQL is just too complex a language. The benefit of having the added flexibility in SqlRender does not outweigh the cost of maintaining a highly complex parser.
I also looked at integrating SqlRender with dplyr. This does look promising, but actually belongs in the DatabaseConnector package rather than SqlRender. (I actually had this integration working against a developer version of dplyr, but then the developer version changed.)
Given that the remaining changes aren’t as dramatic, the new version is called 1.3.0 instead of 2.0.

You can find the new version on CRAN. Here is an overview of the changes:

Changes:

Added a command-line interface (see java -jar SqlRender.jar -help for details)
Added ability to use regular expression in translation patterns. This allowed SELECT TOP n to be translated.
Deprecated sourceDialect argument.
Added translation for CONCAT function with >2 arguments to Oracle (which only allows 2 arguments)
Added hints for translation optimation to massive parallel platforms like RedShift
Throw warnings when translateSql is called with variable names that are not in the SQL
Throw warnings when table names are too long for Oracle

Bugfixes:

Fixed translation for date functions so they will now work properly with datetime fields as well.
Now throwing error when boolean logic cannot be parsed (instead of assuming result is TRUE)

schuemie · September 26, 2017, 8:14am

We now have SqlRender Developer online! This is a Shiny app that allows folks to write OHDSI Sql and see in real time how it translates to the various SQL dialects. Note that this Shiny app can also be run locally from within R by typing

SqlRender::launchSqlRenderDeveloper()

Two other resources to point out are the OHDSI SQL Style Guide and the SqlRender vignette.

Vojtech_Huser · February 1, 2018, 3:27pm

The addition of Netezza and RedShift is making writing Heel SQL (that runs on all SQL flavors) quite tricky.

See example below of latest PR for Achilles

replacing all a+ b for strings with CONCAT() is a new rule for me.
Also, can group by be done just by refering to the order of the colum ‘group by 1,3,4’ ?

Can the vignette be updated? (or style guide)

even better - have just one document for SQL authoring advice.

schuemie · February 1, 2018, 4:59pm

The vignette is up to date as far as I can tell.

I’m unsure about the GROUP BY 1,3,4, but I don’t think it is a good idea anyway (what if someone changes the column order? Better to be explicit.

Chris_Knoll · February 1, 2018, 7:02pm

Agreed with the group by: use the column name, not an integer.

If you have complex columns, I’ve seen an approach that works best is to use a subquery to ‘hide’ the comlex column generation: ie:

select a, b, c, count(*) n FROM (
 select sum(*) as a, avg(case when x > 5 then 0 else 17) b, someComplex expression as c)
) Q
group by a,b,c