Hello everyone,
I was reading through the thread and thought I’d fill in some gaps:
When the WebAPI implementation was being formed, the question of which technology stack we’d use: java, C#, node, R. In the end, the community said that Java was the right platform to implement the services. This doesn’t exclude people from creating solutions in their favorite technology which can certainly be hosted in a repo under OHDSI, but for interop with WebAPI, we asserted that the technology stack going into it is Java. We wanted to avoid the problem of 'to use feature X, you must install platform Y (Y in ['python, ‘cold fusion’, ‘go’, ‘hack’, ‘haskell’, ‘node’…]). As long as the new feature can be packaged up in a java library and used without additional setup (R service bus being a notable exception, but we all agreed that the platforms we must support are R and Java) we can adopt it… So, moving forward with the decision to use Java, we incorporated Spring, which gives access to db connection tools (that sit on top of JDBC), ORM tools (which we are using JPA 2.1 provided by hibernate), a security framework provided by shiro (which provides connectors for different methods of authentication), etc.
So bearing this all in mind, talking about a technology solution that is provided by a non-Java platform (such as loopback’s solution for db connectivity but being an node-specific library) isn’t something we can adopt, at least from my understanding.
All the ideas above are very thoughtful and intense, but my worry is that the reason why SQLRender has lasted as long as it has is because there’s a fundamental principle: that we’re writing some form of SQL. That’s why my message to the Hadoop team was that ‘if you can get a SQL layer to sit on top of hadoop so that we can interact with the data using a SQL interface, then it’ll work. If you have to use a custom language construct to get to the data, it won’t (in the context of ohdsi tools)’. We do some pretty crazy (tho ansi-sql compliant) sql maneuvers to get the results from our data. When looking at some of the SQL abstraction frameworks, we were limited by what those abstractions exposed (temp tables being a notable exclusion). Now that we’re talking about hitting datasources in an agnostic way (maybe sql, maybe noSql, maybe some custom dataaccess construct), I don’t see us comming up with a custom language that completely abstracts us from the underlying datastore. SQL already does this: different db providers provide the access to their native database table binary implementations using a standardize query language (SQL). Are we here to invent another SQL? Hasn’t SQL been developed for decades? Haven’t countless others tried?
From my perspective, the main problem I’m seeing in SQL Render is writing queries with a SMP (Symmetric Multiprocessor) context and then expect it to work well in MPP (Massively Parallel Processor). For those of you who don’t know, SMP RDBMs are your typical MSSQL, oracle, postgresql databases, while MPP RDBMs are MS AWS, Amazon Redshift, Cloudera Impala, CitusDB). The main difference on the MPP side is how tables are distributed across nodes (using a partitioning strategy like hashing) and how queries should be structured to avoid data movement.
This second factor is what makes using a single query that gets translated very…interesting. Common performance problems in MMP are solved by CTASing a query into a re-partitioned form so that subsequent joins are ‘compatable’ (meaning that you don’t need to shuffled data between nodes to do the join). If you just write it like SMP, then MPP queries slow down a lot because the’re not efficient (tho they are both SQL). So I’ve spent a lot of time reworking queries that give adequate performance in both contexts and avoided writing platform specific code.
So, this is a long way of saying, if there’s something that I’d like to see in SQL Render 2.0, it’s a way to write a single SQL but give MPP hints so that certain optimazations can be made, but beyond the feeling of this need, I don’t really have a good solution to do this.
Long live sql render.
-Chris