Hello everyone,
I was reading through the thread and thought Iād fill in some gaps:
When the WebAPI implementation was being formed, the question of which technology stack weād use: java, C#, node, R. In the end, the community said that Java was the right platform to implement the services. This doesnāt exclude people from creating solutions in their favorite technology which can certainly be hosted in a repo under OHDSI, but for interop with WebAPI, we asserted that the technology stack going into it is Java. We wanted to avoid the problem of 'to use feature X, you must install platform Y (Y in ['python, ācold fusionā, āgoā, āhackā, āhaskellā, ānodeāā¦]). As long as the new feature can be packaged up in a java library and used without additional setup (R service bus being a notable exception, but we all agreed that the platforms we must support are R and Java) we can adopt itā¦ So, moving forward with the decision to use Java, we incorporated Spring, which gives access to db connection tools (that sit on top of JDBC), ORM tools (which we are using JPA 2.1 provided by hibernate), a security framework provided by shiro (which provides connectors for different methods of authentication), etc.
So bearing this all in mind, talking about a technology solution that is provided by a non-Java platform (such as loopbackās solution for db connectivity but being an node-specific library) isnāt something we can adopt, at least from my understanding.
All the ideas above are very thoughtful and intense, but my worry is that the reason why SQLRender has lasted as long as it has is because thereās a fundamental principle: that weāre writing some form of SQL. Thatās why my message to the Hadoop team was that āif you can get a SQL layer to sit on top of hadoop so that we can interact with the data using a SQL interface, then itāll work. If you have to use a custom language construct to get to the data, it wonāt (in the context of ohdsi tools)ā. We do some pretty crazy (tho ansi-sql compliant) sql maneuvers to get the results from our data. When looking at some of the SQL abstraction frameworks, we were limited by what those abstractions exposed (temp tables being a notable exclusion). Now that weāre talking about hitting datasources in an agnostic way (maybe sql, maybe noSql, maybe some custom dataaccess construct), I donāt see us comming up with a custom language that completely abstracts us from the underlying datastore. SQL already does this: different db providers provide the access to their native database table binary implementations using a standardize query language (SQL). Are we here to invent another SQL? Hasnāt SQL been developed for decades? Havenāt countless others tried?
From my perspective, the main problem Iām seeing in SQL Render is writing queries with a SMP (Symmetric Multiprocessor) context and then expect it to work well in MPP (Massively Parallel Processor). For those of you who donāt know, SMP RDBMs are your typical MSSQL, oracle, postgresql databases, while MPP RDBMs are MS AWS, Amazon Redshift, Cloudera Impala, CitusDB). The main difference on the MPP side is how tables are distributed across nodes (using a partitioning strategy like hashing) and how queries should be structured to avoid data movement.
This second factor is what makes using a single query that gets translated veryā¦interesting. Common performance problems in MMP are solved by CTASing a query into a re-partitioned form so that subsequent joins are ācompatableā (meaning that you donāt need to shuffled data between nodes to do the join). If you just write it like SMP, then MPP queries slow down a lot because theāre not efficient (tho they are both SQL). So Iāve spent a lot of time reworking queries that give adequate performance in both contexts and avoided writing platform specific code.
So, this is a long way of saying, if thereās something that Iād like to see in SQL Render 2.0, itās a way to write a single SQL but give MPP hints so that certain optimazations can be made, but beyond the feeling of this need, I donāt really have a good solution to do this.
Long live sql render.
-Chris