
Independent development for each database platform?


(Martijn Schuemie) #1

We continue to discuss the need for our tools to support many different database platforms. Recently, some, including @lee_evans and @Chris_Knoll, have suggested that a way to deal with the complexity that comes from supporting many platforms is to split DatabaseConnector and SqlRender into separate packages, one for each dialect.

At first glance this might seem like a good idea. The expected inputs and outputs of these packages are well defined, and if we split them up, different developers could own different platforms and implement the required functionality as they see fit, without having to worry about the other platforms. However, I have some arguments against this approach:

  1. No consistent quality. I know some developers are frustrated that they’ve had to wait for me to review and approve their pull requests. However, in my opinion I was merely preventing calamity later down the road. If we split up the packages, there will be no such oversight. Some of the problems I believe I prevented include:

    • Outright errors that would have prevented all OHDSI applications from working
    • Inefficient implementations that would have led to the worst kind of errors: intermittent ones that, for example, throw an ‘out of heap’ error once every n runs.
    • Subtle changes in the semantics of the translated SQL that would probably have gone unnoticed, but would have forever left us wondering why that particular database was always giving illogical results.
  2. More complex deployment. Already it is pretty hard to write clear instructions on how people should run your study on their data. If we split up the packages there will likely be different instructions for each platform.

  3. Code duplication. Each platform will get its own copy of the code, and these implementations will start to diverge. If we ever want to change something at a higher level (Say, implement an alternative approach to downloading big data), this will be extremely difficult.

  4. DatabaseConnector and SqlRender are just one part of it. Although for platforms that faithfully and completely implement SQL most platform-specific code lives in DatabaseConnector and SqlRender, other platforms don’t play so nicely, and require changes in all of our applications. I don’t think having platform-specific forks of each tool is the answer, for reasons 1-3 listed above.

Let me know if you agree or disagree with any of these arguments.


(Christian Reich) #2

@schuemie:

All arguments are right. But there is one counterargument: in the long run, you will have to go in that direction anyway and modularize things and have them run independently. Because the current situation, where you have code in your analytic module that gets passed all the way through to completely different environments, will not scale. It already does not. You have to debug the entire thing each time somebody makes a tiny change to an environment you don’t really understand and shouldn’t have to.

Actually, the way this will need to go is to split data access completely off and go to a microservice architecture. Your module will have to declare what you need, and the various data technologies will have their own way to fulfill your request efficiently. E.g. in Hadoop you can do Impala, or you have Spark. Huge difference.

In other words: You no longer can afford to be MS-DOS. You are going into Windows, and folks will write their DLLs to your generic operating system. I know, very ancient metaphor. :slight_smile:

But otherwise these debates will never end, and the @schuemie development time will increasingly be consumed by this low-level stuff.


(Martijn Schuemie) #3

@Christian_Reich: I think you’re suggesting some abstraction layer between database and our tools. But I would say we already have that. It’s called ‘SQL’. But when everyone interprets this standard differently, we get the trouble we have right now. I’m not optimistic that adding a new abstraction layer will prove to be better.
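To make the dialect problem concrete: SqlRender keeps one canonical dialect (SQL Server) and translates it to the target platform with replacement rules. Here is a minimal Python sketch of that idea; the two rules shown are invented for illustration and are not SqlRender’s actual rule set.

```python
import re

# Illustrative pattern-based translation rules, loosely modeled on
# SqlRender's approach of rewriting canonical SQL Server SQL for a
# target dialect. These rules are invented, not SqlRender's real ones.
RULES = {
    "postgresql": [
        # ISNULL is SQL Server specific; COALESCE is the standard form
        (r"ISNULL\(([^,]+),\s*([^)]+)\)", r"COALESCE(\1, \2)"),
        # SQL Server temp tables use a # prefix; strip it for Postgres
        (r"#(\w+)", r"\1"),
    ],
}

def translate(sql: str, dialect: str) -> str:
    """Apply each replacement rule for the target dialect in order."""
    for pattern, replacement in RULES.get(dialect, []):
        sql = re.sub(pattern, replacement, sql, flags=re.IGNORECASE)
    return sql

print(translate("SELECT ISNULL(x, 0) INTO #tmp FROM t;", "postgresql"))
# -> SELECT COALESCE(x, 0) INTO tmp FROM t;
```

The point of this design is that one canonical query plus a shared rule table replaces N hand-maintained copies; the rules are written once and reused across every query in every package.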


(Christian Reich) #4

@schuemie:

You are totally right. Except you are wrong, as the reality shows us. That “standard” isn’t a standard. Which is what sucks out an increasing chunk of your time to keep in sync. And SQL for many technologies is a makeshift stopgap less-than-ideal temporary solution to be backwards compatible. It will not scale, and it will not be performant. And - you will always be involved in that stuff, no matter how much you try to draw a line in the sand. :slight_smile:

Look: These technologies are here to stay. An increasing number of institutions are getting into large scale observational research BECAUSE the likes of Google push their technology. I don’t think we have a choice.


(Martijn Schuemie) #5

@Christian_Reich, I’m sorry, you completely lost me. Your tone suggests the solution is obvious, but I am slow to catch on.

I think you are saying we should develop an abstraction layer for performing our complex queries against databases, and that SQL most definitely is not this abstraction layer. But then what is this layer? Another query language? Are you proposing OHDSI develop this new standard from scratch, or is there an existing technology we can build on? How would this new layer do things differently from SQL?


(Christian Reich) #6

@schuemie

Sorry, didn’t mean to sound condescending at all. And I am not the guy who should discuss this with you. This is more for @gregk, @Chris_Knoll, @anthonysena, @Frank, @lee_evans or somebody who thinks about these architectural choices and understands data manipulation at scale.

But: This is how I would think about it:

  1. Take out all the SQL from your R code and wrap it, so it can be called with standard input and standard output. Let’s say you have 50 of those calls that go to the data well.
  2. Then create one module for each database dialect/data warehouse technology and put the 50 subroutines/functions in it.
  3. For each database system create such a module. They could be auto-created using SqlRender, but could also be written de novo. They may have SQL in them, or something completely different, like Spark.
  4. For very complicated stuff, like cohort definitions, whose shape you cannot anticipate, we already have the JSON-based framework @Chris_Knoll created for ATLAS. Mick created a full rendition of that for Spark.
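The per-dialect module idea in the steps above could be sketched as a common interface with one implementation registered per platform. This is a hypothetical Python illustration; the dialect names, the `person_count` query, and the registry are all invented for the example.

```python
from typing import Callable, Dict

# Hypothetical registry: dialect -> {query name -> implementation}.
# Each platform team would own the entries for its own dialect.
MODULES: Dict[str, Dict[str, Callable[..., str]]] = {}

def register(dialect: str, query_name: str):
    """Decorator that files an implementation under (dialect, name)."""
    def wrap(fn):
        MODULES.setdefault(dialect, {})[query_name] = fn
        return fn
    return wrap

@register("sql_server", "person_count")
def person_count_tsql(cdm_schema: str) -> str:
    return f"SELECT COUNT(*) FROM {cdm_schema}.person;"

@register("spark", "person_count")
def person_count_spark(cdm_schema: str) -> str:
    # a Spark module is free to return something other than SQL,
    # e.g. a DataFrame pipeline expression
    return f"spark.table('{cdm_schema}.person').count()"

def run(dialect: str, query_name: str, **kwargs) -> str:
    """The standard entry point the analytic code would call."""
    return MODULES[dialect][query_name](**kwargs)

print(run("sql_server", "person_count", cdm_schema="cdm"))
# -> SELECT COUNT(*) FROM cdm.person;
```

The appeal is isolation: a broken Spark entry cannot affect the T-SQL one. The cost, as Martijn argues below, is that every query now exists once per platform.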

@schuemie would only have to debug his scripts for one module (say T-SQL). Other folks would maintain the other modules, and if they screw up it would have no effect on the package itself or any of the other modules. Each module could have its own optimization strategy, temp tables, indices, whatever. It could even work with a nested data model or another approach to improve performance.

Something like that. I am sure I am missing situations or constellations, but again, I am out of the kitchen RIGHT NOW. :slight_smile:


(Gregory Klebanov) #7

yes, @Christian_Reich - same thoughts here. We need to consider both de-coupling and re-use as our variables.

  1. We can define a common interface and code separate adapters for each database that implement this interface. Nice, and it can work - good decoupling but less re-use.

  2. We can keep the same engine but break the SQL up into database-specific files for each database. By taking SQL out of Java or R, you are not only making it cleaner to support but also enabling it to take advantage of database-specific features. As you said, fundamentally each SQL statement takes the same input parameters and produces the same output. So you can imagine keeping all SQL in some kind of JSON file, referring to each statement by a unique ID (the same ID for the same SQL statement across all dialects), then feeding it parameters and reading the output into a generic data transfer structure. Yes, you need to write each SQL statement separately, but we kind of do that today by coding translation rules anyway. And you are also right that this approach could enable NoSQL databases, since you are feeding parameters in and getting data out - what is behind is magic. But in that case we would need to combine #1 and #2.
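Option 2 above could look something like the following sketch: one JSON document holding the same statement under the same ID for every dialect, with simple parameter substitution. The store contents, the `@cdm` parameter convention, and the `person_count` ID are invented for illustration.

```python
import json

# Hypothetical store: one ID per logical statement, one entry per
# dialect. In practice this would live in a file shipped with the
# package rather than an inline string.
SQL_STORE = json.loads("""
{
  "person_count": {
    "sql_server": "SELECT COUNT(*) FROM @cdm.person",
    "bigquery":   "SELECT COUNT(*) FROM `@cdm.person`"
  }
}
""")

def get_sql(statement_id: str, dialect: str, **params: str) -> str:
    """Look up a statement by ID and dialect, then fill parameters."""
    sql = SQL_STORE[statement_id][dialect]
    for name, value in params.items():
        sql = sql.replace("@" + name, value)
    return sql

print(get_sql("person_count", "bigquery", cdm="my_dataset"))
# -> SELECT COUNT(*) FROM `my_dataset.person`
```

The engine stays shared; only the per-dialect strings differ. The trade-off is the same duplication concern raised later in the thread: every statement must be authored and kept in sync once per dialect.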

We actually did something exactly like this some (long) time ago - it was very nice and clean. I would love to share more ideas and info on that.

All that said, I do not want to underestimate the complexity of the OHDSI ecosystem - I totally understand @schuemie’s concerns. I think we need a good architecture brainstorming session to look at all the options - I am sure that with so many experienced people and ideas we can come up with something good in the end.


(Martijn Schuemie) #8

Thanks @Christian_Reich, I now understand what you mean. And no, it won’t work, for two reasons:

  1. You are underestimating the diversity of queries. Achilles alone has 374 SQL files. And with every new query we write while developing our software we’d get a new ‘module’. This would be impossible to maintain.

  2. Having separate implementations for each platform would mean massive code duplication, with all the problems that brings.

Note that the approach you and Greg appear to be proposing is actually already available in our tool stack. A long time ago, when SqlRender was first introduced, people did not like it at all and also argued we’d need platform-specific code. Because of this, all our SQL code now sits in sql/sql_server folders in the respective packages. You want platform-specific SQL? Just create a sql/bigquery folder, and have SQL files with the same names as in the sql/sql_server folder, and they will automatically be picked up by the package when running on BigQuery. But nobody has ever used this functionality, because the code duplication would be horrible. Imagine 374 BigQuery-specific SQL files in Achilles. Whenever the Achilles developers wanted to make a change to any of these, they would need to do it in two places.
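The folder-fallback lookup described above can be sketched in a few lines; this is an illustration of the resolution logic, not the actual SqlRender implementation, and the paths are just the conventions named in the post.

```python
from pathlib import Path

def resolve_sql_file(package_root: str, filename: str, dialect: str) -> Path:
    """Prefer a hand-written dialect-specific file; otherwise fall
    back to the canonical sql/sql_server version, which would then be
    run through translation for the target dialect."""
    dialect_path = Path(package_root) / "sql" / dialect / filename
    if dialect_path.exists():
        return dialect_path  # platform-specific override, used as-is
    return Path(package_root) / "sql" / "sql_server" / filename
```

The fallback is what keeps duplication optional: a package carries one canonical copy of each query, and a dialect folder only needs a file when translation alone is not good enough.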


(Martijn Schuemie) #9

Ok, let me wrap up this discussion. We are in agreement on the need to support multiple database platforms. In another thread @Patrick_Ryan outlined the steps that need to be taken to make sure we can deal with the increased complexity this brings. I fully support those steps, and I know people have been tasked to move them forward.

In this thread, I mentioned that some people proposed a complementary strategy of splitting the code per database platform. I’ve outlined arguments against, and haven’t seen my arguments invalidated, so that proposal is off the table (for now).

I think that through DatabaseConnector and SqlRender we’ve reached the maximum decoupling between query logic and technology platform that we can achieve for our kind of software. We do advanced analytics, and offload a lot of those analytics to the database server, so many database interactions are complex and unique to their analytic purpose. One key component of Patrick’s plan is defining the minimum requirements a platform should meet for us to support it, and this is essential because those requirements will define the limit of our analytics. And I don’t want to sacrifice fidelity of analytics simply to support more tech.

