Let's talk about Persistance

Chris_Knoll · February 7, 2015, 5:44pm

The other thread I started (Managing Database Migrations), I was asking about manging database updates between releases. I’d like to start a discussion here about choosing a framework to manage our persistence of our custom domain objects that we’ll be developing for our apps. Examples: Protocol Definitions, Cohort Definitions, Concept Sets, User Profiles (if we create our own), basically anything that we’re going to need to define to support our applications.

I’d also like to draw a line here between sql that runs analyses, or cohort building logic, or special method logic, vs. the sql that our apps use to save state to the database. I’d like to not use sqlRender to manage the CRUD operatons that we will need in our apps. I think there’s better tools that abstract the database layer for us.

So, on the note of dabase abstraction layers, it seems the Java Persistence API (JPA) is the way to go, and there is a few recent implementations of it that supports the latest standard (which I believe is 2.1). There also seems to be some variation in the functionality of the implementations out there, from simple and easy to implement to robust (think: lazy loading, relations support, etc) but could be a bit trickier to manage.

I do have experience with Entity Framework, and while that’s not a Java technology, it has similar concepts such as entity repositories, annotiaons to describe field mappings, etc.

I’d like to hear what people would recommend to use, and what they think would be the quickest to implement. We’re at the point in CIRCE and HERACLES development that we’re going to need to save/load data from the database, and I’m not sure a decision has been made yet on how that will happen.

-Chris

jon_duke · February 7, 2015, 7:56pm

Chris,

I can’t offer a technical recommendation here, but it seems that this again stirs the question about whether we should ultimately support OHDSI applications living both independently (i.e., build your own WebAPI and download the various apps you want) and as a single release (e.g., launchable WAR containing embedded server, WebAPI, launcher, apps, etc).

In the latter scenario, we can be confident that users of Hermes, Circe, Heracles, etc will be working in a known environment (because we built it for them) and we can deploy an embedded database like H2 to persist certain application level stuff. With each release, we add new applications, add the tables to support them, add any new configuration requirements, etc.

The original idea behind OLYMPUS was a unifying environment to help simplify deployment for users who may be less technically sophisticated. But practically, having such a release could provide value to all users and make development a bit easier because we control the whole show. The downside of course is that our current model of downloading the WebAPI, building it, getting the HTML/JS for an app like Hermes or Circe, configuring, and hitting index.html, would be trickier to pull off as things get more complicated. New applications wouldn’t be as easily runnable until they were included in a release.

So I’m sure there are a number of ways to skin this cat. But I’d like to throw out there the question of, do we envision the long-term need to support people who just want to download and run a few parts of our platform? If we can drive people to installing a unified base, and just using the apps they want, it may make things better for everyone.

Jon

Chris_Knoll · February 7, 2015, 10:37pm

Hi, Jon,
Put simply, this is just a question of what we want to see in code. Do we write insert statements by hand, or do we depend on an object-persistence framework.

Do we write 'insert into XX (col1, col2, col3, col4) values (a, b, c ‘speciallyEscapedSqlValueThatDependsOnSpecialDBSpecfificImplementation’)

Or do we want to see something like this:

newCohortDef = new CohortDefiniton(id, name, syntax);
CohortDefinitionRepository.save(newCohortDef);

Whether the physical datastore is H2, derby, mssql, oracle, etc, the question about using JPA abstracts all of these details from us. So, to your point about different user experiences or technical challenges of getting the applications set up, in any context you can think of, we’re going to need to be able to save data to database, load data from database. In this question, I’m asking that we think of these save/loads separately from the sqlRender scripts that you’ve seen for Heracles or the sql i’m building for Circe. these are special report-type (or algorithm) scripts, not dealing with the static domain objects that we’re going to have in each application.

Hope that helps.

-Chris

aguynamedryan · February 7, 2015, 11:26pm

There is a middle ground. On the spectrum of pure SQL on one end and pure ORM on the other, there are frameworks/libraries like Ruby’s Sequel (http://sequel.jeremyevans.net/) and Python’s SQLAlchemy (http://www.sqlalchemy.org/). I’m not sure where JPA sits on this spectrum (I’m stuck at an airport with awful wifi). At one time I think the CIRCES group discussed jOOQ (http://www.jooq.org/) which I assume is similar to Sequel.

Basically these libraries allow a programmer to build a query through calls to the library, and then the SQL is translated to the correct SQL dialect when it’s time to execute the query. You can write SQL that is much more performant than an ORM generates, and not worry about the nuances of quoting/escaping the SQL for each different RDBMS.

For situations where we are faced with writing SQL from within code, I think we should seek to use one of these kinds of libraries.

Ryan

alfranke · February 7, 2015, 11:47pm

I would vote definitely JPA (hibernate as JPA provider) with Spring Data (take a look at CrudRepository).

http://docs.spring.io/spring-data/data-commons/docs/1.6.1.RELEASE/reference/html/repositories.html

Chris_Knoll · February 8, 2015, 4:44pm

@aguynamedryan, Right. for writing complex, cross platform queries, those tools you describe would work for that. Has constructs in it to represent joins, etc etc that allow you to construct queries.

For purposes of this discussion, I’d like to focus on on the specific database interaction of saving a discreet unit of information (call it a model, entity, domain object, whatever), and the framework that will make it possible. From the context of Jooq, sequel, sql alchemy, it’s all about creating sql statements that will transform into the target dialect, but at the end of the day, you still need to construct the insert statements, and manage the order of inserts (in cases where there are 1-many relationships and referential integrity,etc). In the frameworks that@alfranke is referencing (and that I’d like us to investigate) you don’t write any direct sql logic at all, you just represent your objects and define how they map to a table, and the framework works out order of inserts, transforms to dialects, etc.

I think that there is a possible separate discussion here about creating cross-platform sql batch statements, and that would be a fun topic, but for here, the immediate need as I see it is that we need a way to perform CRUD operations on our database saving 2 immediate domain objects that have been defined: cohortDefinitions, and concept sets. I’m not sure what Jon’s team has worked out as far as saving and loading cohort definitions (I saw code that rowmaps out of a table query, but i’m not sure if that’s finalized or a stub for something I’m supposed to come up with). So, that’s why I’m here now, asking for people’s experiences with ORM-like related tech.

@alfranke, since we’re already on spring boot, is it an easy lift to incorporate hibernate? I’ve used tools like hibernate before, so I can get up to speed quickly, my only question about is is that I thought there were multiple JPA providers under spring, and was wondering why you’d pick Hibernate over another one?

-Chris

cahilton · February 8, 2015, 5:48pm

@Chris_Knoll The stuff we have for cohort definition is just what we used to get started, and it followed the pattern that the rest of the project was using. I think our understanding from our first conversation was that OHDSI was using SqlRender over JPA, so just trying to follow that pattern. If the we use SqlRender only for analyses, I think that may be a good logical division, and might open the door for a more common platform for modelling the CDM that @alfranke has brought up before.

alfranke · February 8, 2015, 6:18pm

Very easy to incorporate Spring Data / JPA. Hibernate is pretty much the de facto standard.
I can add some reference usage into the example application package that I introduced into the spring batch (job server) branch.

What do you think about me adding in the property to disable the spring batch ddl auto gen and merging in the job server support pull request? The thought is that I could create separate branches for database migration support and JPA, so that the pull requests are a bit smaller in scope.

Chris_Knoll · February 8, 2015, 6:52pm

I think what your’e describing is the way forward, so let’s just pack it up and merge it in.

@aguynamedryan, I think there’s still a place for the tools you mentioned, I think as the application gets larger and we identify areas for code refactoring, we might need some utlities that help import data or refactor data around. Sounds like a migration management thing, but it could be mroe than that, and it may involve more complex sql gymnatics to peform, and having a tool like sequel or jooq to abstract those sql ops in a database agnostic form, would be useful! But when the time comes, we’ll discuss it.

For now…version 0.0.1, ho!

-Chris

Frank · February 8, 2015, 9:38pm

Will this framework create tables that don’t exist before attempting to persist the entities?

alfranke · February 8, 2015, 10:53pm

Only if ddl-auto is enabled.

Frank · February 9, 2015, 5:14am

Ok. Will we have it enabled?

alfranke · February 9, 2015, 5:39am

Well, given the direction the thread about database migrations is going, I think not. I think we will defer ddl to the database migration/versioning tool (e.g. Flyway), possibly under a separate “admin” like datasource. This direction implies we will specify the ddl for each vendor in a resource file, and the tool will create/modify the db objects appropriately, either upon startup or separate action (e.g. maven goal to “migrate” database).
Ddl-auto is rarely used in production deployments, though it is useful in development. So, it will likely be a property we can toggle during development.

Chris_Knoll · February 23, 2015, 3:58pm

Ok, I’ve played with the jpa branch locally, and I think I have a grasp of the basics of how to define a JPA entity for persistence:

Create a class, tag it with @Entity specifying the table name to persist the entity into in the name attribute.
Make a repository class that extends the spring framework’s CrudRepository interface. It appears that you can extend the CrudRepository to add other methods (in Widget example: findByNameContainingIgnoreCase, but i do not see where this is implemented.)
Use the repository by specifying a @Autowired field of type {repository Class}. I’m not exactly clear what work the @Autowired attribute does in this context, considering it is referencing an interface.

I’m assuming some of the magic here is being performed by the spring framework using reflection and runtime implementation of some of these abstract methods. I’m not sure I need to know how it works… Also, I don’t know yet how to mange entity associations (one to many relationships) but for now, I don’t think we actually have that case that we need to solve.

With this pull request, we have an approach for introducing persisted entities into the webAPI. I’d like to jump right on that by persisting cohort definitions, and I know @cahilton has a need to persist cohort generation results. Not sure if cohort characterization results should be entities, or even the materialized cohorts. What is everyone’s plan for JPA entities?

-Chris

Chris_Knoll · February 24, 2015, 3:42pm

@alfranke, I’m working with the example app for persisting ‘widgets’ to the database created by flyway. I believe there’s some mis-configuration related to how the identity column is being used, but I can’t see the generated SQL for the insert statement (even tho i see spring.jpa.show-sql = true in the applicaton.properties):

I’m posting to (but for saves shouldn’t this be PUT? it is @Post in the function, however)
http://localhost:8084/WebAPI/example/widget

Payload is application/json:
{
“name”:“Test Widget”
}

I can trip a breakpoint in the /wiget (POST) handler, the object is serialized up with a NULL id, and then repository.save() is called. Assuming that a null ID means ‘create new’, i’d expect it to just create.

Error returned in the output:
2015-02-24 10:26:39.027 ERROR http-nio-8084-exec-12 org.hibernate.engine.jdbc.spi.SqlExceptionHelper - - Cannot insert the value NULL into column ‘ID’, table ‘CDM_TRUVEN_CCAE_6k_V5_OHDSI.dbo.EXAMPLEAPP_WIDGET’; column does not allow nulls. INSERT fails.

Ok, so the Widget has this defined on the ID field:
@Id
@GeneratedValue
private Long id;

It is unclear to me which generation strategy will be adopted by the JPA provider. It looks like oracle will use sequencers, mssql will use an identity column. I’m considered changing it to have a generation strategy of IDENTITY, but I think that would break what you have in the oracle implementation.

Another issue is that I can’t see the generated sql to determine what the INSERT looks like, even tho the show sql settings is true. Do you know why it’s not showing me the SQL?

And a final annoyance here is that instead of receiving some 500 error from the service, I’m getting a 404.Any idea why this is? I’d much rather get ‘there was a server error’ status vs. ‘this resource was not found, but otherwise everything is fine’.

-Chris

Chris_Knoll · February 24, 2015, 5:19pm

@alfranke, update: I set the ID column in WIDGET table to be Identity, and it inserted the value (and injected the newly generated ID back into the object in the result, which is nice). Just the sqlServer migration scripts is incorrect (in that it creates a hibernate_sequence table, and the widget talbe id column isn’t identity).

I think we need to work out a standard…

alfranke · February 24, 2015, 5:37pm

I thought that was bizarre when I saw that in the sql server scripts. I basically looked at the Spring Batch scripts for tables and sequences syntax for the non-oracle databases and for sql server they were using tables instead of sequences in oracle, so I figured that must be how sql server works. Brings up a question about whether we should have a test suite to perform across db vendors…
So, to clarify, you changed the db/migration/sqlserver jpa script to specify a sequence for hibernate_sequence? Looks like sequences are new to sql server as of sqlserver2012? Or did you keep the hibernate_sequence as a table and added identity to the EXAMPLEWIDGET.ID column?

Chris_Knoll · February 24, 2015, 6:41pm

I added identity to the example_widget.id column directly, haven’t figured out what to do about the migration scripts yet, It appears for mssql server using the default @GenerationStrategy, i believe it uses IDENTITY. Therefore, i thnk the hibernate_sequence is unecessary here.

However: in sql 2012, it appears that sequences are supported. So we have a choice: do we explicitly state the generation strategy in our entites (and use GenerationStrategy.SEQUENCE and we’ll have limited support for which databases support sequences (which I think everone does that we care about?), Or we can use a Table generation strategy with a HiLo hint where the JPA manager maintains a pool of sequences that’s maintained in a database table. This would be the most cross platform approach, but wouldn’t leverage any database level optimizations (same can be said about not using IDENTITY).

The other challenge that coems out of this (and we have to cross this bridge sometime) is dealing with pre-existing database entities that flyway created and how to manipulate them after deployment. I think the’d recommend you just stack ALTERs on your scripts to manipulate the stuff. But if we’re making a lot of little tweaks it’s gonna feel noisy. This isnt’ a liquibase vs. flyway issue, this is just what’s going to mass over time with migration scripts. Not sure if I should really worry about it, it’s kinda nice to have a database changelog…

-Chris

alfranke · February 24, 2015, 8:07pm

I fixed the show-sql issue. Thanks for pointing that out.
The 404 looks like it is the default jersey exception/status mapping for unknown errors…? I tried adding a general exceptionmapper but it resulted in every status being 500, which is undesirable. So, we may have to map specific high level exceptions (JPA & Jdbc root level). What was the exception you were getting (e.g. javax.persistence.PersistenceException) in the logs?
https://jersey.java.net/nonav/documentation/1.12/jax-rs.html#d4e435

Regarding Flyway, though I agree the changelog could grow quite large, it is that which provides the ability to do auto migrations. Flyway definitely recommends not making any changes directly to the DB.

alfranke · February 24, 2015, 8:22pm

Regarding JPA Entities… I think the @GeneratedValue default (AUTO) is probably good. Looks like it was my mistake in the migration script.
To test your change (adding identity to @Id column), you will need to delete the schema_version row that contains the jpa migration, and delete the objects created by the script (reference db/migration/sqlserver/rollback/1.0.0.2__schema_drop.sql).