
Let's talk about Persistence

@aguynamedryan, Right. For writing complex, cross-platform queries, the tools you describe would work: they have constructs to represent joins, etc., that let you construct queries.

For purposes of this discussion, I'd like to focus on the specific database interaction of saving a discrete unit of information (call it a model, entity, domain object, whatever), and the framework that will make it possible. From the context of jOOQ, Sequel, SQLAlchemy, it's all about creating SQL statements that will transform into the target dialect, but at the end of the day you still need to construct the insert statements and manage the order of inserts (in cases where there are 1-many relationships, referential integrity, etc). In the frameworks that @alfranke is referencing (and that I'd like us to investigate) you don't write any direct SQL logic at all; you just represent your objects and define how they map to a table, and the framework works out the order of inserts, transforms to dialects, etc.

I think there is a possible separate discussion here about creating cross-platform SQL batch statements, and that would be a fun topic, but the immediate need as I see it is that we need a way to perform CRUD operations on our database, saving the two domain objects that have already been defined: cohort definitions and concept sets. I'm not sure what Jon's team has worked out as far as saving and loading cohort definitions (I saw code that row-maps out of a table query, but I'm not sure if that's finalized or a stub for something I'm supposed to come up with). So, that's why I'm here now, asking for people's experiences with ORM-related tech.

@alfranke, since we’re already on spring boot, is it an easy lift to incorporate hibernate? I’ve used tools like hibernate before, so I can get up to speed quickly, my only question about is is that I thought there were multiple JPA providers under spring, and was wondering why you’d pick Hibernate over another one?

-Chris

@Chris_Knoll The stuff we have for cohort definition is just what we used to get started, and it followed the pattern that the rest of the project was using. I think our understanding from our first conversation was that OHDSI was using SqlRender over JPA, so we were just trying to follow that pattern. If we use SqlRender only for analyses, I think that may be a good logical division, and it might open the door for a more common platform for modelling the CDM that @alfranke has brought up before.

Very easy to incorporate Spring Data / JPA. Hibernate is pretty much the de facto standard.
I can add some reference usage into the example application package that I introduced into the spring batch (job server) branch.

What do you think about me adding in the property to disable the spring batch ddl auto gen and merging in the job server support pull request? The thought is that I could create separate branches for database migration support and JPA, so that the pull requests are a bit smaller in scope.

I think what you're describing is the way forward, so let's just pack it up and merge it in.

@aguynamedryan, I think there's still a place for the tools you mentioned. As the application gets larger and we identify areas for code refactoring, we might need some utilities that help import data or move data around. Sounds like a migration management thing, but it could be more than that, and it may involve more complex SQL gymnastics to perform; having a tool like Sequel or jOOQ to abstract those SQL ops in a database-agnostic form would be useful! But when the time comes, we'll discuss it.

For now…version 0.0.1, ho!

-Chris

Will this framework create tables that don’t exist before attempting to persist the entities?

Only if ddl-auto is enabled.

Ok. Will we have it enabled?

Well, given the direction the thread about database migrations is going, I think not. I think we will defer ddl to the database migration/versioning tool (e.g. Flyway), possibly under a separate “admin” like datasource. This direction implies we will specify the ddl for each vendor in a resource file, and the tool will create/modify the db objects appropriately, either upon startup or separate action (e.g. maven goal to “migrate” database).
Ddl-auto is rarely used in production deployments, though it is useful in development. So, it will likely be a property we can toggle during development.

Ok, I’ve played with the jpa branch locally, and I think I have a grasp of the basics of how to define a JPA entity for persistence:

  1. Create a class and tag it with @Entity, specifying the table name to persist the entity into in the name attribute.
  2. Make a repository interface that extends the Spring framework's CrudRepository interface. It appears that you can extend the CrudRepository to add other methods (in the Widget example: findByNameContainingIgnoreCase, but I do not see where this is implemented).
  3. Use the repository by declaring an @Autowired field of type {repository class}. I'm not exactly clear what work the @Autowired annotation does in this context, considering it is referencing an interface.

I'm assuming some of the magic here is being performed by the Spring framework using reflection and runtime implementation of some of these abstract methods. I'm not sure I need to know how it works… Also, I don't know yet how to manage entity associations (one-to-many relationships), but for now I don't think we actually have a case like that to solve. A rough sketch of the pattern is below.
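
To check my understanding, here's a rough sketch of those three steps in one place, loosely based on the example app's Widget (treat the names as illustrative and each class as living in its own file; this is not the actual branch code):

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.repository.CrudRepository;

@Entity(name = "EXAMPLEAPP_WIDGET") // per step 1; @Table(name = ...) is the other common way to pin the table name
public class Widget {
    @Id
    @GeneratedValue
    private Long id;

    private String name;

    public Long getId() { return id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// Step 2: no implementation class needed; Spring Data derives the query from
// the method name and generates a proxy implementation at runtime.
public interface WidgetRepository extends CrudRepository<Widget, Long> {
    List<Widget> findByNameContainingIgnoreCase(String name);
}

// Step 3: in a service/resource class, the @Autowired field receives that runtime proxy.
public class WidgetService {
    @Autowired
    private WidgetRepository widgetRepository;

    public Widget create(String name) {
        Widget w = new Widget();
        w.setName(name);
        return widgetRepository.save(w); // null id means insert; the generated id comes back on the returned entity
    }
}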

With this pull request, we have an approach for introducing persisted entities into the WebAPI. I'd like to jump right on that by persisting cohort definitions, and I know @cahilton has a need to persist cohort generation results. Not sure if cohort characterization results should be entities, or even the materialized cohorts. What is everyone's plan for JPA entities?

-Chris

@alfranke, I'm working with the example app for persisting 'widgets' to the database created by Flyway. I believe there's some misconfiguration related to how the identity column is being used, but I can't see the generated SQL for the insert statement (even though I see spring.jpa.show-sql = true in the application.properties):

I'm posting to the following URL (but for saves shouldn't this be PUT? the function is annotated @POST, however):
http://localhost:8084/WebAPI/example/widget

Payload is application/json:
{
"name": "Test Widget"
}

I can trip a breakpoint in the /widget (POST) handler; the object is deserialized with a NULL id, and then repository.save() is called. Assuming that a null ID means 'create new', I'd expect it to just create.

Error returned in the output:
2015-02-24 10:26:39.027 ERROR http-nio-8084-exec-12 org.hibernate.engine.jdbc.spi.SqlExceptionHelper - - Cannot insert the value NULL into column 'ID', table 'CDM_TRUVEN_CCAE_6k_V5_OHDSI.dbo.EXAMPLEAPP_WIDGET'; column does not allow nulls. INSERT fails.

Ok, so the Widget has this defined on the ID field:
@Id
@GeneratedValue
private Long id;

It is unclear to me which generation strategy will be adopted by the JPA provider. It looks like Oracle will use sequences, and MSSQL will use an identity column. I'm considering changing it to a generation strategy of IDENTITY, but I think that would break what you have in the Oracle implementation.
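
Just to be concrete, the change I'm considering would be something like this (only a sketch, not pushed anywhere):

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY) // javax.persistence.GenerationType; forces identity-column generation instead of the AUTO default
private Long id;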

Another issue is that I can't see the generated SQL to determine what the INSERT looks like, even though the show-sql setting is true. Do you know why it's not showing me the SQL?

And a final annoyance here is that instead of receiving some 500 error from the service, I'm getting a 404. Any idea why this is? I'd much rather get a 'there was a server error' status vs. 'this resource was not found, but otherwise everything is fine'.

-Chris

@alfranke, update: I set the ID column in the WIDGET table to be identity, and it inserted the value (and injected the newly generated ID back into the object in the result, which is nice). It's just that the SQL Server migration script is incorrect (in that it creates a hibernate_sequence table, and the widget table's id column isn't identity).

I think we need to work out a standard…

I thought that was bizarre when I saw that in the SQL Server scripts. I basically looked at the Spring Batch scripts for the table/sequence syntax on the non-Oracle databases, and for SQL Server they were using tables where Oracle uses sequences, so I figured that must be how SQL Server works. Brings up a question about whether we should have a test suite to run across db vendors…
So, to clarify, you changed the db/migration/sqlserver JPA script to specify a sequence for hibernate_sequence? Looks like sequences are new to SQL Server as of SQL Server 2012? Or did you keep the hibernate_sequence as a table and add identity to the EXAMPLEWIDGET.ID column?

I added identity to the example_widget.id column directly; I haven't figured out what to do about the migration scripts yet. It appears that for MSSQL Server, using the default @GeneratedValue strategy, it uses IDENTITY. Therefore, I think the hibernate_sequence is unnecessary here.

However: in SQL Server 2012, it appears that sequences are supported. So we have a choice: we can explicitly state the generation strategy in our entities (use GenerationType.SEQUENCE, and we'll be limited to databases that support sequences, which I think everyone we care about does), or we can use a table generation strategy with a HiLo hint, where the JPA provider maintains a pool of ids in a database table. The table approach would be the most cross-platform, but it wouldn't leverage any database-level optimizations (the same can be said about not using IDENTITY).
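
Roughly, the two options would look like this (hypothetical entity and generator names, just to illustrate the trade-off; you'd pick one strategy per entity):

// Option A: native database sequence (Oracle, PostgreSQL, SQL Server 2012+)
@Entity
class SequenceWidget {
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "widget_seq")
    @SequenceGenerator(name = "widget_seq", sequenceName = "WIDGET_SEQ", allocationSize = 50)
    private Long id;
}

// Option B: table-backed generator; portable to vendors without sequences,
// but the id pool lives in an ordinary table rather than a native sequence or identity column
@Entity
class TableWidget {
    @Id
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "widget_tbl")
    @TableGenerator(name = "widget_tbl", table = "hibernate_sequences",
            pkColumnName = "sequence_name", valueColumnName = "next_val",
            allocationSize = 50)
    private Long id;
}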

The other challenge that comes out of this (and we have to cross this bridge sometime) is dealing with pre-existing database entities that Flyway created, and how to manipulate them after deployment. I think they'd recommend you just stack ALTERs in your scripts to manipulate the stuff. But if we're making a lot of little tweaks, it's gonna feel noisy. This isn't a Liquibase vs. Flyway issue; this is just what's going to amass over time with migration scripts. Not sure if I should really worry about it; it's kinda nice to have a database changelog…

-Chris

I fixed the show-sql issue. Thanks for pointing that out.
The 404 looks like it is the default Jersey exception/status mapping for unknown errors…? I tried adding a general ExceptionMapper, but it resulted in every status being 500, which is undesirable. So, we may have to map specific high-level exceptions (JPA & JDBC root level); a sketch of that idea is below. What was the exception you were getting (e.g. javax.persistence.PersistenceException) in the logs?
https://jersey.java.net/nonav/documentation/1.12/jax-rs.html#d4e435
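
Something along these lines might work, assuming we map just the JPA root exception (class name and placement are made up for illustration; this isn't in the branch yet):

import javax.persistence.PersistenceException;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ExceptionMapper;
import javax.ws.rs.ext.Provider;

// Maps persistence failures to a 500 instead of letting them surface as a 404,
// while leaving other statuses (e.g. genuine 404s for unknown resources) alone.
@Provider
public class PersistenceExceptionMapper implements ExceptionMapper<PersistenceException> {
    @Override
    public Response toResponse(PersistenceException e) {
        return Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                .entity(e.getMessage())
                .build();
    }
}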

Regarding Flyway, though I agree the changelog could grow quite large, it is that which provides the ability to do auto migrations. Flyway definitely recommends not making any changes directly to the DB.

Regarding JPA Entities… I think the @GeneratedValue default (AUTO) is probably good. Looks like it was my mistake in the migration script.
To test your change (adding identity to @Id column), you will need to delete the schema_version row that contains the jpa migration, and delete the objects created by the script (reference db/migration/sqlserver/rollback/1.0.0.2__schema_drop.sql).

The exception chain was:
javax.servlet.ServletException:
Caused by: javax.persistence.PersistenceException:
Caused by: org.hibernate.exception.ConstraintViolationException: could not execute statement
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot insert the value NULL into column 'ID', table 'CDM_TRUVEN_CCAE_6k_V5_OHDSI.dbo.EXAMPLEAPP_WIDGET'; column does not allow nulls. INSERT fails.

Re: Flyway logs. I just pushed a SQL Server-specific migration update (version 1.0.0.2.1) to set the identity column (while preserving any existing data) and to not do anything at all if the column is already identity. Also dropped the HIBERNATE_SEQUENCE table because I don't think we need it (and it only tries to drop it if it exists).

As far as subsequent Flyway migrations, I'm wondering if it makes sense to approach database updates the same way as feature branches (assuming your feature requires db changes):

  1. Make your feature branch.
  2. Make a pull request out of the new branch. Put in the comments what the expected database changes are, and 'claim' a schema version number (I'd vote for a numbering scheme of {major}.{minor}.{release}.{feature_id}). Include in this pull request a blank file with the associated schema revision number (this is so that other devs working on other feature branches who also need schema changes can pick a migration file name that doesn't conflict).
  3. Start working on the feature. As more and more schema updates are made, the developer with the feature branch will need to remember to delete the schema_version row for this feature so that all the database changes stay up to date between pulls. Make the migration script idempotent so that you don't have to roll back all the current feature-branch schema changes just to amend the new ones.
  4. Once the feature branch is complete and it gets merged in, the single feature DDL script will be folded into master, and other devs will only see one schema migration update for the new feature. I think this is better than having multiple files.

So, I sorta demonstrated this with my most recent check-in (introduced a new migration version 1.0.0.2.1 to address this update and made it re-runnable). There are 2 units of work in this 1 migration script: migrate the ID column to identity(1,1) and drop the hibernate_sequence table. For feature branches like the cohort definition tables that Charity and I will work on, I'll put that under 1.0.0.2.2 (or 1.0.0.3.1 if we think this is a different release… version numbering schemes are very stylistic…)

One last thing: I also enabled 'out of order' migrations so that when feature branches come in out of order, we will still be able to apply the migration scripts.

Here’s the article I read about this technique:

-Chris

I wasn't thinking you would add an alter script to fix the bad column and drop the table, since I thought you might be the only ones using SQL Server and could just "repair" your "failed" migration. But it brings up an interesting topic… and with that said, it's probably the right call and a good habit to get into.

Not sure I grasp yet your comment about the ‘blank file with the associated schema revision number’.

Interesting article. Not sure I understand the use case to support re-running the migration scripts and making them idempotent. I thought the idea is to run once and only once (rollback manually if necessary and reapply). This seems like unnecessary overhead and less concise, but that’s my opinion…

I understand why you added the 'out of order' flag, but I thought I would refer readers to this page so that the consequences of running with 'out of order' support are understood.

Ok, I’ll read it. thanks!

The use case is if you have a complicated feature build that adds, say, 3 tables, alters 2 columns in some other tables, and adds a couple of indexes. Let's say everything gets done perfectly except for one stupid typo in the last index statement. Then, in order to get your 3 tables, 2 columns, and that last index to 're-run' from migration, you have to drop 3 tables, un-alter 2 columns, and drop the indexes that did get created.

Instead, if each part of the migration said 'if the table does not exist, create it', 'if the column isn't X, alter it', and 'if the index doesn't exist, create it', then you wouldn't have to unroll the other seven changes just to fix that last one.

I'm speaking from very recent experience trying to get the identity column to convert without losing data in the table in my last push: without making it idempotent, I had to open SQL Management Studio, alter the table back to its original form, then delete the schema record to have the migration apply again. Then there was something else I messed up (but it did get to the part about altering the table), so to 'try again' I had to unroll the part of the migration that did work so that I could re-apply it. It's just an idea; if it doesn't seem helpful, no one is forcing anyone to write the migration scripts that way. Just a suggestion…

Out of order isn't without its pitfalls. I understand that if the order of migration matters, you could wind up in a world of hurt. That's just something we'd all need to be aware of. Or we just disable out-of-order migrations and everyone works within a single 'database migration branch' so that everyone is synced to the same order.

Re: blank file: that was just a way to put in a 'placeholder' so that people know which feature branch maps to which migration file. If two branches accidentally used the same file name, you'd have a merge, and I don't know what that would do (maybe it would be fine and both migrations would get smooshed together, but I don't trust the merging algorithm to understand that the files merged together should be considered 'appended' to each other and not intermingled).

-Chris

mmmmhmmmmm … this is me rubbing my temples. Why did you have to show me that article? :flushed:

So, they make a good point. Perhaps the answer is that we specify outOfOrder = FALSE as the default, but in hotfix deploys you specify outOfOrder = TRUE. I believe that if Flyway sees a new migration script whose version number is behind the 'latest' version in the schema_version table, it ignores it.

I’ll do some experiments with it and see what I come up with.

Entity Framework is going to become part of PCORnet architecture. Just an FYI.
