
Transaction vs Session Scope for Global Temp Table Statements

While executing output produced by SqlRenderer/Translate, which contained many statements involving Global Temp Tables (GTTs), we began to notice session locking in Oracle.

I believe this has to do with connection pooling and Oracle-specific GTTs. Statements produced by SqlRenderer/Translate use “on commit preserve rows”. From what I’ve read, in Oracle this should be “on commit delete rows”, so that cleanup happens at the transaction level rather than the session level. Sessions are being reused, and I believe this is likely causing the contention we are seeing.
Options include:

  • Changing Oracle-specific GTT statements within SqlRenderer/Translate
    to use “on commit delete rows” – to specify transaction-scoped GTTs.
  • Changing WebAPI to not use connection pooling and open/close
    connection for every operation.

Ideally, we would NOT want to restrict clients of SqlRenderer/Translate from using connection pooling, so what is the feasibility of changing to transaction-scoped GTTs instead of session-scoped?
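For reference, here is a minimal JDBC sketch contrasting the two GTT flavors (connection details are placeholders; assumes java.sql.Connection, DriverManager, and Statement imports):

    // Sketch only: session-scoped vs transaction-scoped GTT rows in Oracle.
    try (Connection conn = DriverManager.getConnection(url, user, pass);
         Statement stmt = conn.createStatement()) {
        conn.setAutoCommit(false);
        // Session-scoped rows: survive commits, cleared only when the session ends.
        stmt.execute("CREATE GLOBAL TEMPORARY TABLE gtt_session (x INT) ON COMMIT PRESERVE ROWS");
        // Transaction-scoped rows: cleared on every COMMIT or ROLLBACK.
        stmt.execute("CREATE GLOBAL TEMPORARY TABLE gtt_txn (x INT) ON COMMIT DELETE ROWS");
        stmt.executeUpdate("INSERT INTO gtt_session VALUES (1)");
        stmt.executeUpdate("INSERT INTO gtt_txn VALUES (1)");
        conn.commit();
        // After the commit, gtt_session still holds its row; gtt_txn is empty.
        // A pooled connection that gets reused carries the gtt_session rows with it,
        // which is the cross-request contamination described above.
    }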

I found this discussion comparing SQL Server temp tables and Oracle global temp tables; it’s not really about which is better, just about how each works.

(No idea why the preview says ‘do you need temporary tables?’, but it’s about temp tables in Oracle and SQL Server)…

I know you have the knowledge and expertise with Oracle GTTs, Alex. I’m thinking the challenge here is figuring out how to construct a templateSql statement that will respect the semantics of both.

I think in Oracle the intention is that temporary tables are created up front using specific DDL, which keeps the DDL out of the process execution (something I think is preferred from an Oracle perspective). Naturally, in SQL Server the DDL to create the temp table sits right in the process execution. So therein lies the problem.

I’m wondering if, in SqlRender, when we see a temp table being created, we remove the table creation entirely from the Oracle translation… but I think your first option might be easier to implement. I do think that in the Oracle case, global temporary tables shouldn’t be created on the fly. On the other hand, if we do something in SqlRender where we give a unique name to a global temporary table and then create the table during the script execution (and drop it later), then it’s basically the same thing as temp tables on SQL Server: only one person will ever see the table. And at that point, it doesn’t matter whether the table is a global temporary table or a permanent table that you create, if the script creates the table and then drops it later. I worry a little bit about what happens if a process fails (SQL Server temp tables will be garbage collected, but permanent tables that we create to ‘mock’ a SQL Server temp table will not)…

-Chris

Using ON COMMIT DELETE ROWS would not work for most of our applications, because they require the data to survive across multiple transactions (but still within the same session).

In an earlier discussion we were already contemplating whether we should use temp tables at all, given the fact that Oracle doesn’t really support them, and the SqlRender vignette already urges users not to use temp tables.

Can someone tell me what part of the WebAPI is making so much use of temp tables, and why?

At least for the Heracles team, here’s the script that causes this problem:

HERACLES_COHORT is the temp table.

Cohort definitions will make use of temp tables. Concept sets (which could expand into thousands of concepts) will dump their codes into a temp table for execution.

During cohort materialization, each sub-clause creates a temp table, which is then intersected with the other sub-clauses’ temp tables to create the final matching population.
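To sketch that pattern (a hypothetical example; the temp table names and predicates are invented, and stmt is assumed to be a java.sql.Statement on an open connection):

    // Each sub-clause dumps its matching persons into its own temp table...
    stmt.execute("SELECT person_id INTO #crit1 FROM condition_occurrence "
        + "WHERE condition_concept_id IN (SELECT concept_id FROM #codesets)");
    stmt.execute("SELECT person_id INTO #crit2 FROM drug_exposure "
        + "WHERE drug_concept_id IN (SELECT concept_id FROM #codesets)");
    // ...and the final matching population is their intersection.
    ResultSet matches = stmt.executeQuery(
        "SELECT person_id FROM #crit1 INTERSECT SELECT person_id FROM #crit2");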

Removing temp tables would take away a very large tool from our toolbox.

-Chris

@schuemie, with HERACLES we have a slightly different issue.

HERACLES creates, uses, then drops a global temporary table called HERACLES_COHORT. We could use a regular table there as well, no major issue. But where we run into problems is with simultaneous HERACLES jobs.

HERACLES runs can take hours, and all HERACLES runs create tables in the same schema (in our case, the OHDSI schema). So if a user wants to analyze a different cohort or another user wants to run a job, it will get messed up because OHDSI.HERACLES_COHORT already exists (whether as a global temp table or a regular table).

So ideally, Oracle would work just like SQL Server and we’d have session specific tables and that would be that. But I feel like Oracle does not make this easy.

Of course, for a given job, it really doesn’t matter if our table is called HERACLES_COHORT or RUMPLESTILSKIN or TABLE_123, since it is just created and dropped. So one (annoying but feasible) solution would be for any temporary table’s name to be declared as a variable up top, and, if Oracle, to basically generate a random table name (< 30 characters). Otherwise the default would be the standard name HERACLES_COHORT.

This seems beyond ridiculous and I hate to even propose it, but there you go.
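For what it’s worth, a sketch of that idea (the naming scheme is made up, ‘dialect’ is a placeholder variable, and it assumes java.util.concurrent.ThreadLocalRandom is imported); anything under Oracle’s 30-character identifier limit would do:

    // Default to the standard name; on Oracle, generate a random, short-enough name.
    String cohortTable = "HERACLES_COHORT";
    if ("oracle".equals(dialect)) {
        // 3-char prefix + up to 16 hex chars stays well under 30 characters
        cohortTable = "HC_" + Long.toHexString(ThreadLocalRandom.current().nextLong());
    }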

I understand that this discussion is comparing SQL Server and Oracle, but does anyone have any background on how other DB systems manage temp tables? On Postgres, when you create a temp table, is it like Oracle, where a temp table definition is defined once but each session has a local instance to write to? Or is it like SQL Server, where temp tables resemble any other table that you create and drop, except that they are automatically dropped when the session closes?

Just want to make sure we know whether or not we’re arriving at a general-purpose solution… not that a general-purpose solution necessarily exists here, just saying that one would be preferred.

-Chris

@Chris_Knoll:

But why do we need them? Why do we have to come to a conclusion? Can’t we just create tables (proper tables) and drop them when we are done? Or is there a special reason, other than the convenience of somebody cleaning up after you?

Well, a few reasons:

  1. The nice thing about temp tables is that if two different sessions create the same temp table name (like #codesets), there is no table collision, because those tables are only visible in the local session (see the sketch after this list).

  2. If something fails in the middle of a batch, you have to implement custom cleanup logic to delete any table you created (and only the tables you created), whereas temp tables are managed by the DBMS engine and cleaned up automatically.

  3. The DBMS can take special considerations when dealing with temp tables with regard to locking (only the local session has access to the temp table), so a lot of concurrency overhead can be eliminated, making processing more efficient.

  4. Temp tables are usually short-lived and relatively small, so they can be allocated in memory and then dumped to disk if they get too large, whereas permanent tables are always persisted (and logged).

  5. Repeatedly creating/dropping tables in a database can lead to fragmentation. This is not so much of an issue because we’ll be adding/deleting rows in non-temporary tables in the OHDSI framework anyway, but using temp tables just leads to ‘less’ DB fragmentation over time.

  6. Usually you don’t need special grant permissions on a database to create tables in temp space. Allowing this kind of permission in the normal database space usually makes database administrators nervous.
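To illustrate point 1, a small sketch (SQL Server semantics; the connection setup is a placeholder):

    // Two sessions create the same #codesets name without colliding:
    try (Connection a = DriverManager.getConnection(url);
         Connection b = DriverManager.getConnection(url)) {
        a.createStatement().execute("CREATE TABLE #codesets (concept_id INT)");
        b.createStatement().execute("CREATE TABLE #codesets (concept_id INT)"); // no name clash
        a.createStatement().executeUpdate("INSERT INTO #codesets VALUES (1)");
        // Session b's #codesets is still empty; each session only sees its own copy,
        // and both copies are dropped automatically when their connections close.
    }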

-Chris

@Chris_Knoll: PostgreSQL, SQL Server, RedShift, and the Microsoft PDW all have the same behavior when it comes to temp tables. Only Oracle is the bad kid in class.

I came to the same conclusion as @jon_duke (and I think @Chris_Knoll also mentioned it): for Oracle we need to give temp tables unique names.

What would be the least work for everyone is if I modify the translateSql() function so that, for Oracle, it adds a ‘session ID’ to the names of temp tables, and this session ID should be unique. So

CREATE TABLE #myTempTable (x INT);

would translate to

CREATE GLOBAL TEMPORARY TABLE MyTempTable_xyz123 (x INT) ON COMMIT PRESERVE ROWS;

However, we need to think about how and where this session ID is generated, because multiple calls to translateSql() would need to use the same ID (e.g. when we first create a temp table, then read from it, and then need to drop it). In R, we could generate a random session ID per R session (so one per JVM instance). In the WebAPI I don’t know what the correct scope would be. Ideas?

My suggestion is: you look in sqlTranslate for #table for temp tables, and you also have to seek out the corresponding DROP #table. So, if you’re in Java anyway, I’d say tack on System.currentTimeMillis() or something like that to the end of the table name. The result should be < 30 chars long, so that gives you a good table name.

EDIT: sorry, I didn’t understand what you said about multiple sqlTranslates… hmm, I’ll have a think on it.

Actually, maybe it will still work: you just generate the currentTimeMillis value as a thread-local variable at the start, and use that number as a suffix for all temp tables. Individual temp table names need to be unique anyway, right?

-Chris

OK, forget that idea. What about if SqlTranslate was an actual instance class, where when you create the class you set up a member variable initialized to a key, and use that key (still current time millis) to do your temp table suffixing?
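Something like this (purely hypothetical; SqlTranslate isn’t an instance class today, and these names are invented):

    public class StatefulSqlTranslate {
        // Key is fixed at construction, so every temp table translated through
        // this instance gets the same suffix: create/use/drop all line up.
        private final String key = Long.toHexString(System.currentTimeMillis());

        public String translate(String sql) {
            // e.g. "#myTempTable" becomes "myTempTable_<key>" for Oracle
            return sql.replaceAll("#(\\w+)", "$1_" + key);
        }
    }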

@schuemie, I thought it might be the case that one of these scripts would be managing its own transactions. However, I don’t see any evidence of this in the script we were using (Charity posted it to this thread). Can you point me to one that does manage its own transactions?

Yeah, I’m unclear on the use case behind the comment about needing to know the session ID for multiple calls to translateSql. In the cases I’ve seen, translate is called once, then split, executed, done. Your example of “create temp table, read from it, then drop it”: isn’t that a single translateSql invocation? Wouldn’t the state reside within the template SQL file you are translating? That appears to be in line with the example I have seen.

You’re likely already familiar with this, but in Oracle you could consider userenv('sessionid').
If you need the session ID across translateSql invocations…

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String doIt() {
        final String sessionId = getTransactionTemplate().execute(new TransactionCallback<String>() {

            @Override
            public String doInTransaction(final TransactionStatus status) {
                // Oracle audit session ID; stable for the lifetime of the session
                String sessionId = getJdbcTemplate().queryForObject("select userenv('sessionid') from dual", String.class);
                log.info("sessionId: " + sessionId);
                //translateSql
                //execute
                //translateSql
                //execute
                String sessionId2 = getJdbcTemplate().queryForObject("select userenv('sessionid') from dual", String.class);
                log.info("sessionId2: " + sessionId2);
                // Same transaction, same connection, so the session ID is unchanged
                Assert.isTrue(sessionId.equals(sessionId2));
                return sessionId;
            }
        });
        log.info("Ret session: " + sessionId);

        //new sessionId (this query may run on a different pooled connection)
        return this.getJdbcTemplate().queryForObject("select userenv('sessionid') from dual", String.class);
    }

Chris’ idea of making SqlTranslate stateful seems like a good one. I just don’t have a good understanding of the case you are describing, or of the transaction practices involved.

Note: using the database session ID could be another factor limiting the use of connection pooling (from a temp table uniqueness perspective)…

Here is another HUGE advantage of temp tables:

Typically, even a user with only read rights (a typical warehouse user) has the ability to create temp tables, but does not have any “personal full schema/database” for his interim tables.

So temp tables are great because almost everyone can create them.

If we kill them, many users will have to ask their admins to modify their logins to grant rights to some “result_schema”.

Temp tables solve this issue nicely.

For example, in my institution, on the production server, I have no “personal schema” (like I have on the dev server).
So temp tables are a blessing for queries on production.
Let’s try to keep them in our scope.

@alfranke, another thought on this: could it be that the deadlocks you are seeing are because the script is attempting to drop/recreate the global temp table, and that drop/recreate needs to block because there is an open transaction reading that table in some context? If you were to alter the script (in a test, not altering it in GitHub) so that it doesn’t drop/create the global temp table, and then run 3 threads, I’m wondering if they would block each other when they don’t execute the drop. From everything I have read, they shouldn’t, because each thread will have its own context (unless connection pooling actually shares the same connection instance across callers, in which case we’d have issues).

Now, on the matter of ensuring each call uses its own context: have you tried doing a beginTransaction at the start of the batch and then a .commit after the split SQL is executed? This way we could set ‘on commit delete rows’, and the rows would live through each iteration of split SQL (because the transaction would still be active). I BELIEVE that in a connection pool context, if someone starts a transaction, the connection becomes exclusive to the caller, such that another person grabbing a connection from the pool will not get a connection enrolled in a transaction. If that’s the case, then you can go back to enabling connection pools.
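A minimal JDBC sketch of what I mean (dataSource and splitSql are placeholders, and note that any DDL in the batch would implicitly commit on Oracle):

    // Hold one transaction open across all the split statements, so rows in
    // transaction-scoped ('on commit delete rows') GTTs survive between them.
    try (Connection conn = dataSource.getConnection()) {
        conn.setAutoCommit(false);           // beginTransaction
        try (Statement stmt = conn.createStatement()) {
            for (String sql : splitSql) {    // statements from the split SQL
                stmt.execute(sql);           // no commit yet, GTT rows persist
            }
            conn.commit();                   // 'on commit delete rows' clears the GTTs here
        } catch (SQLException e) {
            conn.rollback();                 // rollback also clears the GTT rows
            throw e;
        }
    }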

-Chris

Yes, we are using transactions and were using pooling, which is why I suggested changing to use ‘on commit delete rows’. @schuemie indicated he thought that wasn’t an option. I’m still trying to get clarification on why (i.e. which use cases / scripts manage their own transactions, explicitly committing, for example).
Yes, it is likely that the locking was due to the create/drop attempts combined with the previous session/connection pooling and ‘on commit preserve rows’ behavior.

I think it’s because in the R context there’s no beginTransaction/commit available in the API to call, so from R’s perspective, it always commits after each batch statement.

This is only a guess.

@alfranke: Most often we use the temp tables like this:

  1. Create the temp table
  2. Do some inserts and deletes
  3. (Create other temp tables by joining with the first temp table)
  4. Pull the data to the client using a SELECT query
  5. Drop the temp table

Note that in the current SqlRender implementation, you can’t do this in a single call, because the data pull (step 4) needs to be a call of its own. Also, to my knowledge, those 5 steps can’t be combined into a single transaction (at least the DELETE would cause problems without a commit). Furthermore, it seems to me the real problem is when people try to create the same temp table (in Oracle), and I don’t think we want to create all possible temp tables beforehand (which is what Oracle seems to want), because we have so many different flavors of them.
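For concreteness, a hypothetical sketch of that five-step pattern in SQL Server dialect (table names and columns invented; stmt is a java.sql.Statement), with the data pull as its own call:

    stmt.execute("CREATE TABLE #scores (person_id INT, score INT)");         // 1. create the temp table
    stmt.executeUpdate("INSERT INTO #scores VALUES (1, 10)");                // 2. some inserts...
    stmt.executeUpdate("DELETE FROM #scores WHERE score < 5");               //    ...and deletes
    stmt.execute("SELECT person_id INTO #top FROM #scores WHERE score > 5"); // 3. derived temp table
    ResultSet rs = stmt.executeQuery("SELECT person_id FROM #top");          // 4. pull to client (its own call)
    stmt.execute("DROP TABLE #scores");                                      // 5. drop
    stmt.execute("DROP TABLE #top");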

I agree we want to keep the temp tables, as @Chris_Knoll and @Vojtech_Huser have argued. To make them workable in Oracle, we’ll need to make their names unique. We can have translateSql() do this for us using a randomly generated ‘session ID’, but we need to know the correct scope in which to use a single session ID (assuming we want to keep using connection pooling, which I think we really do).

I like the suggestion of @Chris_Knoll to require users to instantiate the SqlTranslate class. Another option would be to make the whole approach explicit: before calling translateSql(), you create a session ID (we could add a function for that to SqlRender) and pass it to translateSql() to be used in Oracle. If no session ID is provided, the default behavior is to generate one ID per JVM (i.e. populate a static variable).

I’ve created a SqlRender branch called unique_oracle_temp_tables, implementing the explicit approach. When using SqlRender from Java, the following code:

String sessionId = SqlTranslate.generateSessionId();
String sql = "CREATE TABLE #covariates (x int);";
sql = SqlTranslate.translateSql(sql, "sql server", "oracle", sessionId);

will produce sql like this:

CREATE GLOBAL TEMPORARY TABLE tmpqibdgfqc0f_covariates (x int) ON COMMIT PRESERVE ROWS;

The session ID is just a randomly generated string. It is up to you to decide the correct scope to use it in, and when to generate a new session ID.

When the sessionId parameter of translateSql() is set to null, a global session ID will be generated once and used throughout the lifetime of the JVM. This is the current behaviour when using the SqlRender package in R.
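For the default case, something along these lines (a sketch only; the actual code in the branch may differ):

    // One lazily initialized session ID shared by the whole JVM.
    private static String globalSessionId;

    public static synchronized String getGlobalSessionId() {
        if (globalSessionId == null) {
            globalSessionId = SqlTranslate.generateSessionId(); // generated once per JVM
        }
        return globalSessionId;
    }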

Anyone care to give this a try?
