OHDSI Home | Forums | Wiki | Github

Getting rid of .dbo. in our parameterized SQL

I had always figured the .dbo. you have to add to the schema name in SQL Server was just an annoyance introduced by Microsoft just because they can, but as pointed out here, the dbo part is actually the schema name, and what I thought of as the schema name is in fact the database name. (To my defense, every DBMS has a different interpretation of what database, schema, user, and table means). As a consequence, people using a different schema name than dbo are out of luck when using any of our applications.

A solution I would like to suggest is to make dbo part of the schema name parameter. So where we previously had

SELECT * FROM @cdm_schema.dbo.person;

with parameter value cdm_schema = 'myCdmSchema' we will now have

SELECT * FROM @cdm_schema.person;

with parameter value cdm_schema = 'myCdmSchema.dbo'

To transition to this solution, I suggest we follow these steps:

  1. Modify SqlRender’s translateSql() function so it generates a warning when your parameterized SQL still contains.dbo.
  2. Modify all parameterized SQL in our applications to no longer use .dbo. (including ACHILLES, HERCULES, the methods library, and everything else)
  3. After a grace period, remove the translation rule and warnings from SqlRender, meaning that everyone that hasn’t done step 2 will see their code break on any platform other than SQL Server

Any thoughts?

Makes sense and this is actually the way HERMES works today by replacing the schema token with [Database].dbo

Hmmm, one thing I hadn’t thought of: the USE command only accepts the databasename, and we use it a lot. So if I say cdm_schema = 'myCdmDatabase.dbo' I can’t do

USE @cdm_schema;

Furthermore, if I have two schema’s in the same database I’m in trouble. For example, if I have both myCdmDatabase.dbo.person and myCdmDatabase.dbx.person, then what happens if I do

USE myCdmDatabase;
SELECT * FROM person;

@glarimer and @Vojtech_Huser, do you have any suggestions?

Anything in particular you are gaining by using the use command instead of the schema variable?

The most important thing is that we’re currently using USE as a hack to still be able to use temp tables in Oracle. If in SQL Server you say

USE myDb;

CREATE TABLE #myTempTable (...);

this will translate to this Oracle SQL:

ALTER SESSION SET current_schema = myDb;

CREATE GLOBAL TEMPORARY TABLE myTempTable  (...) ON COMMIT PRESERVE ROWS;

where the temp table will be created in the myDb schema on Oracle, and I can make sure I have write privileges in that schema.

Maybe what is required is separate parameters for database and schema so that you can separate the two. Maybe with those 2 paramaters you can make a table prefix that concats the two together. and we can put into our scripts @schema where we need the schema and @database where we need the database. I assume that it’s always the case that we’d want to alias our tables as {database}.{schema].

The only question i have is, does the @schema contain the . at the end or not, so do you include it in the query as select * from @schema.tablename or select @ from @schematablename? I’m thinking that you wouldn’t include the tailing dot in the paramter and just assume that it contains a valid value such that you can do @schema.tablename.

My last thought is that maybe you can parse out the database part of a CDM_SCHEMA parameter by splitting on the ‘.’. You’d then inject the @database parameter implicitly and all of the scripts that need to use USE can use @database.

So, long story short: my opinion is that CDM_SCHEMA should be passed in as {server}.{schema} and in our scripts we refer to it as @CDM_SCHEMA.Tablename (I’m in agreement with you Martijn). This should work for all environments.

-Chris

Ok, let me see if I understand you correctly. There would be two parameters to the script: @database_schema = 'myCdmDb.dbo' and @database = 'myCdmDb' (I’m using the SQL Server lingo here) which you could then use like this in paramterized SQL:

SELECT * FROM @database_schema.person;

USE @database;

You can make life easier for users by deriving the @database parameter from the @database_schema parameter automatically, for example by using the strsplit() function in R:

databaseSchema <- "myCdmDb.dbo"
database <- strsplit(databaseSchema ,"\\.")[[1]][1]
database 
#> myCdmDb

For all other DBMSs, there would be no period in @database_schema, so the @database parameter would have the same value.

I think this should work. Does anyone have (major) objections?

I believe SQL Server is the only database that uses the database.schema
syntax. All other systems use only ‘database’. So, I think @schuemie 's
original proposal, to embed the ‘.dbo’ in the database parameter seems to
make the most sense, since it accommodates SQL Server users to have
whatever database and schemas they want, but also allows Oracle/Postgres to
specify only the database for their needs. I think creating an extra
parameter and selectively ignoring it for select environments is
unnecessary.

(I brought Patrick up to speed, so ignore his previous post)

I’ve implemented the suggested solution in the CohortMethod package in this commit. This should give developers of other packages a good idea how to fix theirs. In summary (using cdmSchema as an example):

  • Rename parameters of the R function using parameterized SQL from cdmSchema to cdmDatabaseSchema
  • Change the help for the function to indicate that for SQL Server the cdmDatabaseSchema should include ‘.dbo’ at the end (or whatever the schema is).
  • Within the R function, add something like this: cdmDatabase <- strsplit(cdmDatabaseSchema ,"\\.")[[1]][1]
  • In the parameterized SQL, rename the @cdm_schema parameter to @cdm_database_schema, and remove any `.dbo’ you can find.
  • In the parameterized SQL, add a parameter @cdm_database, and use that when calling the USE function instead of @cdm_database_schema.
  • Don’t forget to pass the cdmDatabase parameter in R to the @cdm_database parameter in SQL when calling SqlRender.

I think the Java users of the SqlRender Jar are fine, because they were already doing something similar (at least in WebAPI). For now, nothing will change in the SqlRender package (although somewhere in the future we can drop the translation rules for ‘.dbo.’.)

I agree, this sounds like a good strategy to use moving forward.

I will take the action to go back and apply this convention to the
templateSQL for ACHILLES and HERACLES. @Chris_Knoll should bake this
approach into the alpha version of CIRCE that the cohort definition WG is
building. We do not need to revise the treatment pathways analysis, since
collaborators have been able to successfully run the program without this
modification.

Thank you @schuemie in particular as we are probably the one place with both Oracle and DB write restrictions. And thanks @Patrick_Ryan re: the updated Heracles SQL.

I agree with the final solution.
I also created a wiki page for people who want to write parametized SQL and translate using SqlRender package.

Use this link to document best practices
http://www.ohdsi.org/web/wiki/doku.php?id=development:sql

Hi @Vojtech_Huser! We’ve been using the following principle when it comes to documentation:

  • For R packages we use the standards for documenting those, so use in-line documentation (using Roxygen) to document functions which are collected in a package manual, and optionally vignettes (using KNITR) to describe how to use a package.

  • For everything else we use the wiki

For SqlRender there is a both a package manual and a vignette in the [Github repo] (https://github.com/OHDSI/SqlRender). I’ve just added a section on the whole database / schema discussion to the vignette.

Given this change, I am planing to remove from R code the old convention.

The old code tried to work with schema and database. With convention to have .dbo added by user, we can remove some parameters. Is this correct?

achillesHeel <- function (connectionDetails, 
                      cdmDatabaseSchema, 
                      oracleTempSchema = cdmDatabaseSchema,
                      resultsDatabaseSchema = cdmDatabaseSchema,
                      cdmVersion = "4",
                      vocabDatabaseSchema = cdmDatabaseSchema){


  #not needed now
  resultsDatabase <- strsplit(resultsDatabaseSchema ,"\\.")[[1]][1]
  vocabDatabase <- strsplit(vocabDatabaseSchema ,"\\.")[[1]][1]


  
  if (cdmVersion == "4")  {
    heelFile <- "AchillesHeel_v4.sql"
  } else if (cdmVersion == "5") {
    heelFile <- "AchillesHeel_v5.sql"
  } else  {
    stop("Error: Invalid CDM Version number, use 4 or 5")
  }

  #only schema parameters can be kept      
  heelSql <- loadRenderTranslateSql(sqlFilename = heelFile,
                                    packageName = "Achilles",
                                    dbms = connectionDetails$dbms,
                                    oracleTempSchema = oracleTempSchema,
                                    cdm_database_schema = cdmDatabaseSchema,
                                    #results_database = resultsDatabase,
                                    results_database_schema = resultsDatabaseSchema,
                                    #vocab_database = vocabDatabase,
                                    vocab_database_schema = vocabDatabaseSchema
  );

@schuemie , can you confirm?

I can confirm this: we don’t need to split out a ‘database’ value from the ‘database_schema’ parameter. We should assume that the table qualifier is completely stored in '*database_schema" (ie: cdm_database_schema or results_database_schema or vocab_database_schema" and it’s the responsibility of the caller (the person invoking the method) to put the appropriate ‘table qualifier’ in this parameter. The parameters used to store a ‘database’ value should be removed.

t