Date support in Impala

tomwhite · April 13, 2017, 2:53pm

I wanted to summarize the issues we’ve been having with Data support in Impala, so we can hopefully find a solution to fix them.

Impala does not support the DATE type, so in the Common Data Model we used VARCHAR(8) to represent DATE columns. For example, the observation_period_start_date column in observation_period is a VARCHAR(8).

CREATE TABLE observation_period (
 observation_period_id INTEGER,
 person_id INTEGER,
 observation_period_start_date VARCHAR(8), -- DATE
 observation_period_end_date VARCHAR(8), -- DATE
 period_type_concept_id INTEGER
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1")
;

This representation makes it trivial to import data from synpuf, since the dates are formatted as strings of length 8 (YYYYMMDD). However, it makes it hard for Impala to query these fields since it has to parse the timestamps. This was accomplished by doing the parsing in the SqlRender layer.

Unfortunately this was still not sufficient, since every time a DATE field is read it must be cast explicitly (due to Impala’s tendency to require casts), however in Achilles these casts are not always present - and indeed why should they be if you have a DATE field you shouldn’t have to cast it to DATE to use it?

To see a concrete example of this, see query 109 in Achilles:

 -{109 IN (@list_of_analysis_ids)}?{
-- 109	Number of persons with continuous observation in each year
-- Note: using temp table instead of nested query because this gives vastly improved performance in Oracle

IF OBJECT_ID('tempdb..#temp_dates', 'U') IS NOT NULL
	DROP TABLE #temp_dates;

SELECT DISTINCT 
  YEAR(observation_period_start_date) AS obs_year,
  CAST(CONCAT(CONCAT(CAST(YEAR(observation_period_start_date) AS VARCHAR(4)), '01'), '01') AS DATE) AS obs_year_start,
  CAST(CONCAT(CONCAT(CAST(YEAR(observation_period_start_date) AS VARCHAR(4)), '12'), '31') AS DATE) AS obs_year_end
INTO
  #temp_dates
FROM @cdm_database_schema.observation_period
;

INSERT INTO @results_database_schema.ACHILLES_results (analysis_id, stratum_1, count_value)
SELECT 
  109 AS analysis_id,  
	CAST(obs_year AS VARCHAR(255)) AS stratum_1,
	COUNT_BIG(DISTINCT person_id) AS count_value
FROM @cdm_database_schema.observation_period,
	#temp_dates
WHERE  
		observation_period_start_date <= obs_year_start
	AND 
		observation_period_end_date >= obs_year_end
GROUP BY 
	obs_year
;

The salient part is that observation_period_start_date is a VARCHAR(8), but obs_year_start is a DATE. This only works under Impala if observation_period_start_date is cast to a DATE (which SqlRender turns into a TIMESTAMP).

So how can fix this? I think we need to change the representation of DATE in CDM. Rather than storing as a VARCHAR(8), we should just store as a TIMESTAMP. To implement this we would load into tables as we currently do, but then have a second step where a copy of the tables is made (in a separate database schema), using a CTAS. This would also give us the opportunity to change the underlying format to Parquet, which is preferred generally for efficiency. I’ve started down this route in https://github.com/tomwhite/CommonDataModel/commit/1d4b050d23a764720a7f92d92abe0266eb5110dc, but I haven’t managed to test it yet.

admin · April 14, 2017, 11:26am

@tomwhite

I believe the OHDSI vocabulary work group has made the decision to add TIMESTAMP columns to a future release of the CDM. Perhaps someone from that work group can confirm that & provide a timeline for when that change will be implemented?

See this wiki page: http://www.ohdsi.org/web/wiki/doku.php?id=documentation:next_cdm:time

The following code updates can be made after OHDSI adds the new TIMESTAMP columns to the CDM:

Update the Impala CDM DDL to replace the workaround VARCHAR(8) columns with
TIMESTAMP columns (which are supported by Impala).
Update the Achilles code to make use of the new TIMESTAMP columns. It may be necessary to add a new ‘use_timestamps’ parameter to the Achilles code so we can maintain backwards compatibility with the existing DATE columns.

hripcsa · April 14, 2017, 11:54am

CDM version 5.1 incorporated timestamps as *_DATETIME, but they are optional, and DATE Is still required and used by the OHDSI tools.

George

admin · April 15, 2017, 2:51pm

Thanks for that info @hripcsa

@Christian_Reich or @rimma is there any estimate for when CDM version 5.1 will be officially released? Do you think maybe some time in Q2 or Q3?

@tomwhite I suggest you target CDM 5.1 for implementing timestamps in the Impala CDM DDL and associated achilles code updates,

gregk · April 16, 2017, 12:46pm

Here is my 2 cents on a problem.

At the moment, I am not quite sure I understand the dependency on CDM 5.1? Between choosing VARCHAR(8) or TIMESTAMP, TIMESTAMP would be the a better choice as it also able to represent dates anyway.

There are two issues here, as I see it:

Properly format source CDM data so it would correct map/load into Impala TIMESTAMP definitions. In this specific example, we should just regenerate SynPUF data with TIMESTAMP being represented properly. How data is generated in a physical file is up to the OMOP CDM Conversion code, and it can be adjusted to target a certain type of a data store to generate dates in a format that would fit into TIMESTAMP.
Make modifications to SQLRenderer to work with TIMESTAMP properly.

As far Parquet is concerned, since Impala is typically used for analytical queries, using Parquet should really be the best practice. And that would be a great conversation for another thread to have.

As community, we would definitely benefit from a simple “Hadoop for OHDSI” architecture blueprint - e.g. what components are used for CDM conversation, how data should be formatted and stored, load into analytical views etc…, as well as outlining some best practices.

Chris_Knoll · April 16, 2017, 2:44pm

I have a question: since DATE is an ANSI SQL92 standard datatype, shouldn’t we expect Impala to provide support for that type of column? I’m wary of changing the CDM spec to account for a platform-specific limitation.

-Chris

lee_evans · April 16, 2017, 3:05pm

@gregk I agree with the points in your post.

The dependency on cdm 5.1 is it’s the version where the new timestamp columns are being introduced by the vocab work group. They are not being added for impala but impala can make use of them.

TPMoyer · April 16, 2017, 6:09pm

There are two threads here, one is DataTypes for timey-wimey collumns as the CDM is ported to big data platforms, the second is support in Impalla.
It should be noted that both Parquet and ORC support the Date format, however the Cloudera implementation of Parquet does not support dates.
We’ve explored three solutions to the “how to represent dates?” question.

as strings. Here one still needs to make a choice of format, with ISO-8601 compliant solutions having an advantage of being alphabetic orderering == chronologic ordering. This is a feature the standard US mm-dd-yyyy does not have.
Strings are contraindicated by slowness in searching by programs, and cumbersome sql needed to parse them.
yyyyMMdd as integers. This yields fast compare but has the disadvantage of not getting uniform differences between adjoining dates. The difference between Jan 1’st 2017 and Dec 31’st 2016 is 1 day, but the date-as-integer representations (20170101-20161212) = 8889.
Date differencing comes up in frequently, with RWE datatest translation into observation_period being one example.
as timestamps, with the time portion set to 00:00:00.000000 This works, and allows all vendor implementations of parquet.

On the Impalla front:
We’ve had good experience with an alternate way of getting data scientists to be able to play with their data: Zeppelin https://zeppelin.apache.org/ can read parquet from a cluster, or from S3, and allows quick sql interaction.
Two example paragraphs from a notebook might be a useful starting point. The first one invokes spark and gives an initial peek at the data, the second paragraph: %sql gets you into an interactive sql environment. Python, and R, and several others are also available

%
val = “/omop/raw/vendor/dataSetName/instance/tablename”
//val fid=“s3://tmmoj012938/raw/vendor/dataSetName/instance/tableName/tableName_0in0to0.gz.parquet”
val df=spark.read.parquet(fid)
println(df.schema)
println(df.count)
val maxWidth=80
df.show(5,80)
df.createOrReplaceTempView(“myTable”)

%sql
select * from myTable limit 5

t_abdul_basser · April 17, 2017, 6:12pm

@tomwhite Thanks for the summary; it is very helpful.

I agree with the approach you suggest in general but would second @lee_evans suggestion, to wit, that we target CDM v5.1 given the addition of the TIMESTAMP columns.

tomwhite · April 18, 2017, 10:56am

Thanks for all the comments and suggestions. As @hripcsa points out CDH 5.1 still has DATE fields, so the problem remains.

I have created a PR for CDM with the changes I described above: https://github.com/OHDSI/CommonDataModel/pull/46. I’ve also updated the instructions for Achilles to use the new database schema: https://github.com/OHDSI/Achilles/pull/199. Note that no code changes to Achilles are needed.

Finally, I had to revert a change in SqlRender for Impala to get Achilles query 109 to run correctly: https://github.com/OHDSI/SqlRender/pull/67

Vojtech_Huser · April 18, 2017, 5:07pm

I have the same question as Lee:

@Christian_Reich or @rimma is there any estimate for when CDM version 5.1 will be officially released? Do you think maybe some time in Q2 or Q3?

I even made it an issue.

The impala changes as pull request to CDM (by tom white) now target the master branch.

What is the long term strategy for the CDM DDLs. Will branches be the CDM versions? (I don’t see this as ideal - no way to make a pull request for a specific CDM version)