I wanted to summarize the issues we’ve been having with Data support in Impala, so we can hopefully find a solution to fix them.
Impala does not support the DATE type, so in the Common Data Model we used VARCHAR(8) to represent DATE columns. For example, the observation_period_start_date column in observation_period is a VARCHAR(8).
CREATE TABLE observation_period (
observation_period_id INTEGER,
person_id INTEGER,
observation_period_start_date VARCHAR(8), -- DATE
observation_period_end_date VARCHAR(8), -- DATE
period_type_concept_id INTEGER
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1")
;
This representation makes it trivial to import data from synpuf, since the dates are formatted as strings of length 8 (YYYYMMDD). However, it makes it hard for Impala to query these fields since it has to parse the timestamps. This was accomplished by doing the parsing in the SqlRender layer.
Unfortunately this was still not sufficient, since every time a DATE field is read it must be cast explicitly (due to Impala’s tendency to require casts), however in Achilles these casts are not always present - and indeed why should they be if you have a DATE field you shouldn’t have to cast it to DATE to use it?
To see a concrete example of this, see query 109 in Achilles:
-{109 IN (@list_of_analysis_ids)}?{
-- 109 Number of persons with continuous observation in each year
-- Note: using temp table instead of nested query because this gives vastly improved performance in Oracle
IF OBJECT_ID('tempdb..#temp_dates', 'U') IS NOT NULL
DROP TABLE #temp_dates;
SELECT DISTINCT
YEAR(observation_period_start_date) AS obs_year,
CAST(CONCAT(CONCAT(CAST(YEAR(observation_period_start_date) AS VARCHAR(4)), '01'), '01') AS DATE) AS obs_year_start,
CAST(CONCAT(CONCAT(CAST(YEAR(observation_period_start_date) AS VARCHAR(4)), '12'), '31') AS DATE) AS obs_year_end
INTO
#temp_dates
FROM @cdm_database_schema.observation_period
;
INSERT INTO @results_database_schema.ACHILLES_results (analysis_id, stratum_1, count_value)
SELECT
109 AS analysis_id,
CAST(obs_year AS VARCHAR(255)) AS stratum_1,
COUNT_BIG(DISTINCT person_id) AS count_value
FROM @cdm_database_schema.observation_period,
#temp_dates
WHERE
observation_period_start_date <= obs_year_start
AND
observation_period_end_date >= obs_year_end
GROUP BY
obs_year
;
The salient part is that observation_period_start_date is a VARCHAR(8), but obs_year_start is a DATE. This only works under Impala if observation_period_start_date is cast to a DATE (which SqlRender turns into a TIMESTAMP).
So how can fix this? I think we need to change the representation of DATE in CDM. Rather than storing as a VARCHAR(8), we should just store as a TIMESTAMP. To implement this we would load into tables as we currently do, but then have a second step where a copy of the tables is made (in a separate database schema), using a CTAS. This would also give us the opportunity to change the underlying format to Parquet, which is preferred generally for efficiency. I’ve started down this route in https://github.com/tomwhite/CommonDataModel/commit/1d4b050d23a764720a7f92d92abe0266eb5110dc, but I haven’t managed to test it yet.