This confusion seems to be quite common, so I would like to summarize the problem at hand once more:
1) There is an Impala schema that needs to be defined - a.k.a. the interface - with all the proper table names, field names, field types, storage format, etc. This schema is the interface that applications and methods are then coded against, so it needs to be valid and represent the data as it should be.
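As a minimal sketch of what such a clean interface DDL might look like (the table and column names here are purely illustrative, not the actual schema):

```sql
-- Hypothetical interface DDL: proper names, typed columns,
-- and an explicit storage format.
CREATE TABLE events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  user_id    STRING,
  amount     DECIMAL(12,2)
)
STORED AS PARQUET;
```

The point is that this DDL carries no loading logic at all - it only declares the contract the data must satisfy.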
2) There need to be "loader scripts" that take the data, massage it, and load it into that schema (the interface) so that it fits, including the types. Keep in mind that this step is often optional: ideally the data is properly formatted for the target schema during the ETL conversion, so that the Impala views can just be mapped onto it.
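A common Impala loading pattern, sketched here with hypothetical names (assuming a typed target table `events` defined by the interface DDL), is to stage the raw delimited files as strings and then cast into the typed table:

```sql
-- Hypothetical loader script: stage raw CSV files as strings,
-- then massage/cast them into the typed interface table.
CREATE EXTERNAL TABLE events_staging (
  event_id   STRING,
  event_time STRING,
  user_id    STRING,
  amount     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/events';

INSERT INTO events
SELECT CAST(event_id   AS BIGINT),
       CAST(event_time AS TIMESTAMP),
       user_id,
       CAST(amount     AS DECIMAL(12,2))
FROM events_staging;
```

If the ETL step already emits correctly typed files (e.g. Parquet matching the interface), this whole staging/casting step can be skipped.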
This problem is in no way different for Impala than for any other data warehouse, e.g. Oracle, SQL Server, Redshift, etc. - loader scripts are used for all of them.
So, I am proposing to fix the schema for #1 - see attached. If we also need to share different loader scripts, that is great too, but they need to be kept separate from the clean DDL that defines the interface.