OHDSI Home | Forums | Wiki | Github

Automatically generate SQL code for ETL from RabbitInAHat

I like RabbitInAHat, it is a useful tool for designing ETL plans. It will be more useful that it could also automatically generate SQL code for ETL. Is there any existing effort for doing this?

The ETL Best Practice page says:

Note: after several independent attempts, we have given up on developing the ‘ultimate’ user-friendly ETL tool. It is always the case that tools like that work really well for 80% of the ETL, but for the remaining 20% of the ETL some low-level code needs to be written that is specific to a source database

I am not sure if this ultimate tool covers auto sql code generation. Anyone knows.

In my opinion, it should be possible to allow auto generation of sql code by

  1. defining a ETL syntax so that we can specify our ETL logic using this syntax in Rabbit-In-A-Hat.
  2. defining the mapping between this middle-layer syntax and any RMDBS language

Anyone knows about this?

Im one of those that tried to create the ultimate user-friendly ETL tool and failed.

Just an example to explain why I think it won’t work in RiaH: Many times a source table will have a column structure, like 10 diagnoses code columns per row, that needs to be transformed into a row presentation, so one row per diagnosis code. Currently in RiaH we just draw arrows from each diagnose field to fields like condition_concept_id, and add a comment like ‘Create one row per code’, and the ETL programmer will understand what needs to be done. Adding features to RiaH to fully specify what you mean will make it extremely complicated. RiaH is explicitly designed to facilitate a multi-disciplinary ETL design session, and we don’t want to get into nitty gritty implementation details like these during such a session, or we’ll lose oversight.

Also, sometimes there are operations that are just not easy to think of beforehand. Another simple example is a database where all source provider IDs have a leading ‘P’, like ‘P0000001’, and in our ETL we just want to remove the P to get an integer to populate the provider_id in the database. Do we really want to implement all possible string operations in RiaH?

After having played this game for some time, I and others like me came to the same conclusion: there is already a fully expressive ETL syntax. It is called SQL :wink:

Interesting topic, we were thinking about the same functionality. The translation from RiaH documentation to SQL sometimes is a lot of repeating work that can be automated really well.

We do agree that you can’t put all operations in RiaH, and it is impossible to generate fully satisfying ETL scripts from it. However, it should be possible to generate rudimentary sql scripts that can be (slightly) modified to produce the final ETL scripts.

Has anyone already tried such a simple approach? Was it worth the effort?

An example:
With some additional logic annotations (YEAR, MONTH, DAY in this case) and supplying constants for certain fields (8552 and 0 for race and ethnicity), you can get from this RiaH diagram to the sql beneath.
Another, easier, possibility is to supply all the logic as comments in the SQL scripts.

To

INSERT INTO omopcdm.person
(
	person_id, 
	person_source_value, 
	year_of_birth, 
	month_of_birth, 
	day_of_birth, 
	gender_concept_id, 
	gender_source_value, 
	race_concept_id, 
	ethnicity_concept_id
)
SELECT  
	PatientId, 
	PatientId,
	YEAR(BirthDate), 
	MONTH(BirthDate), 
	DAY(BirthDate),
	Gender
	Gender,		
	8552,
	0
FROM source.data_pat
t