Hi Vojtech,
With larger conversions, our best practice at Odysseus is to break the ETL down into passes to simplify the joins and the logic that determines where the source data will end up in the CDM. I will explain at a high level how this works, and I hope it is helpful.
Pass 1) Staging - for the scoped source data that will be part of the conversion, isolate this data into its own set of tables 1:1 from the data source. This can be in a “stg” schema, for example.
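As a minimal sketch of the staging pass (the source schema and table names here, such as source.diagnosis, are just placeholders for your own):

```sql
-- Sketch: copy the scoped source tables 1:1 into the "stg" schema.
-- Schema and table names (source.*, stg.*) are illustrative only.
CREATE TABLE stg.diagnosis AS
SELECT * FROM source.diagnosis;

CREATE TABLE stg.problem_list AS
SELECT * FROM source.problem_list;

CREATE TABLE stg.patient AS
SELECT * FROM source.patient;
```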
Pass 2) Cleansing/Filtering - apply data cleansing rules to the source data where applicable to CDM conventions. In some cases cleansing may need to be performed later, for different rules identified in later steps. Filtering out data that is not wanted or does not follow OMOP conventions is also a consideration at this step.
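A cleansing/filtering step could look something like this sketch (the specific rules and column names, like diagnosis_code and diagnosis_date, are assumptions, not rules from your source):

```sql
-- Sketch: keep only usable rows and normalize obvious problems.
-- Column names (diagnosis_code, diagnosis_date) are hypothetical.
CREATE TABLE stg.diagnosis_clean AS
SELECT
    person_source_id,
    TRIM(UPPER(diagnosis_code)) AS diagnosis_code,
    diagnosis_date
FROM stg.diagnosis
WHERE diagnosis_code IS NOT NULL      -- filter records that cannot be mapped
  AND diagnosis_date IS NOT NULL;     -- CDM convention: events need a date
```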
Pass 3) Dimension - in most cases the dimension table logic is straightforward: concepts can be mapped via case/decode logic, or joined to the Athena vocabulary in a simple way, to identify most of the concept IDs that will be used.
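For instance, a simple PERSON load might map gender with case/decode logic (the standard concepts 8507/8532 are the usual OMOP gender concepts; the source table and its 'M'/'F' values are assumptions about your data):

```sql
-- Sketch: populate PERSON with gender mapped via CASE logic.
-- Source table/columns and the 'M'/'F' values are hypothetical.
INSERT INTO cdm.person (person_id, gender_concept_id, year_of_birth,
                        race_concept_id, ethnicity_concept_id)
SELECT
    ROW_NUMBER() OVER (ORDER BY person_source_id) AS person_id,
    CASE gender_code
        WHEN 'M' THEN 8507   -- standard concept for male
        WHEN 'F' THEN 8532   -- standard concept for female
        ELSE 0
    END AS gender_concept_id,
    EXTRACT(YEAR FROM birth_date) AS year_of_birth,
    0 AS race_concept_id,
    0 AS ethnicity_concept_id
FROM stg.patient;
```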
Pass 4) Clinical Event tables - to get the source data to the right CDM tables, we pass through several steps, and this is where you can start building your ETL process to scale appropriately. Depending on the complexity of the source data, there may be more or fewer of these steps, but I will attempt to break them down here.
Pass 4a) Accumulate source data per expected domain - write rules for each of the source tables (now in stage) that will populate a holding table; “temp” or “work” is a naming convention that can be used for these. For example, diagnosis data may be present in 2 or 3 different tables, and possibly among different fields (claims data may have a billing code, but you may also want to use data from a problem list field, so it depends). These separate rules should be written as separate ETL statements. The ETL gathers this data into the domain-level “temp” table. Repeat this for all of the CDM domains.
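As an illustration, accumulating condition-like records from two hypothetical sources into one domain-level holding table, with each rule as its own statement:

```sql
-- Sketch: gather condition-like records from multiple sources into one
-- domain-level holding table. Source table/column names are hypothetical.
CREATE TABLE work.temp_condition (
    person_source_id   VARCHAR(50),
    source_code        VARCHAR(50),
    source_vocabulary  VARCHAR(20),
    event_date         DATE,
    record_origin      VARCHAR(30)
);

-- Rule 1: billing diagnoses from claims
INSERT INTO work.temp_condition
SELECT person_source_id, diagnosis_code, 'ICD10CM', diagnosis_date, 'claim_billing'
FROM stg.diagnosis_clean;

-- Rule 2: coded entries from the problem list
INSERT INTO work.temp_condition
SELECT person_source_id, problem_code, 'SNOMED', recorded_date, 'problem_list'
FROM stg.problem_list;
```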
Pass 4b) Clean “work” tables - identify additional cleansing rules that need to be applied to the domain-level data; a naming convention would be “work_clean” or “temp_clean”.
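A domain-level cleansing step might be sketched like this (the rules shown are examples only):

```sql
-- Sketch: apply domain-level cleansing rules; the rules are illustrative.
CREATE TABLE work.temp_condition_clean AS
SELECT *
FROM work.temp_condition
WHERE event_date >= DATE '1900-01-01'   -- drop implausible dates
  AND source_code <> '';                -- drop empty codes
```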
Pass 4c) Join the cleansed tables to vocabulary - finally, join the tables to your vocabulary to populate concept_ids (for a final CDM, this would include custom mappings). These tables we can call “work_clean_mapped” and so forth.
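The mapping join follows the usual OMOP “Maps to” pattern against CONCEPT and CONCEPT_RELATIONSHIP; the holding table names are the hypothetical ones from the earlier sketches:

```sql
-- Sketch: attach standard concept IDs (and their domain) via the OMOP
-- vocabulary tables. Holding table and schema names are hypothetical.
CREATE TABLE work.temp_condition_clean_mapped AS
SELECT
    t.*,
    COALESCE(c2.concept_id, 0) AS target_concept_id,
    c2.domain_id               AS target_domain_id   -- stored now to avoid a later join
FROM work.temp_condition_clean t
LEFT JOIN vocab.concept c1
       ON c1.concept_code   = t.source_code
      AND c1.vocabulary_id  = t.source_vocabulary
LEFT JOIN vocab.concept_relationship cr
       ON cr.concept_id_1    = c1.concept_id
      AND cr.relationship_id = 'Maps to'
LEFT JOIN vocab.concept c2
       ON c2.concept_id       = cr.concept_id_2
      AND c2.standard_concept = 'S';
```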
Pass 4d) Populate CDM tables - from the “work_clean_mapped” tables, identify final domains and route records to the appropriate CDM clinical event tables. You can make this easier by storing the vocabulary domain in the “work_clean_mapped” tables so you can eliminate this costly join at this step.
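Routing by the stored domain could look like the sketch below; only a few CDM columns are shown, the ID generation is simplified, and the person_map crosswalk table is a hypothetical helper (32817 “EHR” is used as an example type concept):

```sql
-- Sketch: route mapped records to the CDM table that matches their domain.
-- Only a subset of CDM columns is shown; cdm.person_map is a hypothetical
-- crosswalk from source person IDs to CDM person_id.
INSERT INTO cdm.condition_occurrence
    (condition_occurrence_id, person_id, condition_concept_id,
     condition_start_date, condition_type_concept_id, condition_source_value)
SELECT
    ROW_NUMBER() OVER (ORDER BY m.event_date),
    p.person_id,
    m.target_concept_id,
    m.event_date,
    32817,            -- 'EHR' type concept, as an example
    m.source_code
FROM work.temp_condition_clean_mapped m
JOIN cdm.person_map p ON p.person_source_id = m.person_source_id
WHERE m.target_domain_id = 'Condition';

-- Records that mapped to another domain are routed with the same pattern,
-- e.g. rows with target_domain_id = 'Observation' go to cdm.observation.
```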
Pass 4e) Deduplication - when combining data from multiple source tables it is possible for records to overlap, so a deduplication step is needed to eliminate multiple records of the same event.
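One common way to deduplicate is ROW_NUMBER over the fields that define “the same event” (a sketch; which columns constitute a duplicate is an assumption about your data):

```sql
-- Sketch: keep a single row per person / code / event date.
-- The columns that define a duplicate are an assumption about your data.
CREATE TABLE work.temp_condition_dedup AS
SELECT *
FROM (
    SELECT m.*,
           ROW_NUMBER() OVER (
               PARTITION BY person_source_id, source_code, event_date
               ORDER BY record_origin   -- pick a preferred source deterministically
           ) AS rn
    FROM work.temp_condition_clean_mapped m
) x
WHERE rn = 1;
```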
Pass 5) Derived CDM tables - there are many shared scripts available from the community to use for this.
This process should be iterative with any effort on custom vocabulary mappings for the source data. I also suggest using a sample of the data, so it is more manageable while developing and testing your ETL code.
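One simple way to work on a sample is to restrict the staging pass to a fixed subset of persons, for example (a sketch; the ~1% modulus rule assumes a numeric person ID, and any stable person-level filter works just as well):

```sql
-- Sketch: stage a deterministic ~1% sample of persons for development/testing.
-- Assumes person_source_id is numeric; table names are hypothetical.
CREATE TABLE stg.person_sample AS
SELECT person_source_id
FROM source.patient
WHERE MOD(person_source_id, 100) = 0;

-- Downstream staging then joins to the sample, e.g.:
-- CREATE TABLE stg.diagnosis AS
-- SELECT d.* FROM source.diagnosis d
-- JOIN stg.person_sample s ON s.person_source_id = d.person_source_id;
```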