ETL Development using C# - Data Lookup Optimizations

msalmank · January 9, 2017, 2:35pm

Hi,

We are developing an ETL application using C# to parse data from a source data model into OMOPv5. Although the application has been optimized to work with large data, the one bottleneck that we’re facing is performing lookup of certain ID values based off of source value.

Specifically, the Visit Occurrence ID is being auto-generated by the ETL and the source value is stored in the source field. So when we’re parsing Conditions, Procedures, and Drug Exposures, for every record, we lookup the source value and get the auto-generated visit_occurrence_id against it from the visit_occurrence table.

This works fine, but since the ETL is making a roundtrip to the database for every record processed, the processing time of each record is increased. Since we’re working with millions of records, any time that can be saved can be a welcome change. So we’re looking at ways to optimize this.

Given that there can be millions of records in the visit_occurrence table, loading the visit_occurrence_id and visit_source_value into a dictionary or keyvaluepair in Memory can lead to potential system out of memory exceptions. We’re considering loading the records into a dictionary on the disk on initialization, and then reading from that to speed up the process, but that could preset its own problems.

Has anyone here encountered anything similar? What optimizations would you recommend?

Best Regards.