[Would a Python implementation be friendly to Redshift and Impala?]
Hi all,
I am trying to perform ETL and data quality checks (Achilles/DQD) on data from EHRs. I am not proficient in R or SQL, and I am wondering whether it is worthwhile to rewrite OHDSI's code from R to Python, especially if I want the Python code to scale easily to Redshift and Impala.
From what I understand, the current R packages mainly serve the purpose of automation: they connect to the database, run a series of SQL files, and feed the output to the web dashboard.
I also notice that Achilles can work with Impala, and the Data Quality Dashboard can work with Redshift. I guess this is because the SQL code is somewhat 'universal' (translated to each dialect by SqlRender, if I understand correctly).
In my current Python package, since we have less than 10 GB of data, I actually load the data from PostgreSQL into pandas DataFrames and implement all of the table manipulations on those DataFrames, so there is no SQL code in my package. A minimal sketch of what I mean is below.
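(The connection string, schema, and quality-check rule here are placeholders for illustration, not my actual code:)

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- adjust driver, host, and credentials.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/ehr")

# Pull a whole table into memory; workable while the data stays well under RAM.
person = pd.read_sql("SELECT * FROM cdm.person", engine)

# Illustrative pandas-side quality check: flag implausible birth years.
implausible = person[(person["year_of_birth"] < 1900) | (person["year_of_birth"] > 2025)]
print(f"{len(implausible)} rows with an implausible year_of_birth")
```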
PRO: (1) I suspect that Python and pandas can express more sophisticated mapping logic and quality checks than SQL. (2) I am more familiar with Python than with SQL.
CON: In the long term, will this approach scale when we have 'very big' data and have to connect to Redshift or Impala? (One chunked-read idea is sketched right after this.)
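Regarding that CON: one partial mitigation I am aware of (just a sketch, not a full answer to scaling) is pandas' chunked reading, so the whole table never has to fit in memory. The table and column names below are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/ehr")

# Stream the table in 100k-row chunks instead of loading it all at once.
total_rows = 0
bad_rows = 0
for chunk in pd.read_sql("SELECT * FROM cdm.measurement", engine, chunksize=100_000):
    total_rows += len(chunk)
    # Illustrative check: count measurements with a missing numeric value.
    bad_rows += chunk["value_as_number"].isna().sum()

print(f"{bad_rows} of {total_rows} rows have no numeric value")
```

Of course this only stretches a single machine further; it does not turn pandas into a distributed engine the way Redshift or Impala are.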
Another related question is how 'big' 'big data' really is, and when we would have to move from PostgreSQL to Redshift or Impala. Some say that PostgreSQL can store up to 1 TB of data.
It would be very helpful if someone could give us a rough limit, or some perspective based on your experience.
@jbadger3: You seem to be working on this issue in your repos, and I see that you are using SQLAlchemy. Is this library 'universal' across different database backends?
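From what I can tell (please correct me if I am wrong), SQLAlchemy standardizes connectivity behind a URL plus a dialect package, so the same Python code could target PostgreSQL, Redshift (via the sqlalchemy-redshift dialect), or Impala (via impyla) just by swapping the connection URL. A hedged sketch, with made-up hosts and credentials:

```python
from sqlalchemy import create_engine, text

# The same query code, parameterized only by the connection URL.
# The non-Postgres URLs assume their third-party dialect packages are installed:
#   Redshift: pip install sqlalchemy-redshift psycopg2
#   Impala:   pip install impyla
URLS = {
    "postgres": "postgresql+psycopg2://user:password@localhost:5432/ehr",
    "redshift": "redshift+psycopg2://user:password@cluster.example.com:5439/ehr",
    "impala": "impala://impala-host.example.com:21050/default",
}

def count_persons(url: str) -> int:
    engine = create_engine(url)
    with engine.connect() as conn:
        return conn.execute(text("SELECT COUNT(*) FROM person")).scalar()
```

The caveat, as I understand it, is that this only standardizes the connection and SQLAlchemy's own expression language; raw SQL strings still have to be valid in each backend's dialect, which sounds like the same problem SqlRender solves on the R side.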
Thanks a lot.
Cheers,
Hung