
Require suggestions on handling data efficiently

Hello Everyone,

I would like input from the data analysts here on how I can tackle the scenario below.

I have the following tables in my database, with record counts as shown below:

[image: record counts for each table]

I would like to join all these tables into one final/main table. The idea is that X1, X2, X3, …, Xn are the inputs and Y will be the output label. I know there will be a lot of NAs, but that's fine.

I tried the approaches below, but nothing helped:

  1. Currently we use a PostgreSQL database, and the join query runs for a long time without producing any output, even though the joining keys are indexed.

  2. I tried joining with Python pandas, but that didn't help either. It also keeps running with no output so far.

  3. I thought it would be better to summarize the measurement table to reduce the row count (at the cost of increasing the column count), roughly as sketched after this list. This doesn't help either: our summarized output would have more than 1664 columns, beyond PostgreSQL's limit.

  4. I also thought of parallel processing, but that doesn't help either. I tried with a 4-core and a 24-core CPU; it kept running but produced no output.
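To make point 3 concrete, this is roughly what I mean by summarizing the measurement table, as a pandas sketch (it assumes the measurements are already in a DataFrame named meas; the column names are the ones from the CDM measurement table):

import pandas as pd

def widen_measurements(meas: pd.DataFrame) -> pd.DataFrame:
    # One row per (person, visit), one column per measurement concept;
    # repeated measurements within a visit are averaged here just as an example.
    wide = meas.pivot_table(
        index=["person_id", "visit_occurrence_id"],
        columns="measurement_concept_id",
        values="value_as_number",
        aggfunc="mean",
    )
    # The column count equals the number of distinct measurement concepts,
    # which is what pushes the result past the 1664-column limit.
    print(wide.shape[1], "measurement columns")
    return wide.reset_index()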

None of the above approaches produced any output, at least not within an hour, but I would like to have this final table in about 5 minutes. Is that possible? Can PySpark help? I don't want to use Python Dask (which I often see compared with Spark), as it doesn't have the functions I need, such as pandas' unstack.
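In case it helps frame the question, this is roughly how I imagine the PySpark version would look (only a sketch; the JDBC URL, credentials, and output path are placeholders, and the remaining tables would follow the same pattern):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdm_join").getOrCreate()

def read_table(name):
    # Read one CDM table over JDBC; connection details are placeholders,
    # and the PostgreSQL JDBC driver is assumed to be on the Spark classpath.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://localhost:5432/cdm")
            .option("dbtable", "cdm." + name)
            .option("user", "user")
            .option("password", "password")
            .load())

visit = read_table("visit_occurrence")
measurement = read_table("measurement")

keys = ["person_id", "visit_occurrence_id"]
joined = visit.join(measurement, on=keys, how="left")  # same pattern for the other tables
joined.write.mode("overwrite").parquet("/tmp/cdm_joined")  # placeholder output path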

Any suggestions or an easier way to merge/join all these tables into one? If you think the measurement table is causing the issue, is sampling (stratifying) the data the only option I am left with?

The first thing to figure out is why you're not getting any output. Until you get output, it doesn't matter how long it takes. Have you tried using a subset of, say, 1000 records from each table? Are you using outer joins? Have you tried joining the first two tables, then adding one at a time, to see what effect that has on the output?
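For example, something along these lines is the kind of incremental test I mean (a rough pandas sketch; the connection string is a placeholder, and the join keys are taken from your schema):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/cdm")  # placeholder

def sample(table, n=1000):
    # Pull only a small slice of each table for the test.
    return pd.read_sql("select * from cdm." + table + " limit " + str(n), engine)

result = sample("visit_occurrence")
for table in ["condition_occurrence", "drug_exposure", "procedure_occurrence",
              "observation", "measurement"]:
    result = result.merge(sample(table),
                          on=["person_id", "visit_occurrence_id"],
                          how="left", suffixes=("", "_" + table))
    print(table, result.shape)  # watch where the row count blows up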

Hi @roger.carlson,

Thanks for the response. Yes, when I tried joining the tables with small record counts, it worked fine.

Actually, when I summarize my measurement data, the number of columns explodes beyond what PostgreSQL accepts, so the summarization can't be done in PostgreSQL.

But joining the full measurement table also takes a long time. I guess that's expected, considering the record count.

Actually, once it runs beyond 1 hour it isn't useful for me anyway, so I abort it.

Hence, I was looking for ways to make this more efficient.

Thanks
Selva

In addition, except for the measurement table, I am able to get the output for all the other tables within 30 seconds. Here is my query. As you can see, measurement is the last table I join; once I add it, the query runs for a long time.

DROP MATERIALIZED VIEW IF EXISTS cdm.dummy CASCADE;
CREATE MATERIALIZED VIEW cdm.dummy as
select
  f.person_id, f.gender_concept_id, f.year_of_birth, f.race_concept_id, f.ethnicity_concept_id,
  a.visit_occurrence_id, a.visit_concept_id, a.visit_start_datetime, a.visit_end_datetime,
  a.admitting_source_concept_id, a.discharge_to_concept_id,
  b.condition_concept_id, b.condition_start_datetime, b.condition_end_datetime, b.condition_type_concept_id,
  c.measurement_concept_id, c.measurement_datetime, c.value_as_number, c.value_as_concept_id, c.unit_concept_id,
  d.drug_concept_id, d.drug_exposure_start_datetime, d.drug_exposure_end_datetime, d.drug_type_concept_id, d.quantity,
  e.procedure_concept_id, e.procedure_datetime, e.procedure_type_concept_id,
  g.observation_concept_id, g.observation_datetime,
  g.value_as_number as "observation_value_as_num", g.value_as_string as "observation_value_as_string",
  h.observation_period_start_date, h.observation_period_end_date,
  i.death_datetime, i.cause_concept_id
from cdm.visit_occurrence a
left outer join cdm.condition_occurrence b
  on a.person_id = b.person_id and a.visit_occurrence_id = b.visit_occurrence_id
left outer join cdm.drug_exposure d
  on a.person_id = d.person_id and a.visit_occurrence_id = d.visit_occurrence_id
left outer join cdm.procedure_occurrence e
  on a.person_id = e.person_id and a.visit_occurrence_id = e.visit_occurrence_id
left outer join cdm.person f
  on a.person_id = f.person_id
left outer join cdm.observation g
  on a.person_id = g.person_id and a.visit_occurrence_id = g.visit_occurrence_id
left outer join cdm.observation_period h
  on a.person_id = h.person_id
left outer join cdm.death i
  on a.person_id = i.person_id
left outer join cdm.measurement c
  on a.person_id = c.person_id and a.visit_occurrence_id = c.visit_occurrence_id;

OK. Makes sense. Will it work if you limit it to just 1 person? If so, it’s likely a matter of size.
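For instance, something like this would tell you quickly (again just a sketch; the connection string and person_id are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/cdm")  # placeholder

# Your visit-to-measurement join, restricted to a single person via a WHERE
# clause; 12345 is a placeholder person_id.
query = """
select a.visit_occurrence_id, c.measurement_concept_id, c.value_as_number
from cdm.visit_occurrence a
left outer join cdm.measurement c
  on a.person_id = c.person_id and a.visit_occurrence_id = c.visit_occurrence_id
where a.person_id = 12345
"""
print(pd.read_sql(query, engine).shape)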

My next question is: what exactly do you intend to do with the result set? Creating a table or .csv out of it would be horrendous and hard to work with for analysis. Wouldn't it just be easier to create the query with the filters you need first?

@roger.carlson - Thanks for the response. I would like to use these measurements as features for analysis. Yes, handling null values will be the next step, but first I have to get this table out.
