
MIMIC2 (and possibly MIMIC3) data in OHDSI CDM format

We would like to announce another dataset available in OHDSI CDM v5 format.
All SQL coding credit goes to Bayzid Shamsuzzoha.

At the URL below

you can find CSV files for the “MIMIC2dead” dataset. This dataset is a public subset of MIMIC data (deceased patients), taken from this original source: http://physionet.org/mimic2/demo/mimic2dead.sql.gz. A more detailed description of that data is here: http://physionet.org/mimic2/demo/#installing-into-mysql-from-mysql-dump.

The transformation is a work in progress. For example, we still need to fully map source codes to CDM Vocabulary concept_ids. We welcome comments (on the forum here or on GitHub) from possible future users (or contributors to the SQL). The SQL code has comments for some of the challenges we found.
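To give a sense of what that mapping step involves, here is a minimal sketch (not part of the ETL itself; it assumes the standard OMOP vocabulary tables are loaded in the same schema, and the ICD-9-CM code shown is only an example) of how a source code is translated to a standard concept_id via the ‘Maps to’ relationship:

-- Sketch only: map one ICD-9-CM source code to its standard concept_id.
-- Assumes CDM v5 vocabulary tables (concept, concept_relationship) are available.
SELECT src.concept_code  AS source_code,
       src.concept_id    AS source_concept_id,
       std.concept_id    AS standard_concept_id,
       std.concept_name  AS standard_concept_name
FROM concept src
JOIN concept_relationship cr
  ON cr.concept_id_1 = src.concept_id
 AND cr.relationship_id = 'Maps to'
 AND cr.invalid_reason IS NULL
JOIN concept std
  ON std.concept_id = cr.concept_id_2
WHERE src.vocabulary_id = 'ICD9CM'
  AND src.concept_code = '428.0';   -- example: congestive heart failure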

The MIMIC team recently also released the MIMIC3 dataset. The same ETL code should also work on MIMIC3 (to be confirmed).


Thanks Vojtech for posting this. I would like to introduce myself, and provide more details on the progress of the MIMIC conversion.

I am Md Shamsuzzoha Bayzid. I am doing my PhD at the University of Texas at Austin under the supervision of Prof. Joydeep Ghosh. Currently I am working with Vojtech to convert the MIMIC dataset to CDM v5 format. I would like to provide some details on the progress of our work. Please see the link below to find out what has been done so far.

As Vojtech has already mentioned, this implementation is under active development, and we would appreciate your suggestions to complete and improve the ETL.

@MS_Bayzid:

Let me know if you need help with the mapping of the ids.

@Christian_Reich

Thank you very much. I will contact you if I need any help.

Hi @Vojtech_Huser and @MS_Bayzid,

I am working with a collaborator who is interested in helping with the work being done on the ETL for the MIMIC-II/III datasets.

I have taken a look at the repository link above and we could potentially contribute there. I just wanted to make sure that we wouldn’t be duplicating work.

Hi everybody.

I’ve just started working with Dr. Perotte at CUMC. Please let us know about the progress.

Adding my name to the list of folks interested in being informed when MIMIC-III transforms are available. Saw the poster on MIMIC-II @ the OHDSI Symposium. Thanks for the contribution.
Michael Kahn (Michael.Kahn@ucdenver.edu)

Update on this at this link:

Could I ask how the conversion of MIMIC3 data to CDM format is progressing, and whether there is any code I can use? I want to test my models on MIMIC3 data, but they need the CDM format.

It seems that a private repo for MIMIC-OMOP ETL was recently made public.

see interim code here:

@parisni can probably clarify best…

We didn’t use this ETL but @Richard_Starr has done a MIMIC 3 conversion which we have here at GT. Let us know if interested.

If you like this proposal, please comment on the MIT GitHub.


I’m in the process of running the MIMIC-III ETL using this project. I lost my session in the middle of the actual ETL step, but I think it ran to completion. There’s a post-ETL test you can run to check your results, but I can’t tell whether the issues it reports are critical or not:

[dteuser@dtk8scentos02 mimic-omop]$ psql "$MIMIC" -f "etl/check_etl.sql"
psql:etl/check_etl.sql:1: unrecognized value “” for “ECHO”
Available values are: none, errors, queries, all.
Output format is unaligned.
Pager usage is off.
psql:etl/check_etl.sql:15: etl/StandardizedVocabularies/CONCEPT/check_etl.sql: No such file or directory
psql:etl/check_etl.sql:16: etl/StandardizedVocabularies/COHORT_DEFINITION/check_etl.sql: No such file or directory
psql:etl/check_etl.sql:17: etl/StandardizedVocabularies/ATTRIBUTE_DEFINITION/check_etl.sql: No such file or directory
psql:etl/check_etl.sql:18: etl/StandardizedDerivedElements/COHORT_ATTRIBUTE/check_etl.sql: No such file or directory
psql:etl/check_etl.sql:19: etl/StandardizedHealthSystemDataTables/CARE_SITE/check_etl.sql: No such file or directory
1…4
ok 1 - PERSON – number patients match
ok 2 - PERSON – gender distribution matches
ok 3 - PERSON – date of birth year distributions match
ok 4 - PERSON – no births after deaths
1…1
not ok 1 - number of unique patients who die in the database

Failed test 1: “number of unique patients who die in the database”

Results differ beginning at row 1:

have: (15759)

want: (14849)

Looks like you failed 1 test of 1

1…7
ok 1 - Visit_occurrence table – same number admission
ok 2 - number of hospital admissions who die in-hospital match
ok 3 - Visit_occurrence table – same distribution adm
ok 4 - Visit_occurrence table – distribution admit source value
ok 5 - Visit_occurrence table – repartition discharge_to_source_value
ok 6 - Visit_occurrence table – links checker
ok 7 - Visit_occurrence table – start_date > end_date
1…1
ok 1 - OBSERVATION_PERIOD – row count matches visit_occurrence
1…7
not ok 1 - VISIT_DETAIL – test same number transfers

Failed test 1: “VISIT_DETAIL – test same number transfers”

Results differ beginning at row 1:

have: NULL

want: (184142,19)

ok 2 - VISIT_DETAIL – test visit_source_value and visit_source_concept_id match
not ok 3 - VISIT_DETAIL – test admitting_source_concept_id and admitting_source_value match

Failed test 3: “VISIT_DETAIL – test admitting_source_concept_id and admitting_source_value match”

Results differ beginning at row 1:

have: (0)

want: (139516)

not ok 4 - VISIT_DETAIL – test patients number in visit_detail/icustays

Failed test 4: “VISIT_DETAIL – test patients number in visit_detail/icustays”

Results differ beginning at row 1:

have: (0,0)

want: (46476,57786)

not ok 5 - VISIT_DETAIL – check start_datetime < end_datetime

Failed test 5: “VISIT_DETAIL – check start_datetime < end_datetime”

Number of columns or their types differ between the queries:

have: (161)

want: (0)

not ok 6 - VISIT_DETAIL – check start_date < end_date

Failed test 6: “VISIT_DETAIL – check start_date < end_date”

Number of columns or their types differ between the queries:

have: (35)

want: (0)

ok 7 - VISIT_DETAIL – check care site is never null

Looks like you failed 5 tests of 7

1…7
not ok 1 - MEASUREMENT – check distribution of all labs match

Failed test 1: “MEASUREMENT – check distribution of all labs match”

Results differ beginning at row 168:

have: (33,3,49,219)

want: (33,3,50,219)

psql:etl/StandardizedClinicalDataTables/MEASUREMENT/check_etl.sql:67: ERROR: syntax error at or near “AS”
LINE 7: AS org_name
^
QUERY:
SELECT org_name::TEXT, count(1)
FROM
(
SELECT DISTINCT ON
(hadm_id, spec_type_desc, org_name, coalesce(charttime, chartdate))
AS org_name
FROM microbiologyevents
WHERE org_name IS NOT NULL
) tmp
GROUP BY org_name ORDER BY 2, 1 desc;

CONTEXT: PL/pgSQL function results_eq(text,text,text) line 7 at OPEN
ok 2 - MEASUREMENT – there is source concept in measurement not described
ok 3 - MEASUREMENT – check for duplicate primary keys
ok 4 - MEASUREMENT – standard concept checker
not ok 5 - MEASUREMENT – check row counts match

Failed test 5: “MEASUREMENT – check row counts match”

Results differ beginning at row 1:

have: (768,3097)

want: (1,6330)

ok 6 - Measurement pass, w00t!

Looks like you planned 7 tests but ran 6

1…5
ok 1 - PROCEDURE_OCCURRENCE – check all procedureevents_mv rows inserted
ok 2 - PROCEDURE_OCCURRENCE – check label is consistent with source_value
ok 3 - PROCEDURE_OCCURRENCE – check all CPT code rows inserted
ok 4 - PROCEDURE_OCCURRENCE – check CPT subsections mapped correctly
ok 5 - PROCEDURE_OCCURRENCE – check ICD procedure rows inserted
1…1
ok 1 - PROVIDER – check caregivers/providers match
1…8
ok 1 - CONDITION OCCURRENCE – check ICD diagnoses row count matches
ok 2 - CONDITION OCCURRENCE – diagnosis in admission same
ok 3 - CONDITION OCCURRENCE – distrib diagnosis the same
ok 4 - CONDITION OCCURRENCE – there is source concept in measurement not described
ok 5 - CONDITION OCCURRENCE – primary key checker
ok 6 - CONDITION OCCURRENCE – standard concept checker
not ok 7 - CONDITION OCCURRENCE – start_datetime should be > end_datetime

Failed test 7: “CONDITION OCCURRENCE – start_datetime should be > end_datetime”

Results differ beginning at row 1:

have: (585)

want: (0)

ok 8 - CONDITION OCCURRENCE – start_date should be > end_date

Looks like you failed 1 test of 8

1…10
not ok 1 - OBSERVATION – religion distribution matches (concept 4052017)

Failed test 1: “OBSERVATION – religion distribution matches (concept 4052017)”

Results differ beginning at row 15:

have: (OTHER,2696)

want: (JEWISH,5314)

ok 2 - OBSERVATION – language distribution matches (concept 40758030)
ok 3 - OBSERVATION – marital distribution matches (concept 40766231)
ok 4 - OBSERVATION – insurance distribution matches (concept 46235654)
ok 5 - OBSERVATION – ethnicity distribution matches (concept 44803968)
ok 6 - OBSERVATION – source concept described
ok 7 - OBSERVATION – primary key is always unique
not ok 8 - OBSERVATION – datetimeevents number

Failed test 8: “OBSERVATION – datetimeevents number”

Results differ beginning at row 1:

have: (0)

want: (4485342)

not ok 9 - OBSERVATION – Standard concept checker

Failed test 9: “OBSERVATION – Standard concept checker”

Results differ beginning at row 1:

have: (79027)

want: (0)

Looks like you planned 10 tests but ran 9

1…4
ok 1 - DRUG_EXPOSURE – check number of patients with prescription matches
not ok 2 - DRUG_EXPOSURE – check drug_source_value matches source

Failed test 2: “DRUG_EXPOSURE – check drug_source_value matches source”

Results differ beginning at row 1:

have: (76329301205,1)

want: ("zz ",1)

ok 3 - DRUG_EXPOSURE – is concept source id full filled
ok 4 - DRUG_EXPOSURE – Standard concept checker

Looks like you failed 1 test of 4

1…2
ok 1 - visit occurrence nb
ok 2 - check radio nb
1…3
ok 1 - primary key checker
ok 2 - source concept described
ok 3 - source concept described
ok 4 - Note Nlp pass, w00t!

Looks like you planned 3 tests but ran 4

[dteuser@dtk8scentos02 mimic-omop]$

I’ve loaded the MIMIC III data into OMOP using MIT-LCP/mimic-omop.

Any idea how long it might take to run the Achilles R job on the full MIMIC III dataset? It’s been stuck at 99% for hours now. The pgAdmin dashboard indicates a fair bit of activity.

> connectionDetails <- createConnectionDetails(dbms="postgresql", user = "dteuser", password = "XXXXX", server = "DTK8SCENTOS02.qa.navinet.local/mimic")
> save.image("~/mimic_achilles.RData")
> achilles(connectionDetails, cdmDatabaseSchema="omop", sourceName="MIMICIII CDM V5 Database", cdmVersion = "5.3.0", runHeel = FALSE, conceptHierarchy = FALSE, numThreads = 10)
Connecting using PostgreSQL driver
Connecting using PostgreSQL driver
|===================================================================================| 100%
Executing SQL took 0.147 secs
Dropping scratch Achilles tables from schema omop
Connecting using PostgreSQL driver
|===================================================================================| 100%
|===================================================================================| 100%
Temporary Achilles tables removed from schema omop
Executing multiple queries. This could take a while
|===================================================================================| 99%

In my MIMIC-OMOP pgTAP results, it reported much the same as yours. In my opinion, this is probably caused by the MIMIC 3 conversion ETL queries provided in the MIT GitHub repo; the reported issues are not your fault.
The Achilles R job takes a very long time on the MIMIC III dataset. It took me about a week to execute the Achilles queries on a RAID SAS drive server. You can check that the Achilles R process is still running properly via the pg_stat_activity view in PostgreSQL, or by watching CPU/RAM usage.
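For example, a rough check from psql (a sketch, assuming you are connected to the same PostgreSQL instance; the datname filter 'mimic' is just an example and should match your database):

-- List non-idle backends to see whether Achilles queries are still running.
SELECT pid,
       state,
       now() - query_start AS running_for,
       left(query, 80)     AS query_snippet
FROM pg_stat_activity
WHERE datname = 'mimic'
  AND state <> 'idle'
ORDER BY query_start;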

Indexes are very important as well. Before starting any processing, apply the recommended indexes to your data and update the statistics to get the best query performance.
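As a rough sketch of what that looks like in PostgreSQL (the index below is only an illustrative example, not the official OMOP CDM index DDL, which ships with the CDM scripts):

-- Illustrative only: one example index on a large CDM table, then refresh
-- planner statistics for that table (plain ANALYZE covers the whole database).
CREATE INDEX IF NOT EXISTS idx_measurement_person_id
    ON omop.measurement (person_id);

ANALYZE omop.measurement;   -- or simply: ANALYZE;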

Here’s what my dashboard looks like. It appears as if it’s still running, but I lost my R session. I was using RStudio web; in the future I’ll use the R shell with screen so I don’t get disconnected. I did create the indexes. Can I update the stats while it’s running, or will that make things worse?

Updating statistics does not seem to affect database performance as long as you don’t run it too frequently. Also, be prepared to wait a long time even with the indexes created.

The R process seems to have terminated, so I thought it completed. But I started up Atlas and I can’t see any of the data source reports. Does that indicate that Achilles failed? Where would I look to check? I guess I can generate the SQL and run it in smaller chunks?
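One rough way to check (a sketch, assuming Achilles wrote its results into the omop schema as in the run above) is to see whether the achilles_results table was populated at all:

-- Sketch: count how many Achilles analyses produced rows.
SELECT count(DISTINCT analysis_id) AS analyses_with_results,
       count(*)                    AS total_rows
FROM omop.achilles_results;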

I’m getting a PostgreSQL error from the Achilles job. I’m using Postgres 10; is that a problem? I’ll attach the error report. I don’t see what’s wrong, but I’m not a Postgres expert.

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near “)”
Position: 377

SQL:
CREATE TEMP TABLE tempResults

AS
SELECT
*
FROM
(
select
t2.count_value*100.0/all_count as statistic_value,
CAST('Meas:NoNumValue:Percentage' AS VARCHAR(100)) as measure_id
from
(
select sum(count_value) as all_count from omop.achilles_results where analysis_id = 1820
) t1 join (select count_value from omop.achilles_results where analysis_id = 1821) t2
) t3

R version:
R version 3.5.1 (2018-07-02)

Platform:
x86_64-redhat-linux-gnu

Attached base packages:

  • stats
  • graphics
  • grDevices
  • utils
  • datasets
  • methods
  • base

Other attached packages:

  • Achilles (1.6.3)
  • DatabaseConnector (2.2.1)
  • SqlRender (1.5.3)
  • usethis (1.4.0)
  • devtools (2.0.1)
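For what it’s worth, the part of the generated SQL that PostgreSQL trips over appears to be the inner JOIN with no ON or USING clause, which PostgreSQL rejects regardless of version with exactly this kind of “syntax error at or near ')'”. A sketch of the same fragment with the join made explicit, only to show where the parser fails (not a patched Achilles query):

-- Sketch only: same shape as the failing fragment, with an explicit CROSS JOIN.
-- t1 is a one-row aggregate, so the cross join reproduces the intended pairing.
SELECT t2.count_value * 100.0 / t1.all_count AS statistic_value,
       CAST('Meas:NoNumValue:Percentage' AS VARCHAR(100)) AS measure_id
FROM (SELECT sum(count_value) AS all_count
      FROM omop.achilles_results WHERE analysis_id = 1820) t1
CROSS JOIN (SELECT count_value
            FROM omop.achilles_results WHERE analysis_id = 1821) t2;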