# MIMIC IV to OMOP CDM Conversion #

The project implements an ETL conversion of the MIMIC IV PhysioNet dataset to the OMOP CDM format.

* Version 1.0

### Concepts / Philosophy ###

The ETL is based on five steps:
* Create a snapshot of the source data. The snapshot data is stored in staging source tables with the prefix "src_".
* Clean the source data: filter out the rows that are not to be used, format values, and apply some business rules. Create intermediate tables with the prefix "lk_" and the suffix "clean".
* Map distinct source codes to concepts in the vocabulary tables. Create intermediate tables with the prefix "lk_" and the suffix "concept".
  * Custom mapping is implemented with custom concepts generated in the vocabulary tables beforehand.
* Join the cleaned data and the mapped codes. Create intermediate tables with the prefix "lk_" and the suffix "mapped" (a minimal sketch of the clean / concept / mapped pattern follows the list).
* Distribute the mapped data across the target CDM tables according to their target_domain_id values.
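
Below is a hypothetical sketch of the clean / concept / mapped pattern for one source table. BigQuery-style SQL is assumed, and the dataset, table, and field names are illustrative only, not the actual ETL scripts:

```sql
-- Step "clean": keep only the usable source rows (names are illustrative).
CREATE OR REPLACE TABLE etl_dataset.lk_diagnoses_icd_clean AS
SELECT subject_id, hadm_id, icd_code, icd_version
FROM etl_dataset.src_diagnoses_icd
WHERE icd_code IS NOT NULL;

-- Step "concept": map the distinct source codes to standard concepts.
CREATE OR REPLACE TABLE etl_dataset.lk_diagnoses_icd_concept AS
SELECT
    src.icd_code,
    src.icd_version,
    vc.concept_id       AS source_concept_id,
    vc_std.concept_id   AS target_concept_id,
    vc_std.domain_id    AS target_domain_id
FROM (SELECT DISTINCT icd_code, icd_version FROM etl_dataset.lk_diagnoses_icd_clean) src
LEFT JOIN voc_dataset.concept vc
    ON vc.concept_code = src.icd_code
    AND vc.vocabulary_id = IF(src.icd_version = 9, 'ICD9CM', 'ICD10CM')
LEFT JOIN voc_dataset.concept_relationship cr
    ON cr.concept_id_1 = vc.concept_id AND cr.relationship_id = 'Maps to'
LEFT JOIN voc_dataset.concept vc_std
    ON vc_std.concept_id = cr.concept_id_2;

-- Step "mapped": join the cleaned rows with their mapped concepts.
CREATE OR REPLACE TABLE etl_dataset.lk_diagnoses_icd_mapped AS
SELECT cln.*, cpt.source_concept_id, cpt.target_concept_id, cpt.target_domain_id
FROM etl_dataset.lk_diagnoses_icd_clean cln
LEFT JOIN etl_dataset.lk_diagnoses_icd_concept cpt
    ON cpt.icd_code = cln.icd_code AND cpt.icd_version = cln.icd_version;
```
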
Intermediate and staging CDM tables have additional working fields such as unit_id. The unit_id value is composed during the ETL steps and reads from right to left: the source table name, an abbreviation of the initial target table, and the final target table name or its abbreviation. For example, unit_id = 'drug.cond.diagnoses_icd' means that the rows with this unit_id belong to the Drug_exposure table, were initially prepared for the Condition_occurrence table, and originate from the source table diagnoses_icd.
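
A hypothetical illustration of how such a unit_id could be assembled while distributing the mapped rows (the names are illustrative, not the actual ETL code):

```sql
-- Rows initially prepared for condition_occurrence ('cond') that end up in
-- drug_exposure because their mapped concept belongs to the Drug domain.
SELECT
    CONCAT('drug.', 'cond.', 'diagnoses_icd') AS unit_id,  -- 'drug.cond.diagnoses_icd'
    src.*
FROM etl_dataset.lk_diagnoses_icd_mapped src
WHERE src.target_domain_id = 'Drug';
```
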
Vocabularies are kept in a separate dataset and are copied as part of the snapshot data too.
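
A small hypothetical example of taking the vocabulary snapshot (the dataset names are placeholders used for illustration only):

```sql
-- Copy a vocabulary table from the separate vocabulary dataset into the ETL snapshot.
CREATE OR REPLACE TABLE etl_dataset.voc_concept AS
SELECT * FROM vocabulary_dataset.concept;
```
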

### How to run the conversion ###

* The ETL process encapsulates the following workflows: ddl, vocabulary_refresh, staging, etl, ut, unload.
* The unload workflow creates the final OMOP CDM dataset, which can be analysed with OHDSI tools such as Atlas or DQD.

* How to run the ETL end-to-end:
  * update the config files accordingly
  * perform the vocabulary_refresh steps if needed (see vocabulary_refresh/README.md)
  * set the project root (the location of this file) as the current directory

  ```sh
  cd vocabulary_refresh
  python vocabulary_refresh.py -s10
  python vocabulary_refresh.py -s20
  python vocabulary_refresh.py -s30
  cd ../
  python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ddl.conf
  python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_staging.conf
  python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf
  python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ut.conf
  python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_metrics.conf
  ```
* How to look at the UT and Metrics reports:
  * see the metrics dataset name in the corresponding .etlconf file

  ```sql
  -- UT report
  SELECT report_starttime, table_id, test_type, field_name
  FROM metrics_dataset.report_unit_test
  WHERE NOT test_passed
  ;
  -- Metrics - row count
  SELECT * FROM metrics_dataset.me_total ORDER BY table_name;
  -- Metrics - person and visit summary
  SELECT
      category, name, count AS row_count
  FROM metrics_dataset.me_persons_visits ORDER BY category, name;
  -- Metrics - mapping rates
  SELECT
      table_name, concept_field,
      count AS rows_mapped,
      percent AS percent_mapped,
      total AS rows_total
  FROM metrics_dataset.me_mapping_rate
  ORDER BY table_name, concept_field
  ;
  -- Metrics - top 100 mapped and unmapped
  SELECT
      table_name, concept_field, category, source_value, concept_id, concept_name,
      count AS row_count,
      percent AS rows_percent
  FROM metrics_dataset.me_tops_together
  ORDER BY table_name, concept_field, category, count DESC;
  ```

* More options to run parts of the ETL:
  * Run a workflow:
    * with local variables: `python scripts/run_workflow.py -c conf/workflow_etl.conf`
      * copy the "variables" section from file.etlconf
    * with global variables: `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf`
  * Run explicitly named scripts (space delimited):
    `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf etl/etl/cdm_drug_era.sql`
  * Run in background:
    `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf > ../out_full_etl.out &`
  * Continue after an error:
    `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf etl/etl/cdm_observation.sql etl/etl/cdm_observation_period.sql etl/etl/cdm_fact_relationship.sql etl/etl/cdm_condition_era.sql etl/etl/cdm_drug_era.sql etl/etl/cdm_dose_era.sql etl/etl/cdm_cdm_source.sql >> ../out_full_etl.out &`


### Change Log (latest first) ###

**2021-02-08**

* Set version v.1.0

* Drug_exposure table
  * pharmacy.medication replaces particular values of prescription.drug
  * the source value format is changed to COALESCE(pharmacy.medication.selected, prescription.drug) || prescription.prod_strength
* Labevents mapping is replaced with a new, reviewed version
  * vocabulary affected: mimiciv_meas_lab_loinc
  * lk_meas_labevents_clean and lk_meas_labevents_mapped are changed accordingly
* Unload for Atlas
  * the technical fields unit_id, load_row_id, load_table_id, trace_id are removed from the tables intended for Atlas
* Delivery export script
  * tables are exported to a single directory, one file per table; if a table is too large, it is exported to multiple files
* Bugfixes and cleanup
* Real environment names are replaced with placeholders

**2021-02-01**

* Waveform POC-2 is created for 4 MIMIC III Waveform files uploaded to the bucket
  * iterate through the folder tree, capture metadata, and load the CSV files
* Bugfixes