Commit e4d55f1

prepare to create a v1.0
1 parent 4d954bf commit e4d55f1

42 files changed: +2258, -2852 lines

README.md

Lines changed: 101 additions & 21 deletions
@@ -1,39 +1,119 @@
# MIMIC IV to OMOP CDM Conversion #

- ### What is this repository for? ###

- * Quick summary
- * Version
- * [Learn Markdown](https://bitbucket.org/tutorials/markdowndemo)
+ The project implements an ETL conversion of the MIMIC IV PhysioNet dataset to the OMOP CDM format.

- ### Who do I talk to? ###
+ * Version 1.0

- * Repo owner or admin
- * Other community or team contact

- ### How to run the conversion ###
+ ### Concepts / Philosophy ###

- * Workflows: ddl, vocabulary_refresh, staging, etl, ut, qa, unload
- * It is supposed that the project root (location of this file) is current directory.
+ The ETL is based on five steps:
+ * Create a snapshot of the source data. The snapshot data is stored in staging source tables with the prefix "src_".
+ * Clean the source data: filter out rows that are not to be used, format values, and apply some business rules. Create intermediate tables with the prefix "lk_" and the postfix "clean".
+ * Map distinct source codes to concepts in the vocabulary tables. Create intermediate tables with the prefix "lk_" and the postfix "concept".
+     * Custom mapping is implemented through custom concepts generated in the vocabulary tables beforehand.
+ * Join the cleaned data and the mapped codes. Create intermediate tables with the prefix "lk_" and the postfix "mapped".
+ * Distribute the mapped data to the target CDM tables according to the target_domain_id values (the whole chain is sketched below).
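As an illustration only, a minimal sketch of this chain for a single source table (table, column, and dataset names are simplified stand-ins, not the project's actual scripts):

```sql
-- Step 2: clean the snapshot table
CREATE TABLE etl_dataset.lk_diagnoses_clean AS
SELECT subject_id, hadm_id, icd_code, icd_version
FROM etl_dataset.src_diagnoses_icd
WHERE icd_code IS NOT NULL;

-- Step 3: map distinct source codes to target concepts
CREATE TABLE etl_dataset.lk_diagnoses_concept AS
SELECT DISTINCT
    cln.icd_code,
    cln.icd_version,
    vc2.concept_id AS target_concept_id,
    vc2.domain_id  AS target_domain_id
FROM etl_dataset.lk_diagnoses_clean cln
JOIN voc_dataset.concept vc
    ON vc.concept_code = cln.icd_code
   AND vc.vocabulary_id = IF(cln.icd_version = 9, 'ICD9CM', 'ICD10CM')
JOIN voc_dataset.concept_relationship vcr
    ON vcr.concept_id_1 = vc.concept_id
   AND vcr.relationship_id = 'Maps to'
JOIN voc_dataset.concept vc2
    ON vc2.concept_id = vcr.concept_id_2;

-- Step 4: join cleaned rows and mapped codes
CREATE TABLE etl_dataset.lk_diagnoses_mapped AS
SELECT cln.*, cpt.target_concept_id, cpt.target_domain_id
FROM etl_dataset.lk_diagnoses_clean cln
LEFT JOIN etl_dataset.lk_diagnoses_concept cpt
    ON cpt.icd_code = cln.icd_code
   AND cpt.icd_version = cln.icd_version;

-- Step 5: distribute rows to target CDM tables by domain
INSERT INTO etl_dataset.cdm_condition_occurrence
    (person_id, condition_concept_id, condition_source_value)
SELECT subject_id, target_concept_id, icd_code
FROM etl_dataset.lk_diagnoses_mapped
WHERE target_domain_id = 'Condition';
```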

- * Run a workflow:
- * with local variables: `python scripts/run_workflow.py -c conf/workflow_etl.conf`
- * copy "variables" section from file.etlconf
- * with global variables: `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf`
- * Run explicitly named scripts (space delimited):
- `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf etl/etl/cdm_drug_era.sql`
- * Run in background:
- `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf > ../out_full_etl.out &`
- * Continue after an error:
- `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf etl/etl/cdm_observation.sql etl/etl/cdm_observation_period.sql etl/etl/cdm_fact_relationship.sql etl/etl/cdm_condition_era.sql etl/etl/cdm_drug_era.sql etl/etl/cdm_dose_era.sql etl/etl/cdm_cdm_source.sql >> ../out_full_etl.out &`
+ Intermediate and staging CDM tables carry additional working fields such as unit_id. The unit_id field is composed during the ETL steps; read from right to left, it gives the source table name, the initial target table abbreviation, and the final target table name or abbreviation. For example, unit_id = 'drug.cond.diagnoses_icd' means that the rows in this unit belong to the Drug_exposure table, were initially prepared for the Condition_occurrence table, and originate from the source table diagnoses_icd.

+ Vocabularies are kept in a separate dataset, and are copied as a part of the snapshot data too.
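A hypothetical sketch of how such a unit_id might be composed (illustrative names, not the actual ETL code):

```sql
-- Rows initially prepared for Condition_occurrence ('cond') that end up
-- in Drug_exposure ('drug'), originating from source table diagnoses_icd.
SELECT
    'drug.cond.diagnoses_icd' AS unit_id,  -- final.initial.source
    src.*
FROM etl_dataset.lk_diagnoses_mapped src
WHERE src.target_domain_id = 'Drug';
```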
+ ### How to run the conversion ###

+ * The ETL process encapsulates the following workflows: ddl, vocabulary_refresh, staging, etl, ut, unload.
+ * The unload workflow results in creating a final OMOP CDM dataset, which can be analysed with OHDSI tools such as Atlas or DQD.

+ * How to run the ETL end-to-end:
+     * update the config files accordingly
+     * perform the vocabulary_refresh steps if needed (see vocabulary_refresh/README.md)
+     * set the project root (location of this file) as the current directory

+ ```
+ cd vocabulary_refresh
+ python vocabulary_refresh.py -s10
+ python vocabulary_refresh.py -s20
+ python vocabulary_refresh.py -s30
+ cd ../
+ python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ddl.conf
+ python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_staging.conf
+ python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf
+ python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ut.conf
+ python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_metrics.conf
+ ```
+ * How to look at the UT and Metrics reports:
+     * see the metrics dataset name in the corresponding .etlconf file

+ ```
+ -- UT report
+ SELECT report_starttime, table_id, test_type, field_name
+ FROM metrics_dataset.report_unit_test
+ WHERE NOT test_passed;
+
+ -- Metrics - row count
+ SELECT * FROM metrics_dataset.me_total ORDER BY table_name;
+
+ -- Metrics - person and visit summary
+ SELECT category, name, count AS row_count
+ FROM metrics_dataset.me_persons_visits
+ ORDER BY category, name;
+
+ -- Metrics - mapping rates
+ SELECT
+     table_name, concept_field,
+     count AS rows_mapped,
+     percent AS percent_mapped,
+     total AS rows_total
+ FROM metrics_dataset.me_mapping_rate
+ ORDER BY table_name, concept_field;
+
+ -- Metrics - top 100 mapped and unmapped
+ SELECT
+     table_name, concept_field, category, source_value, concept_id, concept_name,
+     count AS row_count,
+     percent AS rows_percent
+ FROM metrics_dataset.me_tops_together
+ ORDER BY table_name, concept_field, category, count DESC;
+ ```
+ * More options to run parts of the ETL:
+     * Run a workflow:
+         * with local variables: `python scripts/run_workflow.py -c conf/workflow_etl.conf`
+             * copy the "variables" section from file.etlconf
+         * with global variables: `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf`
+     * Run explicitly named scripts (space-delimited):
+         `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf etl/etl/cdm_drug_era.sql`
+     * Run in the background:
+         `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf > ../out_full_etl.out &`
+     * Continue after an error:
+         `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf etl/etl/cdm_observation.sql etl/etl/cdm_observation_period.sql etl/etl/cdm_fact_relationship.sql etl/etl/cdm_condition_era.sql etl/etl/cdm_drug_era.sql etl/etl/cdm_dose_era.sql etl/etl/cdm_cdm_source.sql >> ../out_full_etl.out &`

### Change Log (latest first) ###

+ **2021-02-08**

+ * Set version v1.0

+ * Drug_exposure table
+     * pharmacy.medication replaces particular values of prescription.drug
+     * the source value format is changed to COALESCE(pharmacy.medication.selected, prescription.drug) || prescription.prod_strength (see the sketch after this list)
+ * Labevents mapping is replaced with a new, reviewed version
+     * vocabulary affected: mimiciv_meas_lab_loinc
+     * lk_meas_labevents_clean and lk_meas_labevents_mapped are changed accordingly
+ * Unload for Atlas
+     * the technical fields unit_id, load_row_id, load_table_id, and trace_id are removed from the tables devoted to Atlas
+ * Delivery export script
+     * tables are exported to a single directory, one file per table; if a table is too large, it is exported to multiple files
+ * Bugfixes and cleanup
+     * real environment names are replaced with placeholders
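A sketch of the new source value composition (the column and join names are assumptions based on the MIMIC IV pharmacy and prescriptions tables, not the actual ETL code):

```sql
-- Assumed sketch: compose drug_source_value from pharmacy.medication
-- when available, falling back to prescriptions.drug.
SELECT CONCAT(
    COALESCE(ph.medication, pr.drug),
    ' ',
    pr.prod_strength
) AS drug_source_value
FROM etl_dataset.src_prescriptions pr
LEFT JOIN etl_dataset.src_pharmacy ph
    ON ph.pharmacy_id = pr.pharmacy_id;
```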
**2021-02-01**

- * Waveforms POC-2 (load from folders tree and csv files)
+ * Waveform POC-2 is created for 4 MIMIC III Waveform files uploaded to the bucket
+     * iterate through the folders tree, capture metadata, and load the CSVs
* Bugfixes


conf/dev.etlconf

Lines changed: 16 additions & 16 deletions
@@ -3,28 +3,28 @@
"variables":
{
-     "@source_project": "physionet-data",
-     "@core_dataset": "mimic_demo_core",
-     "@hosp_dataset": "mimic_demo_hosp",
-     "@icu_dataset": "mimic_demo_icu",
-     "@ed_dataset": "mimic_demo_ed",
+     "@source_project": "source_project...",
+     "@core_dataset": "core...",
+     "@hosp_dataset": "hosp...",
+     "@icu_dataset": "icu...",
+     "@ed_dataset": "ed...",

-     "@voc_project": "odysseus-mimic-dev",
-     "@voc_dataset": "vocabulary_2020_09_11",
+     "@voc_project": "etl_project...",
+     "@voc_dataset": "voc...",

-     "@wf_project": "odysseus-mimic-dev",
-     "@wf_dataset": "waveform_source_poc",
+     "@wf_project": "etl_project...",
+     "@wf_dataset": "wf...",

-     "@etl_project": "odysseus-mimic-dev",
-     "@etl_dataset": "mimiciv_demo_cdm_2021_01_20",
+     "@etl_project": "etl_project...",
+     "@etl_dataset": "etl...",

-     "@metrics_project": "odysseus-mimic-dev",
-     "@metrics_dataset": "mimiciv_demo_metrics_2021_01_20",
+     "@metrics_project": "etl_project...",
+     "@metrics_dataset": "metrics...",

-     "@atlas_project": "odysseus-mimic-dev",
-     "@atlas_dataset": "mimiciv_demo_202101_cdm_531",
+     "@atlas_project": "etl_project...",
+     "@atlas_dataset": "atlas...",

-     "@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
+     "@waveforms_csv_path": "gs://bucket..."

},
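The "@..." entries are placeholders presumably substituted into the SQL scripts by run_workflow.py; this commit replaces the real project and dataset names with placeholders. A sketch of the idea (the variable values here are invented examples):

```sql
-- A script referencing the config variables...
SELECT COUNT(*) AS row_count
FROM `@etl_project.@etl_dataset.cdm_person`;

-- ...which, with "@etl_project": "my-gcp-project" and
-- "@etl_dataset": "mimiciv_cdm", would be submitted as:
SELECT COUNT(*) AS row_count
FROM `my-gcp-project.mimiciv_cdm.cdm_person`;
```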

conf/full.etlconf

Lines changed: 22 additions & 16 deletions
@@ -3,28 +3,28 @@
"variables":
{
-     "@source_project": "physionet-data",
-     "@core_dataset": "mimic_core",
-     "@hosp_dataset": "mimic_hosp",
-     "@icu_dataset": "mimic_icu",
-     "@ed_dataset": "mimic_ed",
+     "@source_project": "source_project...",
+     "@core_dataset": "core...",
+     "@hosp_dataset": "hosp...",
+     "@icu_dataset": "icu...",
+     "@ed_dataset": "ed...",

-     "@voc_project": "odysseus-mimic-dev",
-     "@voc_dataset": "vocabulary_2020_09_11",
+     "@voc_project": "etl_project...",
+     "@voc_dataset": "voc...",

-     "@wf_project": "odysseus-mimic-dev",
-     "@wf_dataset": "waveform_source_poc",
+     "@wf_project": "etl_project...",
+     "@wf_dataset": "wf...",

-     "@etl_project": "odysseus-mimic-dev",
-     "@etl_dataset": "mimiciv_full_cdm_2021_01_31",
+     "@etl_project": "etl_project...",
+     "@etl_dataset": "etl...",

-     "@metrics_project": "odysseus-mimic-dev",
-     "@metrics_dataset": "mimiciv_full_metrics_2021_01_31",
+     "@metrics_project": "etl_project...",
+     "@metrics_dataset": "metrics...",

-     "@atlas_project": "odysseus-mimic-dev",
-     "@atlas_dataset": "mimiciv_full_202101_cdm_531",
+     "@atlas_project": "etl_project...",
+     "@atlas_dataset": "atlas...",

-     "@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
+     "@waveforms_csv_path": "gs://bucket..."

},

@@ -68,6 +68,12 @@
    "conf": "workflow_qa.conf"
},

+ {
+     "workflow": "metrics",
+     "comment": "build metrics with metrics_gen scripts",
+     "type": "sql",
+     "conf": "workflow_metrics.conf"
+ },
{
    "workflow": "gen_scripts",
    "comment": "automation to generate similar queries for some tasks",

custom_mapping_csv/custom_mapping_list.tsv

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
"file_name" "source_vocabulary_id" "min_concept_id" "max_concept_id" "row_count" "target_domains"
"gcpt_mimic_generated.csv" "mimiciv_mimic_generated" 2000000000 2000001000 "all(?)"
- "gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001173 173 "measurement"
+ "gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001235 235 "measurement"
"gcpt_obs_insurance.csv" "mimiciv_obs_insurance" 2000001301 2000001305 5 "observation, Meas Value"
"gcpt_per_ethnicity.csv" "mimiciv_per_ethnicity" 2000001401 2000001408 8 "person"
"gcpt_obs_marital.csv" "mimiciv_obs_marital" 2000001501 2000001507 7 "observation"
