Automatic ICD coding benchmark based on the MIMIC dataset.
Please check out our paper at EMNLP 2022 (demo track): AnEMIC: A Framework for Benchmarking ICD Coding Models
NOTE: 🚧 The repo is under active development. Please see below for available datasets/models.
Automatic diagnosis coding[^1] in clinical NLP is the task of predicting the diagnoses and procedures of a hospital stay given the summary of the stay (discharge summary). The labels are mostly represented as ICD (International Classification of Diseases) codes, alphanumeric codes widely adopted by hospitals in the US. The most popular database for automatic diagnosis coding is the MIMIC-III dataset, but preprocessing varies across the literature, and some of it is done incorrectly. Such inconsistencies and errors make it hard to compare different methods and, arguably, lead to incorrect evaluations.
This code repository aims to provide a standardized benchmark for automatic diagnosis coding with the MIMIC-III database. The benchmark covers the full ICD coding pipeline: dataset preprocessing, model training/evaluation, and an interactive web demo.
We currently provide (items in parentheses are under development):
- Four presets of preprocessed datasets: MIMIC-III full, top-50, full (old), and top-50 (old), where (old) denotes the preprocessing used in CAML[^2].
- ICD coding models: CNN, CAML, MultiResCNN[^3], DCAN[^4], TransICD[^5], Fusion[^6], (LAAT)
- Interactive demo
Please put the MIMIC-III `csv.gz` files (v1.4) under `datasets/mimic3/csv/`. You can also create symbolic links pointing to the files, as shown below.
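For example, assuming the raw files were downloaded to `/path/to/mimic3/` (the exact set of tables the preprocessing reads may vary; the note and code tables below are the core ones):

$ mkdir -p datasets/mimic3/csv
$ ln -s /path/to/mimic3/NOTEEVENTS.csv.gz datasets/mimic3/csv/
$ ln -s /path/to/mimic3/DIAGNOSES_ICD.csv.gz datasets/mimic3/csv/
$ ln -s /path/to/mimic3/PROCEDURES_ICD.csv.gz datasets/mimic3/csv/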
Please run the following command to generate the MIMIC-III top-50 dataset, or generate other versions using the config files in `configs/preprocessing`.
$ python run_preprocessing.py --config_path configs/preprocessing/default/mimic3_50.yml
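To build one of the other presets, point the script at the corresponding config file (the file name below is an assumption; check `configs/preprocessing` for the presets actually provided):

$ python run_preprocessing.py --config_path configs/preprocessing/default/mimic3_full.yml  # MIMIC-III full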
Please run the following command to train, or resume training of, the CAML model on the MIMIC-III top-50 dataset. You can evaluate the model with the `--test` option and use other config files under `configs`.
$ python run.py --config_path configs/caml/caml_mimic3_50.yml # Train
$ python run.py --config_path configs/caml/caml_mimic3_50.yml --test # Test
Training is logged through TensorBoard (logs are located in the output directory under `results/`).
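To monitor training, you can launch TensorBoard pointed at the results directory (the exact subdirectory depends on your config's output settings):

$ tensorboard --logdir results/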
Logging through text files is also performed during preprocessing, training, and evaluation. Log files are located under `logs/`.
After you train a model, you can run an interactive demo app of it (for example, CAML and other models on MIMIC-III top-50) by running:
$ streamlit run app.py -- --config_path configs/demo/multi_mimic3_50.yml # CAML, MultiResCNN, DCAN, Fusion on MIMIC-III top-50
You can also write your own config file, specifying modules in the same way as in preprocessing and training. For example:
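To serve a single model rather than several, pass a different demo config (the config name below is hypothetical; see `configs/demo` for the files actually provided):

$ streamlit run app.py -- --config_path configs/demo/caml_mimic3_50.yml  # CAML only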
- MIMIC-III full
Model | macro AUC | micro AUC | macro F1 | micro F1 | P@8 | P@15 |
---|---|---|---|---|---|---|
CNN | 0.835±0.001 | 0.974±0.000 | 0.034±0.001 | 0.420±0.006 | 0.619±0.002 | 0.474±0.004 |
CAML | 0.893±0.002 | 0.985±0.000 | 0.056±0.006 | 0.506±0.006 | 0.704±0.001 | 0.555±0.001 |
MultiResCNN | 0.912±0.004 | 0.987±0.000 | 0.078±0.005 | 0.555±0.004 | 0.741±0.002 | 0.589±0.002 |
DCAN | 0.848±0.009 | 0.979±0.001 | 0.066±0.005 | 0.533±0.006 | 0.721±0.001 | 0.573±0.000 |
TransICD | 0.886±0.010 | 0.983±0.002 | 0.058±0.001 | 0.497±0.001 | 0.666±0.000 | 0.524±0.001 |
Fusion | 0.910±0.003 | 0.986±0.000 | 0.081±0.002 | 0.560±0.003 | 0.744±0.002 | 0.589±0.001 |
- MIMIC-III top-50
Model | macro AUC | micro AUC | macro F1 | micro F1 | P@5 |
---|---|---|---|---|---|
CNN | 0.913±0.002 | 0.936±0.002 | 0.627±0.001 | 0.693±0.003 | 0.649±0.001 |
CAML | 0.918±0.000 | 0.942±0.000 | 0.614±0.005 | 0.690±0.001 | 0.661±0.002 |
MultiResCNN | 0.928±0.001 | 0.950±0.000 | 0.652±0.006 | 0.720±0.002 | 0.674±0.001 |
DCAN | 0.934±0.001 | 0.953±0.001 | 0.651±0.010 | 0.724±0.005 | 0.682±0.003 |
TransICD | 0.917±0.002 | 0.939±0.001 | 0.602±0.002 | 0.679±0.001 | 0.643±0.001 |
Fusion | 0.932±0.001 | 0.952±0.000 | 0.664±0.003 | 0.727±0.003 | 0.679±0.001 |
- MIMIC-III full (old)
Model | macro AUC | micro AUC | macro F1 | micro F1 | P@8 | P@15 |
---|---|---|---|---|---|---|
CNN | 0.833±0.003 | 0.974±0.000 | 0.027±0.005 | 0.419±0.006 | 0.612±0.004 | 0.467±0.001 |
CAML | 0.880±0.003 | 0.983±0.000 | 0.057±0.000 | 0.502±0.002 | 0.698±0.002 | 0.548±0.001 |
MultiResCNN | 0.905±0.003 | 0.986±0.000 | 0.076±0.002 | 0.551±0.005 | 0.738±0.003 | 0.586±0.003 |
DCAN | 0.837±0.005 | 0.977±0.001 | 0.063±0.002 | 0.527±0.002 | 0.721±0.001 | 0.572±0.001 |
TransICD | 0.882±0.010 | 0.982±0.001 | 0.059±0.008 | 0.495±0.005 | 0.663±0.007 | 0.521±0.006 |
Fusion | 0.910±0.003 | 0.986±0.000 | 0.076±0.007 | 0.555±0.008 | 0.744±0.003 | 0.588±0.003 |
- MIMIC-III top-50 (old)
Model | macro AUC | micro AUC | macro F1 | micro F1 | P@5 |
---|---|---|---|---|---|
CNN | 0.892±0.003 | 0.920±0.003 | 0.583±0.006 | 0.652±0.008 | 0.627±0.007 |
CAML | 0.865±0.017 | 0.899±0.008 | 0.495±0.035 | 0.593±0.020 | 0.597±0.016 |
MultiResCNN | 0.898±0.006 | 0.928±0.003 | 0.590±0.012 | 0.666±0.013 | 0.638±0.005 |
DCAN | 0.915±0.002 | 0.938±0.001 | 0.614±0.001 | 0.690±0.002 | 0.653±0.004 |
TransICD | 0.895±0.003 | 0.924±0.002 | 0.541±0.010 | 0.637±0.003 | 0.617±0.005 |
Fusion | 0.904±0.002 | 0.930±0.001 | 0.606±0.009 | 0.677±0.003 | 0.640±0.001 |
(in alphabetical order)
- Abheesht Sharma @abheesht17
- Juyong Kim @dalgu90
- Suhas Shanbhogue @SuhasShanbhogue
@inproceedings{juyong2022anemic,
title = {AnEMIC: A Framework for Benchmarking ICD Coding Models},
author = {Kim, Juyong and Sharma, Abheesht and Shanbhogue, Suhas and Ravikumar, Pradeep and Weiss, Jeremy C},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP), System Demonstrations},
year = {2022},
publisher = {ACL},
url = {https://github.com/dalgu90/icd-coding-benchmark},
}
Footnotes
[^1]: Also referred to as medical coding, clinical coding, or simply ICD coding in other literature. The terms may differ slightly in meaning.
[^2]: Mullenbach et al., Explainable Prediction of Medical Codes from Clinical Text, NAACL 2018 (paper, code)
[^3]: Li and Yu, ICD Coding from Clinical Text Using Multi-Filter Residual Convolutional Neural Network, AAAI 2020 (paper, code)
[^4]: Ji et al., Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text, Clinical NLP Workshop 2020 (paper, code)
[^5]: Biswas et al., TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding, AIME 2021 (paper, code)
[^6]: Luo et al., Fusion: Towards Automated ICD Coding via Feature Compression, Findings of ACL 2021 (paper, code)