- Title: A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models
- Authors: Asad Aali, Dave Van Veen, Yamin Ishraq Arefeen, Jason Hom, Christian Bluethgen, Eduardo Pontes Reis, Sergios Gatidis, Namuun Clifford, Joseph Daws, Arash S Tehrani, Jangwon Kim, Akshay S Chaudhari
- Institute: Stanford University
- Contact: [email protected]
Use these commands to set up a conda environment:
conda env create -f env/environment.yml
conda activate bhc_summ
pip install -r env/requirements.txt
Set model and case_id as desired:
- training/llama2_peft.ipynb: fine-tune llama2 using QLoRA.
- training/train_peft.sh: fine-tune clin-t5, flan-t5, flan-ul2, and falcon models using QLoRA.
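For readers new to QLoRA: it trains small low-rank adapter matrices on top of a frozen, quantized base model. The sketch below illustrates only the low-rank update (effective weight = W + (alpha / r) * B A) in plain Python and ignores quantization entirely. It is not taken from the training scripts; the function names and toy matrices are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) for a frozen weight W.

    w: out x in (frozen base weight)
    a: r x in   (trainable adapter)
    b: out x r  (trainable adapter)
    """
    r = len(a)                # adapter rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)      # out x in low-rank update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    w = [[1, 0], [0, 1]]      # toy 2x2 base weight
    a = [[1, 2]]              # rank-1 adapter, shape 1x2
    b = [[1], [0]]            # shape 2x1
    print(lora_effective_weight(w, a, b, alpha=2))
```

Only A and B are updated during fine-tuning, which is why the number of trainable parameters stays small relative to the base model.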
Set model and case_id as desired:
- inference/gpt_inference.ipynb: generate output from gpt models using discrete or in-context prompting.
- inference/llama2_inference.ipynb: generate output from fine-tuned llama models.
- inference/run_discrete.sh: generate output from other models via discrete prompting.
- inference/run_peft.sh: generate output from other fine-tuned models.
- inference/calc_metrics.sh: calculate metrics on outputs.
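As a rough illustration of the difference between the two prompting modes, the sketch below assembles a prompt from an instruction alone or from an instruction plus a few worked examples. It assumes that "discrete" means an instruction-only prompt with no examples. The function name, the `input:`/`summary:` labels, and the sample text are hypothetical and not taken from the inference notebooks.

```python
def build_prompt(instruction, source_text, examples=()):
    """Assemble a summarization prompt.

    With no examples this is an instruction-only prompt; with examples,
    each (input, summary) pair is placed before the text to summarize.
    """
    parts = [instruction.strip()]
    for ex_input, ex_summary in examples:
        parts.append(f"input: {ex_input.strip()}\nsummary: {ex_summary.strip()}")
    # The final block is left open so the model completes the summary.
    parts.append(f"input: {source_text.strip()}\nsummary:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    instruction = "Summarize the clinical note into a brief hospital course."
    note = "Patient admitted with chest pain."
    demos = [("Patient admitted with fever.", "Treated for infection and discharged.")]
    print(build_prompt(instruction, note))          # instruction only
    print(build_prompt(instruction, note, demos))   # one in-context example
```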
Download the pre-processed MIMIC-IV-BHC dataset, published on PhysioNet.
- utils/mimic_iv_bhc_preprocessing.py: script for generating the MIMIC-IV-BHC dataset.
- In src/constants.py, set your own project directory DIR_PROJECT.
- To modify default parameters, create a new cases entry in src/constants.py.
- To add your own dataset, follow the format in data/, which contains a subset of chest x-ray reports from Open-i.
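The configuration steps above might look roughly like the sketch below. Only the names DIR_PROJECT and cases come from this README; the dictionary layout, the parameter keys, the default values, and the helper `get_case_params` are hypothetical, so check src/constants.py for the actual structure before copying anything.

```python
import os

# Replace with your own project directory (placeholder path).
DIR_PROJECT = "/path/to/your/project"

# Hypothetical default parameters; the real keys live in src/constants.py.
DEFAULTS = {
    "learning_rate": 1e-3,
    "num_epochs": 5,
}

# Hypothetical layout: each case_id maps to the parameters it overrides.
cases = {
    0: {},                          # case 0 uses the defaults unchanged
    1: {"learning_rate": 1e-4},     # new entry overriding one default
}


def get_case_params(case_id):
    """Return the defaults merged with the overrides for one case_id."""
    params = dict(DEFAULTS)         # copy so DEFAULTS is never mutated
    params.update(cases[case_id])
    return params


if __name__ == "__main__":
    print(os.path.join(DIR_PROJECT, "data"))
    print(get_case_params(1))
```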
@article{aali2024benchmark,
title={A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models},
author={Aali, Asad and Van Veen, Dave and Arefeen, Yamin Ishraq and Hom, Jason and Bluethgen, Christian and Reis, Eduardo Pontes and Gatidis, Sergios and Clifford, Namuun and Daws, Joseph and Tehrani, Arash S and Kim, Jangwon and Chaudhari, Akshay S.},
journal={Journal of the American Medical Informatics Association},
year={2024},
publisher={Oxford University Press}
}
@dataset{aali2024mimic,
title={MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization},
author={Aali, Asad and Van Veen, Dave and Arefeen, Yamin Ishraq and Hom, Jason and Bluethgen, Christian and Reis, Eduardo Pontes and Gatidis, Sergios and Clifford, Namuun and Daws, Joseph and Tehrani, Arash S and Chaudhari, Akshay S.},
year={2024},
publisher={PhysioNet}
}