This repository contains code and resources for large language models (LLMs) in token-level clinical named entity recognition (NER) on rare disease datasets, specifically working with the RareDis dataset.
The original RareDis-v1 dataset should have the following structure. Please refer to the authors of the RareDis dataset for access to the raw data. Once you have obtained the data, you can use brat2iob.py to convert the raw data (BRAT format) to IOB format. You can also directly use the converted data.
RareDis-v1
├── dev
├── test
├── train
├── Abetalipoproteinemia.ann
├── Abetalipoproteinemia.txt
└── ...
└── README.md
- brat2iob.py: A script to convert raw data from BRAT format to IOB format.
- rd_ner_eval.py: Evaluation script for the model outputs.
- rd_ner_ft.py: Fine-tuning script for encoder-only models (BERT, etc.) on rare disease data.
- rd_ner_llm.py: A script leveraging large language models (LLMs) for NER on the RareDis dataset.
- rd_ner_llm_finetune.py: Fine-tuning script specifically for LLMs (Llama-2-7b , etc.) on NER tasks.
- rd_ner_openai.py: A script using OpenAI models for NER tasks on RareDis.
- rd_ner_rag.py: A script for implementing Retrieval-Augmented Generation (RAG) with NER tasks on RareDis.
- gpt4-1106-preview-chat_[all]_5shot_result_test: Example output of NER predictions using a 5-shot GPT-4 model.
- utils.py: Utility functions to support the main scripts.
- requirements.txt: A list of Python dependencies required to run the code in this repository.
To run the scripts in this repository, install the required dependencies by running:
pip install -r requirements.txt
You can find an example of model output in gpt4-1106-preview-chat_[all]_5shot_result_test, showcasing predictions from a GPT-4 model with 5-shot learning.
If you find this resource useful, please consider citing our work:
@article{lu2024large,
title={Large Language Models Struggle in Token-Level Clinical Named Entity Recognition},
author={Lu, Qiuhao and Li, Rui and Wen, Andrew and Wang, Jinlian and Wang, Liwei and Liu, Hongfang},
journal={arXiv preprint arXiv:2407.00731},
year={2024}
}