SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge. While recent medical MLLMs demonstrate strong performance in lab settings, they often struggle in real-world applications, highlighting a substantial gap between research and practice. In this paper, we seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation. At the data collection stage, we introduce SemiHVision, a dataset that combines human annotations with automated augmentation techniques to improve both medical knowledge representation and diagnostic reasoning. For model fine-tuning, we trained PMC-Cambrian-8B-AN over 2400 H100 GPU hours, resulting in performance that surpasses public medical models like HuatuoGPT-Vision-34B (79.0% vs. 66.7%) and private general models like Claude3-Opus (55.7%) on traditional benchmarks such as SLAKE and VQA-RAD. In the evaluation phase, we observed that traditional benchmarks cannot accurately reflect realistic clinical task capabilities. To overcome this limitation and provide more targeted guidance for model evaluation, we introduce the JAMA Clinical Challenge, a novel benchmark specifically designed to evaluate diagnostic reasoning. On this benchmark, PMC-Cambrian-8B-AN achieves state-of-the-art performance with a GPT-4 score of 1.29, significantly outperforming HuatuoGPT-Vision-34B (1.13) and Claude3-Opus (1.17), demonstrating its superior diagnostic reasoning abilities.
Dataset | Caption Available | Link | License |
---|---|---|---|
DeepLesion | Yes | Link | CC BY 4.0 |
PadChest | Yes | Link | PADCHEST Dataset Research Use Agreement |
Eurorad | Yes | Link | Creative Commons Attribution 4.0 International License |
MIMIC-CXR-JPG | No | Link | PhysioNet Credentialed Health Data License 1.5.0 |
LLD | Yes | Link | LLD-MMRI Agreement |
MAMA-MIA | Yes | Link | CC BY-NC-SA 4.0 |
PMC-VQA | Yes | Link | CC BY-SA |
PMC-Instruct | Yes | - | OpenRAIL |
Quilt | Yes | Link | - |
Radiopaedia | No | Link | Radiopaedia Agreement |
JAMA Clinical Challenge | No | Link | JAMA's Agreement |
LLaVA-Med | Yes | Link | CC BY-NC 4.0 |
The dataset provided (referred to as "This Dataset") is constructed using multiple publicly available datasets and is intended solely for academic and technical research by researchers and developers. Any individual or organization (hereinafter referred to as "User") accessing or using this dataset must comply with the following disclaimer:
This Dataset is built from several publicly available datasets. The sources and licenses of these datasets are clearly stated in the accompanying documentation. Users are required to adhere to the relevant licenses, terms of use, and restrictions specified by the original dataset providers. Cases published on the Eurorad website before July 6, 2015 do not fall under the Creative Commons Attribution 4.0 International License, so Users must obtain direct permission from the author who submitted the case for any use. To avoid complications, we recommend not using cases published before this date unless you can secure explicit permission from each case submitter for every intended use.
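The date restriction on Eurorad cases can be applied as a simple filter before training or redistribution. The following is a minimal sketch, not part of the released tooling: it assumes each case is a dict with a hypothetical `published` field holding an ISO-8601 date string, and the sample records are invented for illustration. Adapt the field name and parsing to however your copy of the metadata stores publication dates.

```python
from datetime import date

# Cutoff stated in the disclaimer: Eurorad cases published before this date
# are not covered by the CC BY 4.0 license.
EURORAD_CC_BY_CUTOFF = date(2015, 7, 6)


def keep_cc_by_cases(cases, date_field="published"):
    """Return only cases published on or after the cutoff date.

    `cases` is an iterable of dicts; `date_field` is a hypothetical key
    holding an ISO-8601 date string (YYYY-MM-DD). Cases with a missing
    date are dropped, since their license status cannot be confirmed.
    """
    kept = []
    for case in cases:
        raw = case.get(date_field)
        if not raw:
            continue
        if date.fromisoformat(raw) >= EURORAD_CC_BY_CUTOFF:
            kept.append(case)
    return kept


# Hypothetical records, for illustration only.
sample = [
    {"case_id": "a", "published": "2015-07-05"},
    {"case_id": "b", "published": "2015-07-06"},
    {"case_id": "c", "published": "2019-01-20"},
    {"case_id": "d"},
]
filtered = keep_cc_by_cases(sample)
```

Dropping records without a date is the conservative choice here; keeping them would require confirming their publication date or obtaining permission from the case submitter.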
Reasonable efforts have been made to ensure the accuracy, integrity, and completeness of This Dataset. However, the User assumes all risks associated with using the dataset. The providers of This Dataset accept no responsibility for any errors, inaccuracies, or omissions that may arise.
Under no circumstances shall the providers or contributors of This Dataset be liable for any damages or consequences arising from the use or misuse of This Dataset by the User.
Users of This Dataset must comply with all applicable laws, regulations, and ethical standards. The dataset must not be used for illegal purposes, privacy violations, defamation, discrimination, or other unethical activities.
The intellectual property rights of the original image data in This Dataset belong to the respective rights holders of the source datasets. Users must not engage in activities that violate these intellectual property rights.
As a non-profit organization, we promote a collaborative and ethical open-source environment. Should any content within This Dataset infringe upon legitimate rights, please contact us, and we will make every effort to resolve the issue.
By downloading, accessing, or using This Dataset, the User acknowledges that they have read, understood, and agreed to comply with this disclaimer. If the User does not accept any part of this disclaimer, they should refrain from using This Dataset.
@article{wang2024semihvision,
title={SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation},
author={Wang, Junda and Ting, Yujan and Chen, Eric Z and Tran, Hieu and Yu, Hong and Huang, Weijing and Chen, Terrence},
journal={arXiv preprint arXiv:2410.14948},
year={2024}
}