This is the implementation of “Beyond Static Knowledge: Dynamic Context-Aware Cross-Modal Contrastive Learning for Medical Visual Question Answering”, published in IEEE Transactions on Medical Imaging (IEEE TMI).
Medical Visual Question Answering (Med-VQA) aims to analyze medical images and accurately respond to natural language queries, thereby optimizing clinical workflows and improving diagnostic and therapeutic outcomes. Although medical images contain rich visual information, the corresponding textual queries frequently lack sufficient descriptive content. This information imbalance, together with modality differences, leads to significant semantic bias. Furthermore, while existing approaches integrate external medical knowledge to enhance model performance, they primarily rely on static knowledge that does not adapt dynamically to specific input samples, leading to redundant information and noise interference. To address these challenges, we propose a Contextual Knowledge-Aware Dynamic Perception for Cross-Modal Reasoning and Alignment (CKRA) model. To mitigate knowledge redundancy, CKRA employs a dynamic perception mechanism that leverages semantic cues from the query to selectively filter medical knowledge relevant to the current sample’s context. To alleviate cross-modal semantic bias, CKRA bridges the gap between visual and linguistic features through knowledge-image contrastive learning, optimizing knowledge feature representations and directing the model’s attention to key image regions. Further, we design a dual-stream guided attention network that facilitates cross-modal interaction and alignment across multiple dimensions. Experimental results show that the proposed CKRA model outperforms state-of-the-art methods on the SLAKE and VQA-RAD datasets. In addition, ablation studies validate the effectiveness of each module, while Grad-CAM maps further demonstrate the feasibility of CKRA for medical visual question answering tasks. The source code and weights of the model are available at https://github.com/cloneiq/CKRA-MedVQA.
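The knowledge-image contrastive objective described above can be sketched as a symmetric InfoNCE-style loss between knowledge and image embeddings. The snippet below is an illustrative NumPy sketch, not the authors' implementation; the feature dimensions, temperature value, and function names are assumptions:

```python
import numpy as np

def knowledge_image_contrastive_loss(knowledge_feats, image_feats, temperature=0.07):
    """Illustrative symmetric InfoNCE-style loss. Rows of the two matrices are
    paired (knowledge, image) samples; matched pairs sit on the diagonal of the
    similarity matrix. Sketch only -- not the CKRA training code."""
    # L2-normalize both modalities so dot products become cosine similarities
    k = knowledge_feats / np.linalg.norm(knowledge_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = k @ v.T / temperature          # (B, B) similarity matrix
    diag = np.arange(len(k))                # matched pairs on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[diag, diag].mean()

    # average of knowledge->image and image->knowledge directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Pulling matched knowledge-image pairs together while pushing mismatched pairs apart is what lets the knowledge features steer attention toward the relevant image regions.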
Run the following command to install the required packages:
conda env create -f environment.yaml # method 1
pip install -r requirements.txt # method 2
├── checkpoints
├── data
│   ├── vqa_medvqa_2019_test.arrow
│   ├── ......
├── download
│   ├── checkpoints
│   ├── biobert_v1.1
│   ├── pretrained
│   │   ├── m3ae.ckpt
│   ├── roberta-base
├── m3ae
├── prepro
├── run_scripts
Please follow the instructions here and use only the SLAKE and VQA-RAD datasets.
Download the m3ae pretrained weights and put them in download/pretrained.
Download roberta-base and put it in download/roberta-base.
Download BioBERT and put it in download/biobert_v1.1.
Download the checkpoints we trained and put them in download/checkpoints.
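After downloading, you can sanity-check that everything landed where the run scripts expect it. This is an optional helper we wrote for convenience (not part of the repository); the paths mirror the directory tree above:

```python
from pathlib import Path

# Paths the run scripts expect, mirroring the directory tree above
REQUIRED = [
    "download/pretrained/m3ae.ckpt",
    "download/roberta-base",
    "download/biobert_v1.1",
    "download/checkpoints",
]

def missing_downloads(root="."):
    """Return the required download paths that are not present under root."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_downloads()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All required downloads are in place.")
```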
# cd to the root of this repository
bash run_scripts/ckra_train.sh
# cd to the root of this repository
bash run_scripts/ckra_test.sh
If this repository is useful for your research, please cite:
@article{Yang2025CKRA-MedVQA,
  title={Beyond Static Knowledge: Dynamic Context-Aware Cross-Modal Contrastive Learning for Medical Visual Question Answering},
  author={Yang, Rui and Liu, Lijun and Feng, Xupeng and Peng, Wei and Yang, Xiaobing},
  journal={IEEE Transactions on Medical Imaging},
  year={2025},
  publisher={IEEE}
}
@inproceedings{chen2022m3ae,
  title={Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training},
  author={Chen, Zhihong and Du, Yuhao and Hu, Jinpeng and Liu, Yang and Li, Guanbin and Wan, Xiang and Chang, Tsung-Hui},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year={2022},
  organization={Springer}
}