I ran several experiments with the checkpoints released with the paper and found the following issues:
I. Poor Performance
- The overall performance of the model is very weak, as shown below:
II. Weakness of the Checkpoint Model
- My experiments suggest that visual information has little to no impact on the model’s predictions.
- The model seems to rely primarily on the question (the textual features) rather than on the image.
- I applied adversarial attacks using the downstream loss, but they had little effect: the output did not change.
- Furthermore, when I provided random noise images and asked the model the same questions, it produced the same responses as it did with real medical images.
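
The random-noise check above can be sketched roughly as follows. This is a minimal, model-agnostic illustration, not the exact script used for these experiments: `predict` is a hypothetical wrapper around the checkpoint's inference call, images are represented as plain nested lists of pixel values, and `text_only_predict` is a toy stand-in that ignores its image on purpose.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumption: a grayscale image as rows of pixel values in [0, 1].
Image = List[List[float]]
# Assumption: hypothetical wrapper around the checkpoint's inference call.
Predict = Callable[[Image, str], str]


def random_noise_like(image: Image, rng: random.Random) -> Image:
    """Return uniform noise with the same height and width as `image`."""
    return [[rng.random() for _ in row] for row in image]


def image_sensitivity(
    predict: Predict,
    samples: Sequence[Tuple[Image, str]],
    seed: int = 0,
) -> float:
    """Fraction of (image, question) pairs whose answer changes
    when the real image is replaced by random noise."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    changed = 0
    for image, question in samples:
        original = predict(image, question)
        noisy = predict(random_noise_like(image, rng), question)
        if original != noisy:
            changed += 1
    return changed / len(samples)


def text_only_predict(image: Image, question: str) -> str:
    """Toy stand-in model that looks only at the question, never the image."""
    return "yes" if question.lower().startswith("is there") else "no"
```

A sensitivity close to 0 means the answers are essentially independent of the image, which is the behaviour reported above; a model that actually uses visual input should score well above 0 on the same samples.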
