GHOST: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark [WACV 2026]

[Project Page] [arXiv] [PDF] [Supplemental] [Slides] [BibTeX]

Installation

pip install -r requirements.txt

For local VLM models, install VLMEvalKit:

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit && pip install -e .

Quick Start

Evaluate Predictions

python evaluate_ghost.py --pred-path predictions.json

Generate Predictions

Local VLM:

python run_predictions.py \
  --data-path dataset/ghost_full_merged.json \
  --image-dir dataset/images/ \
  --model-name llava_v1.5_7b \
  --model-type vlm \
  --output-path predictions/llava_predictions.json

API Model:

python run_predictions.py \
  --data-path dataset/ghost_full_merged.json \
  --image-dir dataset/images/ \
  --model-name gpt-4o \
  --model-type api \
  --output-path predictions/gpt4o_predictions.json \
  --api-key YOUR_API_KEY

Checkpoint/Resume:

Predictions are automatically saved every 10 questions
If interrupted, rerun the same command to resume
Use --no-resume to start from scratch
Use --checkpoint-every N to change checkpoint frequency

Dataset Format

JSON format with question keys: {image_id}_{object_id}_{question_type}_{pos/neg}

{
  "2406158_obj3_1pos": "A wheels is present in the image.",
  "2406158_obj3_attr1_1pos": "The color of the wheels present in the image is white.",
  "2406158_obj3_rel1_1pos": "The spatial relation between the wheels and man is that the wheels is to the left of the man."
}

Question Types:

Object: 1pos, 1neg, 2neg, ...
Attribute: attr1_1pos, attr1_1neg, ...
Relation: rel1_1pos, rel1_2neg, ...

GhostConsistencyScore Metric

Categories:

Objects GCS: Consistency on object presence questions
Attributes GCS: Consistency on object attribute questions
Relations GCS: Consistency on spatial relation questions

Output Format

Predictions are saved as JSON:

[
  {
    "question_id": "2406158_obj3_1pos",
    "object_id": "2406158_obj3",
    "image": "2406158.jpg",
    "text": "A wheels is present in the image.",
    "label": "yes",
    "model_name": "llava_v1.5_7b",
    "prediction": "true"
  }
]

Repository Structure

ghost-evaluation/
├── dataset/
│   ├── ghost_full_merged.json
│   └── images/
├── ghost_consistency_score.py
├── utils.py
├── evaluate_ghost.py
├── run_predictions.py
├── requirements.txt
├── .gitignore
└── README.md

Library Usage

from evaluate_ghost import evaluate

results = evaluate('predictions.json')
print(f"Objects GCS: {results['objects_gcs']:.2f}%")
print(f"Attributes GCS: {results['attributes_gcs']:.2f}%")
print(f"Relations GCS: {results['relations_gcs']:.2f}%")

API Implementation

To use API models, implement the get_api_prediction() function in run_predictions.py:

def get_api_prediction(model_name: str, image_path: str, prompt: str, api_key: str = None) -> str:
    if model_name == 'gpt-4o':
        # Implement OpenAI API call
        pass
    elif model_name == 'gemini-pro':
        # Implement Google Gemini API call
        pass
    # Add more models as needed

Citation

@article{ghost2024,
  title={GHOST: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark},
  author={[Authors]},
  journal={[Journal/Conference]},
  year={2024}
}

License

[license]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GHOST: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark [WACV 2026]

Installation

Quick Start

Evaluate Predictions

Generate Predictions

Dataset Format

GhostConsistencyScore Metric

Output Format

Repository Structure

Library Usage

API Implementation

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dataset		dataset
fig		fig
.gitignore		.gitignore
README.md		README.md
evaluate_ghost.py		evaluate_ghost.py
ghost_consistency_score.py		ghost_consistency_score.py
requirements.txt		requirements.txt
run_predictions.py		run_predictions.py
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

GHOST: Getting to the Bottom of Hallucinations with a Multi-round Consistency Benchmark [WACV 2026]

Installation

Quick Start

Evaluate Predictions

Generate Predictions

Dataset Format

GhostConsistencyScore Metric

Output Format

Repository Structure

Library Usage

API Implementation

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages