This project is an Image Caption Generator built as an interactive web application with Streamlit. It uses a pre-trained model from the Hugging Face Transformers library to generate a descriptive caption for an uploaded image.
The application runs an image-to-text pipeline backed by the ydshieh/vit-gpt2-coco-en model. The model pairs a Vision Transformer (ViT), which encodes the visual content of the image, with a GPT-2 language model, which decodes that encoding into a coherent, human-like caption.
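The pipeline step described above could look roughly like the sketch below. It assumes an installed Transformers version that still provides the `image-to-text` pipeline task. The helper names (`load_captioner`, `first_caption`, `caption_image`) are hypothetical and not taken from this project's code.

```python
from typing import Any, Dict, List

MODEL_NAME = "ydshieh/vit-gpt2-coco-en"  # model named in this README


def load_captioner(model_name: str = MODEL_NAME):
    """Build the image-to-text pipeline (downloads weights on first use)."""
    # Imported lazily so first_caption() can be used without loading Transformers.
    from transformers import pipeline

    return pipeline("image-to-text", model=model_name)


def first_caption(outputs: List[Dict[str, Any]]) -> str:
    """Return the first generated caption, or an empty string if there is none."""
    if not outputs:
        return ""
    return str(outputs[0].get("generated_text", "")).strip()


def caption_image(image_path: str) -> str:
    """Caption a single image file (hypothetical helper for illustration)."""
    captioner = load_captioner()
    # The pipeline returns a list of dicts, each with a "generated_text" key.
    return first_caption(captioner(image_path))
```

Calling `caption_image("photo.jpg")` with a local image path (the file name here is only a placeholder) would load the model and return one caption string.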
The application is wrapped in a Streamlit interface where users upload an image and see the generated caption as soon as the model produces it.
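A minimal sketch of such an interface is shown below. It is not the project's actual `app.py`; the function names, the widget labels, and the accepted file types are assumptions chosen for illustration.

```python
from typing import Any, Callable

MODEL_NAME = "ydshieh/vit-gpt2-coco-en"  # model named in this README


def load_captioner():
    """Create the image-to-text pipeline (hypothetical helper)."""
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_NAME)


def run_captioner(captioner: Callable[[Any], list], image: Any) -> str:
    """Call the captioner on one image and return a display string."""
    outputs = captioner(image)
    if not outputs:
        return "(no caption generated)"
    return outputs[0]["generated_text"].strip()


def main() -> None:
    # Imported here so the helpers above stay usable without Streamlit or Pillow.
    import streamlit as st
    from PIL import Image

    st.title("Image Caption Generator")
    uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
    if uploaded is None:
        st.info("Upload an image to generate a caption.")
        return

    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded image")

    # cache_resource keeps a single model instance across Streamlit reruns.
    captioner = st.cache_resource(load_captioner)()
    with st.spinner("Generating caption..."):
        st.write(run_captioner(captioner, image))
```

To try the sketch, save it as `app.py`, add a final line that calls `main()`, and launch it with `streamlit run app.py`. Wrapping the loader in `st.cache_resource` is a design choice to avoid reloading the model on every interaction.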
- Streamlit: For building the interactive web UI.
- Hugging Face Transformers: For accessing the pre-trained ViT-GPT2 model.
- Pillow (PIL): For image processing.
- PyTorch: As the backend framework for the model.
- Install the required libraries: `pip install streamlit transformers torch Pillow`
- Save the code as a Python file (e.g., `app.py`).
- Run the application from your terminal: `streamlit run app.py`
- Upload an image through the web interface to see the result.