
A Comprehensive Analysis of Image Captioning Models - Evaluating ViT-GPT2, BLIP, and GIT

Benchmarking Vision-Language Models for Automated Image Description Using Quantitative and Qualitative Metrics


This project is a comparative study of image caption generation models. The experiment aims to provide:

  • A detailed breakdown of the architectures and mechanisms of ViT-GPT2, BLIP, and GIT.
  • A quantitative and qualitative analysis of their performance.
  • Insights into the strengths, limitations, and suitability of each model for real-world applications.

The dataset utilized for this study consists of 600 images, sourced exclusively from open-access platforms to ensure accessibility and reproducibility. Each image was meticulously self-annotated with high-quality captions to create a reliable ground truth for evaluating the models’ performance.

Visit the Project on GitHub View data in Kaggle

Dataset Composition

  1. Animals - Includes various species in diverse settings, such as wildlife, pets, and zoos.
  2. Humans - Depicts people in natural environments, performing activities, and interacting with objects.
  3. Architecture - Captures man-made structures, including buildings, bridges, and urban landscapes.
  4. Natural Formations and Nature - Covers landscapes, forests, mountains, rivers, and other natural scenes.
  5. Everyday Objects - Features commonly found objects, such as tools, household items, and vehicles.

The collected data has been uploaded to Kaggle; see the Kaggle link above.


Dataset Challenges

  1. Diversity of Visual Content: Ensuring the dataset captures a wide variety of visual scenes and objects for generalizability.
  2. Annotation Quality: Maintaining consistency in style and accuracy across all annotated captions.
  3. Ambiguity: Handling images with multiple possible interpretations, where different valid captions could describe the same image.

Sample Image and Prompt

  1. Custom Annotation - Man attempting a slam dunk
  2. vit-gpt2 - a woman jumping in the air to catch a frisbee
  3. blip-conditional - a photography of a basketball player jumping to the basket
  4. blip-unconditional - a man jumping in the air with a basketball
  5. git - a young man playing basketball in a gym

Models used for caption generation

The models used for caption generation are ViT-GPT2, BLIP (in conditional and unconditional modes), and GIT.

The captions folder in the repository contains all ground-truth captions as well as the model-generated captions.

The annotations can be viewed in this sheet or in the all_captions.csv file.
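To make the difference between BLIP's two modes concrete, here is a minimal sketch using the Hugging Face transformers library. It is not taken from the project repository: the checkpoint name, the helper names blip_prompt and caption_with_blip, and the prefix "a photography of" (guessed from the sample caption above) are all assumptions. In conditional mode a text prefix is passed alongside the image; in unconditional mode only the image is passed.

```python
# Assumed checkpoint; the post does not state which BLIP weights were used.
BLIP_CHECKPOINT = "Salesforce/blip-image-captioning-base"


def blip_prompt(mode, prefix="a photography of"):
    """Return the text prefix for a BLIP mode (hypothetical helper).

    Conditional mode gets a prompt, unconditional mode gets None.
    The default prefix is an assumption inferred from the sample caption.
    """
    if mode == "conditional":
        return prefix
    if mode == "unconditional":
        return None
    raise ValueError(f"unknown mode: {mode!r}")


def caption_with_blip(image_path, mode="unconditional", max_new_tokens=30):
    """Generate one caption for the image at image_path."""
    # Heavy imports stay local so blip_prompt works without them installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(BLIP_CHECKPOINT)
    model = BlipForConditionalGeneration.from_pretrained(BLIP_CHECKPOINT)
    image = Image.open(image_path).convert("RGB")

    prompt = blip_prompt(mode)
    if prompt is None:
        # Unconditional: the model describes the image from scratch.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional: the model continues the given text prefix.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling caption_with_blip("dunk.jpg", mode="conditional") on a local image (the file name is a placeholder) would download the weights on first use. ViT-GPT2 and GIT can be driven through the same library in a similar way, but they are left out here to keep the sketch short.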

We used three metrics for our comparative study:

  • METEOR
  • BLEU-1
  • BLEU-2
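As a rough illustration of what BLEU-1 and BLEU-2 measure, the sketch below computes sentence-level BLEU with a single reference in plain Python. It is an assumption-laden simplification: it uses whitespace tokenization, lowercasing, and no smoothing, and the function names are illustrative. The score.ipynb notebook may rely on a library implementation with different settings, so the numbers need not match the reported ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(reference, candidate, max_n):
    """Unsmoothed BLEU up to max_n with uniform weights (max_n=1 gives BLEU-1)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "Man attempting a slam dunk"
candidate = "a man jumping in the air with a basketball"
print(bleu(reference, candidate, 1))  # unigram overlap only
print(bleu(reference, candidate, 2))  # also requires matching bigrams
```

For the sample pair above, only "a" (counted once) and "man" match, so BLEU-1 is 2/9, while BLEU-2 is 0 because no bigram is shared. This shows how a short ground-truth caption can pull scores down even when the generated caption is a reasonable description.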

Results

The results are calculated in the score.ipynb notebook. The table and graphs obtained from the study are shown below:

Model                  METEOR   BLEU-1   BLEU-2
ViT-GPT2               0.1644   0.1845   0.0816
GIT                    0.2207   0.2301   0.1170
BLIP (Conditional)     0.2418   0.2357   0.1246
BLIP (Unconditional)   0.2426   0.2555   0.1327

Quantitative Results and Qualitative Analysis

The quantitative results are:

  • BLIP (Unconditional Mode) achieved the highest scores across all metrics (METEOR: 0.2426, BLEU-1: 0.2555, BLEU-2: 0.1327).
  • BLIP (Conditional Mode) closely followed, scoring slightly below the unconditional mode on all three metrics (METEOR: 0.2418, BLEU-1: 0.2357, BLEU-2: 0.1246).
  • GIT placed third, behind both BLIP modes and ahead of ViT-GPT2 on every metric (METEOR: 0.2207, BLEU-1: 0.2301, BLEU-2: 0.1170).
  • ViT-GPT2 performed the weakest, struggling with visual-text alignment (METEOR: 0.1644, BLEU-1: 0.1845, BLEU-2: 0.0816).
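The gaps between models are easier to judge as relative differences. The short sketch below recomputes them from the reported scores; the dictionary layout and the function names are illustrative and not part of the project code.

```python
# Scores copied from the results reported in this post.
scores = {
    "ViT-GPT2": {"METEOR": 0.1644, "BLEU-1": 0.1845, "BLEU-2": 0.0816},
    "GIT": {"METEOR": 0.2207, "BLEU-1": 0.2301, "BLEU-2": 0.1170},
    "BLIP (Conditional)": {"METEOR": 0.2418, "BLEU-1": 0.2357, "BLEU-2": 0.1246},
    "BLIP (Unconditional)": {"METEOR": 0.2426, "BLEU-1": 0.2555, "BLEU-2": 0.1327},
}


def best_model(metric):
    """Name of the model with the highest score on the given metric."""
    return max(scores, key=lambda model: scores[model][metric])


def relative_gain(model, baseline, metric):
    """Fractional improvement of model over baseline on one metric."""
    base = scores[baseline][metric]
    return (scores[model][metric] - base) / base


for metric in ("METEOR", "BLEU-1", "BLEU-2"):
    gain = relative_gain("BLIP (Unconditional)", "ViT-GPT2", metric)
    print(f"{metric}: best = {best_model(metric)}, gain over ViT-GPT2 = {gain:.1%}")
```

On these numbers, BLIP (Unconditional) improves on ViT-GPT2 by roughly 48% on METEOR, 38% on BLEU-1, and 63% on BLEU-2, whereas the difference between the two BLIP modes on METEOR is well under 1%.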

Our qualitative observations are:

  • BLIP models generated semantically rich and contextually accurate captions.
  • GIT provided coherent but sometimes generic captions.
  • ViT-GPT2 struggled with misidentification and irrelevant outputs.

Model Strengths and Weaknesses

The strengths are as follows:

  • BLIP’s Dual Mode (Conditional/Unconditional) allowed better flexibility in caption generation.
  • GIT’s unified transformer architecture helped in balancing vision-language processing.
  • ViT-GPT2’s modularity enabled adaptability in vision and text alignment.

The weaknesses are as follows:

  • BLIP required significant computational resources.
  • GIT lacked interpretability due to its tightly coupled vision-language representation.
  • ViT-GPT2 frequently misidentified objects and actions, leading to less reliable captions.

Evaluation Metrics

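- METEOR captured semantic accuracy: it aligns words by exact form, stem, and synonym, and combines precision with recall.
- BLEU-1 measured unigram (word-level) precision, and BLEU-2 added bigram precision as a proxy for phrase coherence.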
- Other advanced metrics (CIDEr, ROUGE-L, SPICE) were not included, limiting evaluation depth.
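For reference, the standard METEOR formulation can be written as follows. The post does not state which implementation or parameters were used, so the values α = 0.9, β = 3, γ = 0.5 mentioned here are an assumption based on commonly used defaults. Here m is the number of matched unigrams, |c| and |r| are the candidate and reference lengths, and ch is the number of contiguous matched chunks.

```latex
P = \frac{m}{|c|}, \qquad
R = \frac{m}{|r|}, \qquad
F_{\text{mean}} = \frac{P \, R}{\alpha P + (1 - \alpha) R}

\text{Pen} = \gamma \left( \frac{ch}{m} \right)^{\beta}, \qquad
\text{METEOR} = F_{\text{mean}} \, (1 - \text{Pen})
```

Because α close to 1 weights recall heavily, a caption that covers the words of a short reference scores well on METEOR even if it adds extra detail, which BLEU's precision-based scoring would penalize.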

Limitations

- Small dataset size (600 images) reduced statistical reliability.
- Lack of advanced evaluation metrics prevented a deeper analysis.
- Real-world applicability was not tested, limiting practical insights.
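One way to quantify how much the dataset size limits reliability is a bootstrap confidence interval over per-image scores. The sketch below is a generic percentile bootstrap, not something the study performed; the function name, its parameters, and the synthetic scores in the usage lines are assumptions for illustration only.

```python
import random
import statistics


def bootstrap_ci(per_image_scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(per_image_scores)
    # Resample the images with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(per_image_scores, k=n))
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Synthetic placeholder scores for 600 images; real per-image metric
# values would come from the evaluation notebook.
placeholder_scores = [0.1, 0.2, 0.3, 0.4] * 150
low, high = bootstrap_ci(placeholder_scores)
print(f"mean = {statistics.fmean(placeholder_scores):.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

If the intervals of two models overlap substantially, as may be the case for the two BLIP modes, the ranking between them should be read with caution.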

Combined METEOR for models tested


Combined BLEU-1 for models tested


Combined BLEU-2 for models tested

