A Comprehensive Analysis of Image Captioning Models - Evaluating ViT-GPT2, BLIP, and GIT

This project is a comparative study of image caption generation models. It aims to provide:

A detailed breakdown of the architectures and mechanisms of ViT-GPT2, BLIP, and GIT.
A quantitative and qualitative analysis of their performance.
Insights into the strengths, limitations, and suitability of each model for real-world applications.

The dataset utilized for this study consists of 600 images, sourced exclusively from open-access platforms to ensure accessibility and reproducibility. Each image was meticulously self-annotated with high-quality captions to create a reliable ground truth for evaluating the models’ performance.

Visit the Project on GitHub View data in Kaggle

Dataset Composition

Animals - Includes various species in diverse settings, such as wildlife, pets, and zoos.
Humans - Depicts people in natural environments, performing activities, and interacting with objects.
Architecture - Captures man-made structures, including buildings, bridges, and urban landscapes.
Natural Formations and Nature - Covers landscapes, forests, mountains, rivers, and other natural scenes.
Everyday Objects - Features commonly found objects, such as tools, household items, and vehicles.

The data we collected has been uploaded to Kaggle (see the Kaggle link above).

Dataset Challenges

Diversity of Visual Content: Ensuring the dataset captures a wide variety of visual scenes and objects for generalizability.
Annotation Quality: Maintaining consistency in style and accuracy across all annotated captions.
Ambiguity: Handling images with multiple possible interpretations, where different valid captions could describe the same image.

Sample Image and Prompt

Custom Annotation - Man attempting a slam dunk
vit-gpt2 - a woman jumping in the air to catch a frisbee
blip-conditional - a photography of a basketball player jumping to the basket
blip-unconditional - a man jumping in the air with a basketball
git - a young man playing basketball in a gym
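
The two BLIP captions above differ only in whether a text prompt is passed to the model together with the image. The sketch below illustrates that difference. It assumes the Hugging Face transformers library and the Salesforce/blip-image-captioning-base checkpoint; this write-up does not name the exact checkpoint or generation settings, so both are assumptions, and blip_caption is a hypothetical helper name.

```python
from typing import Optional

# Assumed checkpoint; the study does not state which BLIP weights were used.
BLIP_CHECKPOINT = "Salesforce/blip-image-captioning-base"


def blip_caption(image_path: str, prompt: Optional[str] = None,
                 max_new_tokens: int = 30) -> str:
    """Caption one image with BLIP, optionally guided by a text prompt."""
    # Imported lazily so the module can be loaded without the heavy dependencies.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # Loaded on every call for simplicity; cache these when captioning many images.
    processor = BlipProcessor.from_pretrained(BLIP_CHECKPOINT)
    model = BlipForConditionalGeneration.from_pretrained(BLIP_CHECKPOINT)

    image = Image.open(image_path).convert("RGB")
    if prompt is None:
        # Unconditional mode: the model sees only the image.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional mode: the caption continues the supplied prompt.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Under these assumptions, blip_caption(path) would give the unconditional caption and blip_caption(path, prompt="a photography of") the conditional one, which would account for the "a photography of" prefix in the sample.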

Models used for caption generation

The models used for caption generation are ViT-GPT2, BLIP (in conditional and unconditional modes), and GIT.

The captions folder in the repository contains all ground-truth captions as well as the model-generated captions.

The annotations can be viewed in this sheet or in the all_captions.csv file.
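
A file like this can be read into parallel lists of references and predictions before scoring. The sketch below is only illustrative: the column names and the file name sample.jpg are hypothetical, since the actual layout of all_captions.csv is not described here. The single row reuses the sample captions shown earlier.

```python
import csv
import io

# Hypothetical layout: one row per image, one column per caption source.
SAMPLE_CSV = """image,ground_truth,vit_gpt2,blip_conditional,blip_unconditional,git
sample.jpg,Man attempting a slam dunk,a woman jumping in the air to catch a frisbee,a photography of a basketball player jumping to the basket,a man jumping in the air with a basketball,a young man playing basketball in a gym
"""


def load_captions(handle, reference_column="ground_truth", id_column="image"):
    """Return (references, predictions); predictions maps model name to captions."""
    references = []
    predictions = {}
    for row in csv.DictReader(handle):
        references.append(row[reference_column])
        for column, caption in row.items():
            # Every remaining column is treated as one model's output.
            if column not in (reference_column, id_column):
                predictions.setdefault(column, []).append(caption)
    return references, predictions


references, predictions = load_captions(io.StringIO(SAMPLE_CSV))
for model, captions in predictions.items():
    print(f"{model}: {captions[0]}")
```

For the real file, the same function could be called on open("all_captions.csv", newline="") once the column names are adjusted to match.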

We used three metrics for our comparative study:

METEOR
BLEU-1
BLEU-2
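
To make the BLEU variants concrete, here is a minimal single-reference sketch written from the standard definition (clipped n-gram precision, geometric mean, brevity penalty). It is not the code from score.ipynb: the whitespace tokenization, lowercasing, and lack of smoothing are assumptions, and a library implementation may give slightly different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Cumulative BLEU up to max_n for one candidate and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if not cand or min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    # Brevity penalty punishes candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "Man attempting a slam dunk"
candidate = "a man jumping in the air with a basketball"
print(round(bleu(candidate, reference, 1), 4))  # BLEU-1
print(round(bleu(candidate, reference, 2), 4))  # BLEU-2
```

For the sample pair, only "a" and "man" match the reference, so BLEU-1 is 2/9, and no two-word phrase matches, so the unsmoothed BLEU-2 is 0. This shows how strict n-gram overlap is when a single short reference is available per image.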

Results

The results are calculated in the score.ipynb notebook. The table and graphs obtained from the study are shown below:

Model	METEOR	BLEU-1	BLEU-2
ViT-GPT2	0.1644	0.1845	0.0816
GIT	0.2207	0.2301	0.1170
BLIP (Conditional)	0.2418	0.2357	0.1246
BLIP (Unconditional)	0.2426	0.2555	0.1327

Quantitative Results and Qualitative Analysis

The quantitative results are:

BLIP (Unconditional Mode) achieved the highest scores across all metrics (METEOR: 0.2426, BLEU-1: 0.2555, BLEU-2: 0.1327).
BLIP (Conditional Mode) followed closely but did not improve on the unconditional mode, scoring slightly lower on every metric (METEOR: 0.2418, BLEU-1: 0.2357, BLEU-2: 0.1246).
GIT demonstrated a balanced performance, placing third on every metric (METEOR: 0.2207, BLEU-1: 0.2301, BLEU-2: 0.1170).
ViT-GPT2 performed the weakest, struggling with visual-text alignment (METEOR: 0.1644, BLEU-1: 0.1845, BLEU-2: 0.0816).
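
The ranking and the size of the gaps can be checked directly from the reported scores. The snippet below is a small sketch that uses only the numbers from the results table; the helper names ranking and relative_gain are introduced here for illustration.

```python
# Scores copied from the results table.
SCORES = {
    "ViT-GPT2": {"METEOR": 0.1644, "BLEU-1": 0.1845, "BLEU-2": 0.0816},
    "GIT": {"METEOR": 0.2207, "BLEU-1": 0.2301, "BLEU-2": 0.1170},
    "BLIP (Conditional)": {"METEOR": 0.2418, "BLEU-1": 0.2357, "BLEU-2": 0.1246},
    "BLIP (Unconditional)": {"METEOR": 0.2426, "BLEU-1": 0.2555, "BLEU-2": 0.1327},
}
METRICS = ("METEOR", "BLEU-1", "BLEU-2")


def ranking(metric):
    """Model names ordered from best to worst on one metric."""
    return sorted(SCORES, key=lambda model: SCORES[model][metric], reverse=True)


def relative_gain(model, baseline, metric):
    """Fractional improvement of model over baseline on one metric."""
    base = SCORES[baseline][metric]
    return (SCORES[model][metric] - base) / base


for metric in METRICS:
    print(metric, ranking(metric))

gain = relative_gain("BLIP (Unconditional)", "ViT-GPT2", "METEOR")
print(f"BLIP (Unconditional) vs ViT-GPT2 on METEOR: {gain:.1%}")
```

The order is the same on all three metrics, and the METEOR gap between the best and weakest model is about 48% in relative terms, while the two BLIP modes differ by less than 0.001 on METEOR.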

Our qualitative observations are:

BLIP models generated semantically rich and contextually accurate captions.
GIT provided coherent but sometimes generic captions.
ViT-GPT2 struggled with misidentification and irrelevant outputs; in the sample above it described the slam dunk as a woman catching a frisbee.

Model Strengths and Weaknesses

The strengths are as follows:

BLIP’s Dual Mode (Conditional/Unconditional) allowed better flexibility in caption generation.
GIT’s unified transformer architecture helped in balancing vision-language processing.
ViT-GPT2’s modularity enabled adaptability in vision and text alignment.

The weaknesses are as follows:

BLIP required significant computational resources.
GIT lacked interpretability due to its tightly coupled vision-language representation.
ViT-GPT2 frequently misidentified objects and actions, leading to less reliable captions.

Evaluation Metrics

- METEOR captured semantic accuracy.
- BLEU-1 and BLEU-2 measured n-gram precision: BLEU-1 scores single-word matches, while BLEU-2 also scores two-word phrases.
- Other advanced metrics (CIDEr, ROUGE-L, SPICE) were not included, limiting evaluation depth.
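
As a rough illustration of how METEOR differs from BLEU, the sketch below implements a simplified, exact-match-only version: it combines precision and recall into a recall-weighted mean and then applies a fragmentation penalty for matches that are out of order. Full METEOR also matches stems and synonyms, which is where its semantic sensitivity comes from, and that part is omitted here. The parameter values are commonly used defaults and are an assumption about the study's setup; the greedy alignment is a simplification.

```python
def simple_meteor(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR sketch for one candidate and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # Greedy alignment: each candidate token takes the first unused identical
    # reference token (full METEOR searches for the alignment with fewest chunks).
    used = set()
    alignment = []  # pairs of (candidate index, reference index)
    for i, token in enumerate(cand):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(cand)
    recall = matches / len(ref)
    # Harmonic mean weighted toward recall when alpha is close to 1.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = gamma * (chunks / matches) ** beta

    return f_mean * (1 - penalty)


score = simple_meteor("a man jumping in the air with a basketball",
                      "Man attempting a slam dunk")
print(round(score, 4))
```

For the sample pair, two words match (precision 2/9, recall 2/5), but they appear in a different order than in the reference, so they form two chunks and the penalty halves the recall-weighted mean.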

Limitations

- Small dataset size (600 images) reduced statistical reliability.
- The lack of advanced evaluation metrics prevented a deeper analysis.
- Real-world applicability was not tested, limiting practical insights.
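
One way to quantify the first limitation would be to report a bootstrap confidence interval around each mean score. The sketch below was not part of the study: bootstrap_ci is a hypothetical helper, and toy_scores are made-up placeholder values used only to show the call, not results from the experiment.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-image scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the images with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return low, high


# Placeholder per-image scores for illustration only (not study data).
toy_scores = [0.0, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 1.0]
low, high = bootstrap_ci(toy_scores)
print(f"mean={statistics.fmean(toy_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Applied to the 600 per-image scores of two models, overlapping intervals would suggest that a small gap, such as the one between the two BLIP modes, may not be reliable.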

Combined METEOR for models tested

Combined BLEU-1 for models tested

Combined BLEU-2 for models tested
