<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Computer Vision on Biraj Koirala</title><link>https://birajkoirala.com.np/tags/computer-vision/</link><description>Recent content in Computer Vision on Biraj Koirala</description><generator>Source Themes academia (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Copyright &amp;copy; {year}</copyright><lastBuildDate>Tue, 10 Dec 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://birajkoirala.com.np/tags/computer-vision/index.xml" rel="self" type="application/rss+xml"/><item><title>A Comprehensive Analysis of Image Captioning Models - Evaluating ViT-GPT2, BLIP, and GIT</title><link>https://birajkoirala.com.np/post/4.image-caption-analysis/</link><pubDate>Tue, 10 Dec 2024 00:00:00 +0000</pubDate><guid>https://birajkoirala.com.np/post/4.image-caption-analysis/</guid><description>&lt;h2>Table of Contents&lt;/h2>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#dataset-composition">Dataset Composition&lt;/a>&lt;/li>
&lt;li>&lt;a href="#dataset-challenges">Dataset Challenges&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sample-image-and-prompt">Sample Image and Prompt&lt;/a>&lt;/li>
&lt;li>&lt;a href="#models-used-for-caption-generation">Models used for caption generation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">Results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#quantitative-results-and-qualitative-analysis">Quantitative Results and Qualitative Analysis&lt;/a>&lt;/li>
&lt;li>&lt;a href="#model-strengths-and-weakness">Model Strengths and Weakness&lt;/a>&lt;/li>
&lt;li>&lt;a href="#evaluation-metrics">Evaluation Metrics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#limitations">Limitations&lt;/a>&lt;/li>
&lt;li>&lt;a href="#combined-meteor-for-models-tested">Combined METEOR for models tested&lt;/a>&lt;/li>
&lt;li>&lt;a href="#combined-bleu-1-for-models-tested">Combined BLEU-1 for models tested&lt;/a>&lt;/li>
&lt;li>&lt;a href="#combined-bleu-2-for-models-tested">Combined BLEU-2 for models tested&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;p>This project is a comparative study of image caption generation models. The experiment aims to provide:&lt;/p>
&lt;ul>
&lt;li>A detailed breakdown of the architectures and mechanisms of ViT-GPT2, BLIP, and GIT.&lt;/li>
&lt;li>A quantitative and qualitative analysis of their performance.&lt;/li>
&lt;li>Insights into the strengths, limitations, and suitability of each model for real-world applications.&lt;/li>
&lt;/ul>
&lt;p>The dataset utilized for this study consists of 600 images, sourced exclusively from open-access platforms to ensure accessibility and reproducibility. Each image was meticulously self-annotated with high-quality captions to create a reliable ground truth for evaluating the models&amp;rsquo; performance.&lt;/p>
&lt;a class="btn btn-primary "
href="https://github.com/biraj094/image-caption-analysis"
target="_blank">Visit the Project on GitHub
&lt;/a>
&lt;a class="btn btn-primary "
href="https://www.kaggle.com/datasets/koiralabiraj/image-annotation/data"
target="_blank">View data in Kaggle
&lt;/a>
&lt;hr>
&lt;h2 id="dataset-composition">Dataset Composition&lt;/h2>
&lt;div class="custom-list ">
&lt;ol style="list-style-type:decimal">
&lt;li>&lt;span style="text-decoration: underline;"> Animals &lt;/span> - Includes various species in diverse settings, such as wildlife, pets, and zoos.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;"> Humans &lt;/span> - Depicts people in natural environments, performing activities, and interacting with objects.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;"> Architecture &lt;/span> - Captures man-made structures, including buildings, bridges, and urban landscapes.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;"> Natural Formations and Nature &lt;/span> - Covers landscapes, forests, mountains, rivers, and other natural scenes.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;"> Everyday Objects &lt;/span> - Features commonly found objects, such as tools, household items, and vehicles.&lt;/li>
&lt;/ol>
&lt;/div>
&lt;p>The data we collected has been uploaded to Kaggle. Please check it out &lt;a href="https://www.kaggle.com/datasets/koiralabiraj/image-annotation/data">here&lt;/a>.&lt;/p>
&lt;hr>
&lt;h2 id="dataset-challenges">Dataset Challenges&lt;/h2>
&lt;div class="custom-list ">
&lt;ol style="list-style-type:decimal">
&lt;li>&lt;span style="text-decoration: underline;">Diversity of Visual Content&lt;/span>: Ensuring the dataset captures a wide variety of visual scenes and objects for generalizability.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;">Annotation Quality&lt;/span>: Maintaining consistency in style and accuracy across all annotated captions.&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;">Ambiguity&lt;/span>: Handling images with multiple possible interpretations, where different valid captions could describe the same image.&lt;/li>
&lt;/ol>
&lt;/div>
&lt;hr>
&lt;h2 id="sample-image-and-prompt">Sample Image and Prompt&lt;/h2>
&lt;figure class="figure">
&lt;img src="cat9_img1.jpeg"
alt=""
title="Sample Image"
style="width: 25%; height: auto;">
&lt;/figure>
&lt;div class="custom-list ">
&lt;ol style="list-style-type:decimal">
&lt;li>&lt;span style="text-decoration: underline;">Custom Annotation&lt;/span> - Man attempting a slam dunk&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;">vit-gpt2&lt;/span> - a woman jumping in the air to catch a frisbee&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;">blip-conditional&lt;/span> - a photography of a basketball player jumping to the basket&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;">blip-unconditional&lt;/span> - a man jumping in the air with a basketball&lt;/li>
&lt;li>&lt;span style="text-decoration: underline;"> git&lt;/span> - a young man playing basketball in a gym&lt;/li>
&lt;/ol>
&lt;/div>
&lt;hr>
&lt;h2 id="models-used-for-caption-generation">Models used for caption generation&lt;/h2>
&lt;p>The models that were used for caption generation are:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://huggingface.co/nlpconnect/vit-gpt2-image-captioning">vit-gpt2&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://huggingface.co/Salesforce/blip-image-captioning-base">blip-conditional by Salesforce&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://huggingface.co/Salesforce/blip-image-captioning-base">blip-unconditional by Salesforce&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://huggingface.co/microsoft/git-base">git by Micosoft&lt;/a>&lt;/li>
&lt;/ul>
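&lt;p>As a rough illustration of how these checkpoints can be queried, the sketch below uses the Hugging Face &lt;code>image-to-text&lt;/code> pipeline. The &lt;code>CHECKPOINTS&lt;/code> mapping and &lt;code>caption_image&lt;/code> helper are our own illustrative names, not code from the project repository, and the generation settings used in the study may differ.&lt;/p>

```python
# Illustrative sketch only: the checkpoint names come from the links
# above, but CHECKPOINTS and caption_image are hypothetical helpers,
# not code from the project repository.

# BLIP's conditional and unconditional modes share one checkpoint; the
# conditional mode additionally passes a text prompt at inference time.
CHECKPOINTS = {
    "vit-gpt2": "nlpconnect/vit-gpt2-image-captioning",
    "blip": "Salesforce/blip-image-captioning-base",
    "git": "microsoft/git-base",
}

def caption_image(image_path: str, model_key: str) -> str:
    """Generate one caption for an image with the chosen model."""
    # Imported lazily so the mapping above can be inspected without
    # pulling in the heavy transformers dependency.
    from transformers import pipeline
    captioner = pipeline("image-to-text", model=CHECKPOINTS[model_key])
    return captioner(image_path)[0]["generated_text"]
```

&lt;p>For example, &lt;code>caption_image("cat9_img1.jpeg", "git")&lt;/code> would download the GIT checkpoint on first use and return a single caption string.&lt;/p>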
&lt;p>The folder &lt;code>captions&lt;/code> in the repository contains all ground-truth captions as well as the model-generated captions.&lt;/p>
&lt;p>The annotations can be viewed in this &lt;a href="https://docs.google.com/spreadsheets/d/18qtOlw3fx2U0tpsXaBPplqvpL3YJEQoUMMsRXHYoeHU/edit?usp=sharing">sheet&lt;/a> or in the &lt;code>all_captions.csv&lt;/code> file.&lt;/p>
&lt;p>We used three metrics for our comparative study:&lt;/p>
&lt;ul>
&lt;li>METEOR&lt;/li>
&lt;li>BLEU-1&lt;/li>
&lt;li>BLEU-2&lt;/li>
&lt;/ul>
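&lt;p>To make the BLEU numbers in the results below concrete, here is a minimal single-reference BLEU sketch: clipped n-gram precision combined with a brevity penalty, without smoothing. This is a simplification for intuition; the study's notebook may use a library implementation with different details.&lt;/p>

```python
# Minimal single-reference BLEU-1/BLEU-2 sketch (no smoothing),
# illustrating what the table's BLEU scores measure.
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def bleu(hyp, ref, max_n):
    """Geometric mean of n-gram precisions up to max_n, times brevity penalty."""
    precisions = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

reference = "a man jumping in the air with a basketball".split()
hypothesis = "a man jumping with a basketball".split()
print(round(bleu(hypothesis, reference, 1), 3))  # 0.607 (BLEU-1)
print(round(bleu(hypothesis, reference, 2), 3))  # 0.542 (BLEU-2)
```

&lt;p>Note how BLEU-2 is always at most BLEU-1 for the same pair, which matches the pattern in the results table: bigram matches are strictly harder than unigram matches.&lt;/p>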
&lt;hr>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The results are calculated in the &lt;code>score.ipynb&lt;/code> notebook. The table and the graphs obtained from the study are shown below:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model&lt;/th>
&lt;th>METEOR&lt;/th>
&lt;th>BLEU-1&lt;/th>
&lt;th>BLEU-2&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>ViT-GPT2&lt;/td>
&lt;td>0.1644&lt;/td>
&lt;td>0.1845&lt;/td>
&lt;td>0.0816&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>GIT&lt;/td>
&lt;td>0.2207&lt;/td>
&lt;td>0.2301&lt;/td>
&lt;td>0.1170&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BLIP (Conditional)&lt;/td>
&lt;td>0.2418&lt;/td>
&lt;td>0.2357&lt;/td>
&lt;td>0.1246&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BLIP (Unconditional)&lt;/td>
&lt;td>0.2426&lt;/td>
&lt;td>0.2555&lt;/td>
&lt;td>0.1327&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="quantitative-results-and-qualitative-analysis">Quantitative Results and Qualitative Analysis&lt;/h2>
&lt;p>The quantitative results are:&lt;/p>
&lt;ul>
&lt;li>BLIP (Unconditional Mode) achieved the highest scores across all metrics (METEOR: 0.2426, BLEU-1: 0.2555, BLEU-2: 0.1327).&lt;/li>
&lt;li>BLIP (Conditional Mode) closely followed, showing slight improvements in guided captioning (METEOR: 0.2418, BLEU-1: 0.2357, BLEU-2: 0.1246).&lt;/li>
&lt;li>GIT demonstrated a balanced performance (METEOR: 0.2207, BLEU-1: 0.2301, BLEU-2: 0.1170).&lt;/li>
&lt;li>ViT-GPT2 performed the weakest, struggling with visual-text alignment (METEOR: 0.1644, BLEU-1: 0.1845, BLEU-2: 0.0816).&lt;/li>
&lt;/ul>
&lt;p>Our qualitative analysis found:&lt;/p>
&lt;ul>
&lt;li>BLIP models generated semantically rich and contextually accurate captions.&lt;/li>
&lt;li>GIT provided coherent but sometimes generic captions.&lt;/li>
&lt;li>ViT-GPT2 struggled with misidentification and irrelevant outputs.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="model-strengths-and-weakness">Model Strengths and Weakness&lt;/h2>
&lt;p>The strength are as follows:&lt;/p>
&lt;ul>
&lt;li>BLIP’s Dual Mode (Conditional/Unconditional) allowed better flexibility in caption generation.&lt;/li>
&lt;li>GIT’s unified transformer architecture helped in balancing vision-language processing.&lt;/li>
&lt;li>ViT-GPT2’s modularity enabled adaptability in vision and text alignment.&lt;/li>
&lt;/ul>
&lt;p>The weaknesses are as follows:&lt;/p>
&lt;ul>
&lt;li>BLIP required significant computational resources.&lt;/li>
&lt;li>GIT lacked interpretability due to its tightly coupled vision-language representation.&lt;/li>
&lt;li>ViT-GPT2 frequently misidentified objects and actions, leading to less reliable captions.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="evaluation-metrics">Evaluation Metrics&lt;/h2>
&lt;ul>
&lt;li>METEOR captured semantic accuracy.&lt;/li>
&lt;li>BLEU-1 and BLEU-2 measured word precision and phrase coherence.&lt;/li>
&lt;li>Other advanced metrics (CIDEr, ROUGE-L, SPICE) were not included, limiting evaluation depth.&lt;/li>
&lt;/ul>
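&lt;p>For intuition on the METEOR scores, the metric's core is the harmonic F-mean of unigram precision and recall, with recall weighted 9:1. The sketch below implements only that core with exact token matches; the full metric also matches stems and synonyms and applies a fragmentation penalty, both omitted here for clarity.&lt;/p>

```python
# Simplified sketch of METEOR's core F-mean (exact unigram matches only;
# no stemming, synonym matching, or fragmentation penalty).
from collections import Counter

def meteor_fmean(hyp, ref):
    """Harmonic F-mean of unigram precision and recall, recall weighted 9:1."""
    # Counter intersection gives the clipped per-token match counts.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)

reference = "man attempting a slam dunk".split()
hypothesis = "a man jumping in the air with a basketball".split()
print(round(meteor_fmean(hypothesis, reference), 3))  # 0.370
```

&lt;p>The heavy recall weighting is why METEOR tends to reward captions that cover the reference's content words even when they add extra detail.&lt;/p>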
&lt;hr>
&lt;h2 id="limitations">Limitations&lt;/h2>
&lt;ul>
&lt;li>The small dataset size (600 images) reduced statistical reliability.&lt;/li>
&lt;li>The absence of advanced evaluation metrics limited deeper analysis.&lt;/li>
&lt;li>Real-world applicability was not tested, limiting practical insights.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="combined-meteor-for-models-tested">Combined METEOR for models tested&lt;/h2>
&lt;p>&lt;figure class="figure">
&lt;img src="combined-meteor-1.png"
alt=""
title="METEOR1"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-meteor-2.png"
alt=""
title="METEOR2"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-meteor-3.png"
alt=""
title="METEOR3"
style="width: 50%; height: auto;">
&lt;/figure> &lt;/p>
&lt;hr>
&lt;h2 id="combined-bleu-1-for-models-tested">Combined BLEU-1 for models tested&lt;/h2>
&lt;p>&lt;figure class="figure">
&lt;img src="combined-bleu1-1.png"
alt=""
title="BLEU1-1"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-bleu1-2.png"
alt=""
title="BLEU1-2"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-bleu1-3.png"
alt=""
title="BLEU1-3"
style="width: 50%; height: auto;">
&lt;/figure> &lt;/p>
&lt;hr>
&lt;h2 id="combined-bleu-2-for-models-tested">Combined BLEU-2 for models tested&lt;/h2>
&lt;p>&lt;figure class="figure">
&lt;img src="combined-bleu2-1.png"
alt=""
title="BLEU2-1"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-bleu2-2.png"
alt=""
title="BLEU2-2"
style="width: 50%; height: auto;">
&lt;/figure>
&lt;figure class="figure">
&lt;img src="combined-bleu2-3.png"
alt=""
title="BLEU2-3"
style="width: 50%; height: auto;">
&lt;/figure> &lt;/p>
&lt;hr>
&lt;a class="btn btn-primary "
href="https://github.com/biraj094/image-caption-analysis"
target="_blank">Visit the Project on GitHub
&lt;/a>
&lt;a class="btn btn-primary "
href="https://www.kaggle.com/datasets/koiralabiraj/image-annotation/data"
target="_blank">View data in Kaggle
&lt;/a></description></item></channel></rss>