
Aristotelis Tsoutsanis: Object recognition with LLMs



This thesis explores the development of Vision-Language Models (VLMs) built on pre-trained backbones, aiming to make these models more efficient and accessible by reducing the computational resources needed for training. The rise of Large Language Models (LLMs) in recent years has brought remarkable progress in natural language processing, with near-human performance on a wide range of tasks. Meanwhile, visual recognition remains a critical challenge in computer vision, playing a pivotal role in fields like robotics and autonomous driving. Vision-Language Models combine the strengths of visual and textual data, enabling them to tackle complex tasks like image captioning and visual question answering with high accuracy.
In this research, we use a two-stage training approach: pre-training and fine-tuning. During pre-training, we focus on transforming image embeddings into the text embedding space using adapters. This involves minimizing the Earth Mover's Distance between the image embedding distribution produced by the image encoder and the text embedding distribution of the LLM, so that the two embedding spaces align well. Because the LLM itself is not part of this training stage, computational costs are significantly lower. In the fine-tuning stage, the LLM is brought back into the pipeline: we use a quantized version of the LLM and apply Low-Rank Adaptation (LoRA), so that instead of updating the full weight matrices, we update low-rank matrices that approximate the necessary adjustments. We explore three types of adapters: a simple Multi-Layer Perceptron (MLP) adapter that provides a strong baseline, and two more sophisticated transformer-based adapters that use attention mechanisms to improve performance and alignment between modalities. The first applies blocks of self-attention and feed-forward layers directly to the image tokens, while the second employs learnable queries that selectively extract the most relevant image information through self-attention and cross-attention.
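To make the pre-training stage concrete, the following is a minimal PyTorch sketch of the idea, not the thesis implementation: the names (MLPAdapter, emd_loss), the dimensions, and the Sinkhorn-style entropic approximation of the Earth Mover's Distance are illustrative assumptions. The adapter maps frozen image-encoder embeddings into the LLM's text-embedding space, and the alignment loss is computed without any forward pass through the LLM.

```python
# Hedged sketch of the pre-training alignment (PyTorch assumed).
# Only the adapter is trainable; image encoder and LLM stay frozen.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Baseline adapter: a small MLP projecting image embeddings to the LLM dimension."""

    def __init__(self, image_dim: int, text_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        return self.net(image_embeds)


def emd_loss(img_tokens: torch.Tensor, txt_tokens: torch.Tensor,
             n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Entropic (Sinkhorn) approximation of the Earth Mover's Distance between
    two sets of embeddings, using cosine distance as the transport cost."""
    img = nn.functional.normalize(img_tokens, dim=-1)
    txt = nn.functional.normalize(txt_tokens, dim=-1)
    cost = 1.0 - img @ txt.T                       # (N_img, N_txt) cosine distances
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0], device=cost.device)
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1], device=cost.device)
    u = torch.ones_like(a)
    for _ in range(n_iters):                       # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = torch.diag(u) @ K @ torch.diag(v)  # approximate optimal transport plan
    return (transport * cost).sum()


# Example: align 196 image tokens (dim 1024) with 32 text tokens (dim 4096).
adapter = MLPAdapter(image_dim=1024, text_dim=4096)
image_embeds = torch.randn(196, 1024)              # from a frozen image encoder
text_embeds = torch.randn(32, 4096)                # from the frozen LLM embedding table
loss = emd_loss(adapter(image_embeds), text_embeds)
loss.backward()                                    # gradients reach only the adapter
```

The fine-tuning stage could look roughly as follows, assuming Hugging Face transformers, bitsandbytes, and peft as the tooling; the checkpoint name and LoRA hyperparameters are placeholders, not the thesis configuration. The LLM is loaded in quantized form and only low-rank matrices on the attention projections are trained.

```python
# Hedged sketch of quantized-LLM fine-tuning with LoRA (transformers + peft assumed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
                                           quantization_config=bnb_config)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()   # only the low-rank LoRA matrices are trainable
```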
Our experiments, conducted on the MSCOCO dataset, show that these pre-trained adapters are effective for handling vision-language tasks. However, the fine-tuning phase is essential for refining the model's accuracy and its ability to generate well-structured responses. By omitting the LLM during pre-training, our approach makes it feasible for individuals and smaller organizations to work with multi-modal models, broadening access to this advanced technology. The pre-training alignment also facilitates a smoother and more effective fine-tuning process, leading to faster convergence and better overall performance. In addition, the Food101 dataset was used to fine-tune our pipeline for classification tasks in order to quantify the performance of our architecture.
In summary, this thesis addresses the challenges of scalability and accessibility in vision-language models. We demonstrate that our model, TerraAlign, can be trained efficiently for image captioning on the MSCOCO dataset and for classification on the Food101 dataset, with promising results.

Master's thesis presentation. Aristotelis is advised by Mathias Sundholm (PreciTaste), Alexander Dolokov (PreciTaste) and Prof. Dr. Felix Dietrich.