Understanding the Role of Multimodal Models in Computer Vision

Multimodal models are an approach that utilises multiple types of data input for processing and analysis. These models are designed to integrate and interpret diverse forms of information to enhance the accuracy and efficiency of machine learning systems. Put simply, multimodality allows a machine learning system to process, for example, video and text at the same time to predict a single outcome.

With the ability to understand and learn from multiple data streams, these models can provide a more holistic view and deeper understanding of the context. Multimodality can bring us closer to the goal of creating intelligent systems that mimic human perception by integrating various sensory data in computer vision applications.

What is multimodal machine learning?

Just as humans use their five senses to understand and interpret the world, multimodal machine learning (MML) was developed to bring machine perception closer to human perception. A modality can be thought of as a 'sense' through which MML takes in information. Each modality is distinct and can hold very different data, such as images, text, speech, or sensor readings.

Until recently, ML systems could only learn from and process information in a single modality. Since the development of multimodal models, however, they have been able to process and interpret data from different modalities simultaneously. Single-modal models, often referred to as unimodal or monomodal models, are designed to work on data from a single modality – for example, image classification or speech recognition.

These single-modal models, although effective in their own right, are limited in their ability to analyse complex real-world data, which often comes from multiple sources and modalities. One of the best contemporary examples is GPT-4, which can respond to multimodal queries: OpenAI describes it as a multimodal model that accepts image and text inputs and produces text outputs.

The history of multimodality

The term multimodality can be traced back to the 1990s, and interest in it has grown rapidly in the academic world ever since. Despite the term's popularity, it proved hard to pin down, but Jeff Bezemer and Carey Jewitt described the concept as a 'means for making meaning' in Multimodality: A guide for linguists.

This is a meaningful framing, as it highlights that we, humans, use different means of communication to understand and convey information – and, as noted above, multimodal models mirror different human senses to better perceive and process information. In the data science world, multimodal learning is connected to the Boltzmann machine. Named after Ludwig Boltzmann, a 19th-century physicist and philosopher, a Boltzmann machine is a network in which every node is connected to every other node.

Once information is fed into such a machine, it learns the statistical structure of that data and can determine whether new information contains errors or abnormalities. While this is an oversimplification, Deep Boltzmann Machines (DBMs) have been used in various studies to build reliable multimodal learning models that represent the processed information correctly.

How do multimodal models work?

Multimodality comes in many architectural forms. In the most common format, the architecture consists of multiple unimodal neural networks, one per modality. In some designs, the features extracted from the different modalities are passed to a single shared network (as in early fusion). The outputs of these networks are then combined using various fusion techniques, which merge the information from each modality and use it to produce a more accurate result or prediction.

The initial stage of processing in a multimodal model is known as encoding. In this stage, each input modality is processed by its respective unimodal network. For example, in an audiovisual model, one network might process audio data while another processes visual data. Once the features are extracted, they are integrated, or fused. Several fusion techniques – ranging from simple concatenation to attention mechanisms – can be used for this purpose.

The success of these models largely depends on the effectiveness of this multimodal data fusion. Finally, a ‘decision’ network accepts the fused encoded information and is trained on a specific task. This could involve making predictions or decisions based on the joint representation of data generated by the fusion module.
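To make the encoder–fusion–decision pipeline concrete, here is a minimal PyTorch sketch. The layer sizes, modality choices and class names are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class SimpleMultimodalClassifier(nn.Module):
    """Illustrative audio-visual model: two unimodal encoders,
    concatenation fusion, and a small decision network."""

    def __init__(self, audio_dim=128, image_dim=512, hidden_dim=256, num_classes=7):
        super().__init__()
        # Unimodal encoders (stand-ins for real audio/vision backbones)
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        # Decision network operating on the fused representation
        self.decision = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_features, image_features):
        a = self.audio_encoder(audio_features)   # encode each modality separately
        v = self.image_encoder(image_features)
        fused = torch.cat([a, v], dim=-1)        # simple concatenation fusion
        return self.decision(fused)              # task-specific prediction

# Example usage with random placeholder features
model = SimpleMultimodalClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])
```

In a real system, the linear encoders would be replaced by pretrained backbones (for instance, a CNN for images and a spectrogram network for audio), but the overall structure stays the same.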


Multimodal fusion techniques

Given that multimodal ML combines different types of data to arrive at a single result, choosing the right fusion technique is crucial to getting the most out of the model. The most common ways of categorising data fusion approaches are:

Early fusion involves combining the raw data from different modalities into a single input vector which is then fed to the network. This approach requires aligning and pre-processing the data, which can be challenging due to differences in data formats, resolutions and sizes.

Late fusion processes each modality separately and then combines their outputs at a later stage. Late fusion can better handle the differences in data formats and modalities, but it can also lead to the loss of important information.

The intermediate fusion and hybrid fusion approaches may seem similar, but they are distinct. The former combines information from different modalities at intermediate stages of data processing. Hybrid fusion combines elements of both early and late fusion to create a more flexible and adaptable model; it is widely used and often outperforms pure early or late fusion.
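As a rough illustration of the difference between the first two approaches, here is a short PyTorch sketch contrasting early and late fusion. The feature dimensions and modality names are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Placeholder unimodal features, e.g. from an image backbone and a text encoder
image_feats = torch.randn(8, 512)
text_feats = torch.randn(8, 300)
num_classes = 5

# Early fusion: concatenate the inputs first, then train one network on the joint vector
early_fusion_net = nn.Sequential(
    nn.Linear(512 + 300, 256), nn.ReLU(), nn.Linear(256, num_classes)
)
early_logits = early_fusion_net(torch.cat([image_feats, text_feats], dim=-1))

# Late fusion: each modality gets its own classifier; combine the predictions afterwards
image_net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, num_classes))
text_net = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, num_classes))
late_logits = (image_net(image_feats) + text_net(text_feats)) / 2  # simple averaging

print(early_logits.shape, late_logits.shape)  # both: torch.Size([8, 5])
```

Intermediate and hybrid fusion sit between these two extremes, mixing features at one or more internal layers rather than only at the input or the output.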

It’s important to note that choosing the right fusion method depends on several factors that may warrant a specific approach regardless of its popularity and widespread use. Task complexity, available resources, data characteristics, domain knowledge, and computational efficiency must all be weighed objectively to determine the best method.

Single-modal models vs multimodal models

When we compare single-modal models with multimodal models, it's clear that the latter offer a more comprehensive approach to data analysis. Yet the real question is what the model is being used for. Single-modal models are efficient at processing data from a specific modality but less successful at capturing complete information from real-world data. For example, a model designed for image classification may fail to understand the context or emotion conveyed in an image without accompanying text or audio data.

In contrast, multimodal models integrate information from different modalities which creates a more accurate and informative model. They can capture the underlying structure and relationships between the input data from multiple modalities, enabling the model to make more accurate predictions or generate new outputs based on this multimodal input. Consider the task of emotion recognition.

While a single-modal model might rely solely on facial expressions or voice tone, a multimodal model can combine visual and audio cues to provide a more accurate analysis of a person's emotional state. It's important to mention that multimodal models have higher potential but are also more difficult to train. This is because the model has to learn the relationships between the modalities, and the representations of the different modalities need to be comparable – representing one modality with a 1,024-dimensional feature vector and another with only 16 dimensions, for instance, makes fusion much harder.

Multimodal models in computer vision  

Both single-modal and multimodal models have their own purpose and benefits. But how exactly do these benefits impact computer vision, which traditionally uses a single modality, such as an image, to make predictions? In this section, we'll review some of the most prominent use cases.

Image captioning

Image captioning is an aspect of computer vision and multimodal machine learning which has gained substantial traction in recent years. This technology generates a textual description of an image, combining visual and linguistic interpretation. The process uses complex algorithms that can understand and analyse the content within images and then formulate human-readable sentences that accurately describe what is present in those images.

Image captioning exists precisely because of multimodal machine learning: it merges two different data types, images and text. The goal is to create coherent sentences that not only describe the objects in the image but also capture the overall context.
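For a sense of how little code this can take in practice, here is a hedged sketch using a pretrained captioning model from the Hugging Face transformers library. The BLIP checkpoint name and the local image path are assumptions; any comparable vision-language captioning model would work similarly:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a pretrained vision-language captioning model (assumed checkpoint)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

# Generate a textual description of the image
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```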

Text-to-image generation

Text-to-image generation is a popular application of multimodal ML in which models are trained to create images based on textual descriptions. This involves translating text modalities into visual modalities while retaining the semantic meaning of the text. This technology offers significant potential to a wide range of fields, from advertising and entertainment to assisting visually impaired individuals.  

Progress in this field has been facilitated by advancements in computer vision and multimodal machine learning. Computer vision has enabled systems to extract meaningful information from visual data while multimodal machine learning allows for the integration of data from different sources or modes, such as text and images, to improve learning outcomes. The fusion of these two technologies provides a robust framework for text-to-image generation.
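As an illustration, a text-to-image pipeline can be driven in a few lines using the diffusers library and a pretrained latent diffusion model. The checkpoint name, prompt and hardware assumptions below are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion pipeline (assumed checkpoint)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a GPU; use .to("cpu") with float32 otherwise

prompt = "a watercolour painting of a lighthouse at sunset"
image = pipe(prompt).images[0]  # the text modality is translated into a visual one
image.save("lighthouse.png")
```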

Visual Question Answering (VQA)

Visual Question Answering (VQA) sits at the intersection of natural language processing, computer vision, and artificial intelligence, and represents a significant advancement in both ML and AI. This technology leverages multimodal machine learning techniques to answer questions about the content of digital images. Computer vision, a crucial component of VQA, enables machines to interpret and understand the visual world.

VQA operates by taking an image and a natural language question about the image as input, then providing a natural language answer as output. The process involves feature extraction from the image using computer vision algorithms, processing the input question with language understanding models, combining the information from both sources and generating an appropriate answer.
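The sketch below shows this pipeline end to end with a pretrained VQA model from the transformers library. The ViLT checkpoint, image path and question are assumptions for illustration:

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a pretrained visual question answering model (assumed checkpoint)
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical image
question = "How many people are crossing the road?"

# Encode both modalities together and predict an answer from a fixed answer set
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
answer_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[answer_idx])
```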

Emotion recognition

Emotion recognition is a task that can significantly benefit from multimodal learning. By combining visual, textual, and auditory data, multimodal models can provide a more accurate analysis of a person's emotional state than single-modal models. Emotion recognition systems use algorithms and machine learning to analyse facial expressions, body language, tone of voice, and other signals to identify human emotions, with computer vision handling the visual part of that analysis.

It enables computers to capture, process, and interpret images and videos in real time, mimicking the capabilities of human vision with a high level of accuracy and consistency. Integrating computer vision and multimodal machine learning can significantly broaden the scope and application of emotion recognition, potentially benefiting sectors such as healthcare, entertainment, customer service and security.

Key challenges of implementing multimodal models

While multimodal models have immense potential, their implementation comes with a set of unique challenges.  

Alignment and fusion – this involves ensuring that data from different modalities are synchronised, or aligned, in time, space or any other relevant dimension. Researchers are currently working on modality-invariant representations for multimodal learning: when different modalities express a similar semantic concept, their representations should be similar, or close, to each other (see the sketch after this list).

Translation – translation is the process of mapping one modality onto another. It concerns how one modality (for example, text) can be translated into another (for example, images) while preserving the semantic meaning. Translation is therefore an open-ended, subjective process with no single correct answer.
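To illustrate what 'close representations for the same concept' means in practice, here is a minimal sketch of contrastive alignment between image and text embeddings, in the spirit of CLIP-style training. The encoders, dimensions and temperature value are placeholder assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoders projecting each modality into a shared 128-dim space
image_encoder = nn.Linear(512, 128)
text_encoder = nn.Linear(300, 128)

# A batch of paired examples: image i is described by text i
image_feats = torch.randn(16, 512)
text_feats = torch.randn(16, 300)

img_emb = F.normalize(image_encoder(image_feats), dim=-1)
txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

# Cosine-similarity matrix between every image and every text in the batch
logits = img_emb @ txt_emb.t() / 0.07  # 0.07 is a commonly used temperature

# Contrastive loss: matching pairs (the diagonal) should score highest,
# which pushes semantically matching representations close together
targets = torch.arange(16)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```

Training with an objective like this encourages the two encoders to place matching image-text pairs near each other in the shared space, which is one way of tackling the alignment challenge described above.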

Final remarks

Multimodal models can benefit computer vision by enhancing a machine's ability to interpret and understand visual data. This could lead to further advancements in object detection, image recognition, and scene understanding. Multimodal models can also facilitate improved context awareness, enabling computers to understand and perceive complex environments. However, the technology is still in development. Training even a single-modal model requires huge amounts of data, time, and resources to achieve high-quality results, and for many specific tasks that is sufficient. But there's no doubt that multimodality is the next step in this advancement.

Interested to hear how we could benefit your business?

Let’s have a chat to explore the possibilities.

Book a meeting