Multimodal models are machine learning models that take in multiple types of data for processing and analysis. They are designed to integrate and interpret diverse forms of information to improve the accuracy and efficiency of machine learning systems. Put simply, multimodality allows a machine learning system to process, for example, video and text at the same time to predict a single outcome.
With the ability to understand and learn from multiple data streams, these models can provide a more holistic view and a deeper understanding of context. By integrating various kinds of sensory data, as in computer vision applications, multimodality brings us closer to the goal of intelligent systems that mimic human perception.
Just as humans use their five senses to understand and interpret the world, ML was developed to replicate human perception. A modality can therefore be thought of as a ‘sense’ through which multimodal machine learning, or MML, takes in information. Each modality is distinct and can carry very different data, such as images, text, speech, or sensor readings.
Until recently, ML systems could only learn and process information within a single modality. Since the development of multimodal models, however, ML can process and interpret data from different modalities simultaneously. In contrast, single-modal models, often referred to as unimodal or monomodal models, are designed to work on data from a single modality – for example, image classification or speech recognition.
These single-modal models, albeit effective in their own right, are limited in their ability to analyse complex real-world data, which often comes from multiple sources and modalities. One of the best-known contemporary examples of a multimodal model is GPT-4, which can respond to multimodal queries: OpenAI describes this GPT iteration as a multimodal model that accepts image and text inputs and produces text outputs.
The term multimodality can be traced back to the 1990s, and academic interest in it has grown rapidly ever since. Despite the term’s popularity, it proved hard to pin down; Jeff Bezemer and Carey Jewitt described the concept as a ‘means for making meaning’ in Multimodality: A guide for linguists.
That definition is telling: we humans use different means of communication to understand and convey information, and, as noted above, the modalities of a multimodal model play a role similar to human senses in perceiving and processing information. In the data science world, though, multimodal learning is usually traced back to the Boltzmann machine. Named after Ludwig Boltzmann, the 19th-century physicist and philosopher, the Boltzmann machine is a standalone model, and a building block of deep learning, in which every node is connected to every other node.
Once information is fed into such a machine, it can process it and determine whether it contains errors or abnormalities, because inputs that fit the learned patterns poorly stand out. While this is an oversimplification, Deep Boltzmann Machines (DBMs) have been used in various studies to build reliable multimodal learning models that represent the processed information correctly.
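To make the “every node is interconnected” picture concrete, here is a toy, fully connected Boltzmann machine in NumPy. It is a minimal sketch, not a trained model: the weights are random placeholders, whereas a trained machine assigns low energy to familiar inputs, which is one way unusual inputs can be flagged.

```python
# Toy fully connected Boltzmann machine: symmetric weights, binary units,
# an energy function and one round of Gibbs sampling.
import numpy as np

rng = np.random.default_rng(0)
n = 6                                  # number of binary units (nodes)
W = rng.normal(0, 0.1, size=(n, n))
W = (W + W.T) / 2                      # symmetric connections between all node pairs
np.fill_diagonal(W, 0)                 # no self-connections
b = np.zeros(n)                        # per-unit biases
s = rng.integers(0, 2, size=n)         # a binary state of the network

def energy(state):
    # Lower energy corresponds to a more probable state under the model.
    return -0.5 * state @ W @ state - b @ state

def gibbs_step(state):
    # Resample each unit given every other unit it is connected to.
    for i in range(n):
        p = 1.0 / (1.0 + np.exp(-(W[i] @ state + b[i])))
        state[i] = rng.random() < p
    return state

print(energy(s))
s = gibbs_step(s)
print(energy(s))
```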
Multimodal architectures come in many forms. In the most common one, the model consists of multiple unimodal neural networks, one per modality. In some designs, the features extracted from the different modalities are all passed to the same network (as in early fusion). The outputs of these networks are then combined using one of several fusion techniques, and the fused information is used to produce a more accurate result or prediction.
The initial stage of processing in a multimodal model is known as encoding. In this stage, each input modality is processed by its respective unimodal network. For example, in an audiovisual model, one network might process audio data while another processes visual data. Once the features are extracted, they get integrated, or fused. Several fusion techniques, ranging from simple concatenation to attention mechanisms, can be used for this purpose.
The success of these models largely depends on the effectiveness of this multimodal data fusion. Finally, a ‘decision’ network accepts the fused encoded information and is trained on a specific task. This could involve making predictions or decisions based on the joint representation of data generated by the fusion module.
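To make the encode, fuse and decide pipeline concrete, here is a minimal PyTorch sketch of an audiovisual classifier. The feature dimensions, layer sizes and the concatenation-based fusion are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        # Unimodal encoders: each modality gets its own network.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Decision network trained on the fused (joint) representation.
        self.decision = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, audio, visual):
        a = self.audio_encoder(audio)       # encode the audio stream
        v = self.visual_encoder(visual)     # encode the visual stream
        fused = torch.cat([a, v], dim=-1)   # fusion: simple concatenation
        return self.decision(fused)         # prediction from the joint representation

# Usage with random stand-in features for a batch of 4 examples.
model = MultimodalClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```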
Because multimodal ML combines different types of models to reach a single result, implementing the right fusion technique is crucial to getting the best result. The most common ways of categorising data fusion approaches are:
Early fusion involves combining the raw data from different modalities into a single input vector which is then fed to the network. This approach requires aligning and pre-processing the data, which can be challenging due to differences in data formats, resolutions and sizes.
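A minimal sketch of early fusion, assuming both modalities have already been pre-processed into fixed-size feature vectors; the normalisation step stands in for the alignment work mentioned above, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

audio_feats = torch.randn(4, 40)     # e.g. per-clip audio statistics
visual_feats = torch.randn(4, 2048)  # e.g. pooled image features

# Align the scales before mixing modalities (one of the pre-processing challenges noted above).
audio_feats = (audio_feats - audio_feats.mean(0)) / (audio_feats.std(0) + 1e-6)
visual_feats = (visual_feats - visual_feats.mean(0)) / (visual_feats.std(0) + 1e-6)

early_input = torch.cat([audio_feats, visual_feats], dim=-1)  # single fused input vector
early_net = nn.Sequential(nn.Linear(40 + 2048, 256), nn.ReLU(), nn.Linear(256, 10))
logits = early_net(early_input)
```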
Late fusion processes each modality separately and then combines their outputs at a later stage. Late fusion can better handle the differences in data formats and modalities, but it can also lose important cross-modal information, since the modalities never interact until the very end.
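A minimal late-fusion sketch under the same illustrative assumptions: each modality is scored by its own model and only the output probabilities are combined. Averaging is just one option; weighted sums or a small meta-model are common alternatives.

```python
import torch
import torch.nn as nn

audio_model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
visual_model = nn.Sequential(nn.Linear(2048, 64), nn.ReLU(), nn.Linear(64, 10))

audio_probs = audio_model(torch.randn(4, 40)).softmax(dim=-1)
visual_probs = visual_model(torch.randn(4, 2048)).softmax(dim=-1)

late_probs = (audio_probs + visual_probs) / 2  # combine outputs at the final stage
prediction = late_probs.argmax(dim=-1)
```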
The intermediate fusion and hybrid fusion approaches may seem similar, but they are distinct. The former fuses information from the different modalities at intermediate stages of processing, after some unimodal feature extraction but before the final decision. Hybrid fusion combines elements of both early and late fusion to create a more flexible and adaptable model; it is widely used and often considered superior to pure early or late fusion.
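A hedged sketch of a hybrid scheme that mixes an intermediate, feature-level branch with a late, prediction-level branch; the 50/50 weighting and layer sizes are purely illustrative, and in practice the mix is tuned or learned.

```python
import torch
import torch.nn as nn

audio, visual = torch.randn(4, 40), torch.randn(4, 2048)

audio_enc = nn.Sequential(nn.Linear(40, 64), nn.ReLU())
visual_enc = nn.Sequential(nn.Linear(2048, 64), nn.ReLU())
joint_head = nn.Linear(128, 10)                                  # intermediate branch: fused features
audio_head, visual_head = nn.Linear(64, 10), nn.Linear(64, 10)   # late branch: per-modality outputs

a, v = audio_enc(audio), visual_enc(visual)
joint_logits = joint_head(torch.cat([a, v], dim=-1))
late_logits = (audio_head(a) + visual_head(v)) / 2
hybrid_logits = 0.5 * joint_logits + 0.5 * late_logits
```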
It’s important to note that the right fusion approach depends on several factors, which may warrant a specific method regardless of its popularity and widespread use. Task complexity, available resources, data characteristics, domain knowledge, and computational efficiency must all be weighed objectively to determine the best method.
Let’s explore the possibilities.