Multimodal AI Modeling is the Future, But It’s Also Pandora’s Box

4 min read · Jan 6, 2022

by Jessica Hall


What’s it going to take to create an AI that can understand? We keep trying to make computers that act like brains, but we don’t have an easy path to building computers that can understand things like brains do. And what happens if we get what we asked for — intelligent systems that are smarter and faster than we are?

One key goal of neural networks and AGI (Artificial General Intelligence) is to mimic the fluent, responsive functions of the human brain to process complex information in real time. We want the computer to understand what we want it to do. Right now, neural nets like GPT-3 and DALL-E can respond to natural-language queries and produce human-quality sentences, even snippets of intelligible computer code. But they lack awareness. They don’t do well with subtext. They struggle to understand.

DALL-E — an AI designed to create images from text descriptions — created the above when handed the text: “A stained glass window with an image of a blue strawberry.” Image by OpenAI via VentureBeat

Understanding is, ironically, not all that well understood. To understand what’s going on around us, we often need to be aware of direct sensory input from all our senses, as well as remembered context, and we need to know what information to filter out. (The human body does the filtering part very well — so well, in fact, that until right this moment you probably weren’t consciously aware of your posture, or your tongue.) Scientists have made huge leaps in analyzing the connectomes of humans and other species. In the process we’ve learned that the brain integrates many different streams of information at once, all the while comparing them to memories already stored.

The discipline of building or teaching a neural net to have this kind of multifaceted awareness is called multimodal modeling. Tools like DALL-E are designed to generate images based on text descriptions, while CLIP (Contrastive Language-Image Pre-training) is intended to associate text and images more robustly than current AI models. Both are built by OpenAI, which writes:

Although deep learning has revolutionized computer vision, current approaches have…
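The core idea behind CLIP-style contrastive matching can be sketched in a few lines: a text encoder and an image encoder map their inputs into a shared vector space, and candidate images are ranked by how closely their embeddings align with the text embedding. The toy vectors and file names below are made up for illustration; a real system would get these embeddings from trained neural encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: how closely two embedding vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings standing in for the outputs of a text encoder
# and an image encoder that share one embedding space.
text_embedding = np.array([0.9, 0.1, 0.2])  # e.g. "a blue strawberry"
image_embeddings = {
    "blue_strawberry.png": np.array([0.8, 0.2, 0.1]),
    "red_apple.png":       np.array([0.1, 0.9, 0.3]),
}

# Rank candidate images by similarity to the text embedding,
# as a CLIP-style model does when matching captions to pictures.
scores = {name: cosine_similarity(text_embedding, vec)
          for name, vec in image_embeddings.items()}
best_match = max(scores, key=scores.get)
print(best_match)  # the image whose embedding best aligns with the text
```

The point of the contrastive setup is that matching text–image pairs are pulled close together in this space during training, while mismatched pairs are pushed apart, so simple similarity scores become meaningful at inference time.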

