Unveiling GILL: The Next Step in AI Chatbot Evolution
Engaging with an AI chatbot that crafts impressive text, such as ChatGPT, is certainly exciting. However, the experience becomes even more remarkable when an AI chatbot can generate both text and images—something that ChatGPT currently does not offer. Meet GILL, the new AI chatbot from Carnegie Mellon University, which excels in both areas.
You can witness GILL in action below:
GILL is the first AI chatbot that comprehends both text and images while also being able to generate them. Its innovative approach surpasses existing state-of-the-art models in unexpected ways. Welcome to the next generation of AI chatbots.
When Text Intersects with Images
Developing an AI model that performs well in a single modality (text, images, etc.) is already a significant challenge. Creating a model that operates across multiple modalities is a rarity. When GPT-4 was introduced, its ability to understand both text and images astonished many. However, it is worth noting that GPT-4's multimodal capabilities are limited to data processing rather than generation.
In contrast, GILL uniquely combines the realms of text and images, providing an experience that closely resembles human-like interaction. Achieving this required overcoming considerable challenges.
How Machines Perceive Our World
While humans naturally grasp both images and text, conveying this understanding to machines is no simple task. The question arises: How can we explain to a machine, which only understands numbers, the complexity of our world?
To address this, we convert text and images into vectors known as embeddings, which capture the meaning of the original elements. This transformation serves two primary purposes:
- It converts something incomprehensible to machines into a format they can interpret.
- It positions these elements, whether text or image, within a multi-dimensional space.
This approach lets us teach machines how our world operates by encoding context and meaning: the closer two vectors sit, the more related the concepts they represent. For example, the words "cat" and "dog" are similar concepts and thus occupy adjacent positions in the embedding space, while "bird" and "bride" have entirely different meanings and are positioned further apart.
For clarity, see the illustration below:
In a text-only embedding space like ChatGPT's, concepts such as "A little boy is walking" and "A little boy is running" are close together, whereas "Look how sad my cat is" is further away.
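To make this concrete, here is a minimal sketch of how proximity between embeddings is usually measured, via cosine similarity. The vectors below are toy values invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How aligned two embeddings are (1.0 = same direction, ~0 = unrelated)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, invented purely for illustration.
walking = np.array([0.9, 0.8, 0.1, 0.0])    # "A little boy is walking"
running = np.array([0.85, 0.9, 0.15, 0.0])  # "A little boy is running"
sad_cat = np.array([0.1, 0.0, 0.9, 0.8])    # "Look how sad my cat is"

print(cosine_similarity(walking, running))  # ~0.99: very similar meaning
print(cosine_similarity(walking, sad_cat))  # ~0.12: unrelated meaning
```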
If you’ve ever wondered how machines perceive our world, this is how they do it. You might think, "Problem solved!" but not so fast—this is more complicated than it appears.
Each model has its own representation of our world, meaning ChatGPT’s embedding space differs from that of Stable Diffusion. Consequently, a significant area of research in AI aims to establish commonality in these spaces. The goal is for models with entirely different modalities, like ChatGPT and Stable Diffusion, to share the same embedding space, allowing them to interpret our world similarly. For instance, we aim for a model that recognizes a text about a German shepherd and an image of a German shepherd as the same entity. Only solutions that achieve this common embedding space can be deemed truly multimodal, and that’s precisely what GILL accomplishes.
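To illustrate the idea (a sketch of the general technique, not GILL's exact recipe), a shared space is typically obtained by learning projections that map each model's native embeddings into one common space. The random matrices below merely stand in for projections that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each model has its own native embedding size.
TEXT_DIM, IMAGE_DIM, SHARED_DIM = 768, 1024, 512

# In a real system these projections are learned (e.g., with a contrastive
# objective, as in CLIP); random values here only demonstrate the shapes.
W_text = rng.normal(size=(TEXT_DIM, SHARED_DIM))
W_image = rng.normal(size=(IMAGE_DIM, SHARED_DIM))

text_emb = rng.normal(size=TEXT_DIM)    # embedding of "a German shepherd" (text)
image_emb = rng.normal(size=IMAGE_DIM)  # embedding of a German shepherd photo

# Both modalities land in the same 512-dimensional space, where a text and
# an image describing the same concept can be compared directly.
shared_text = text_emb @ W_text
shared_image = image_emb @ W_image
assert shared_text.shape == shared_image.shape == (SHARED_DIM,)
```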
A Chatbot that Exceeds Expectations
GILL’s core contribution lies in enabling models that perceive the world differently to communicate and understand one another. It merges a large language model (LLM) with Stable Diffusion (a text-to-image model) into a singular embedding space, showcasing its multifaceted capabilities:
- Comprehend text
- Comprehend images
- Generate text
- Generate images
- Retrieve images
No other chatbot in existence can achieve this. For GILL to function effectively, it must:
- Process images and text simultaneously
- Determine when to output text or an image
- Decide when to generate an image or retrieve one from a database
But how did three dedicated AI researchers manage to create this groundbreaking innovation?
The Genius of Maps as a Cost-Effective Solution
To accomplish this feat, the researchers developed the architecture depicted below. It may appear complex, but it's simpler than it looks:
Let's break it down quickly.
- Learning to Process Images (left side): GILL’s ability to process interleaved text and images, as demonstrated in the earlier gif, is one of its defining features. To train a machine for this capability, the team utilized a dataset of interleaved images and their corresponding captions. They developed a transformation matrix—indicated in green in the image—composed of parameters trained by the researchers. These parameters allow the model to convert images into vector embeddings that the text-only LLM can comprehend.
In simpler terms, they converted the image of a cat into a vector that the LLM recognizes and understands as a cat. The LLM then took this vector as input and predicted a caption, say, "A grey cat sitting in a basket," which was compared against the ground-truth caption, "A European shorthair cat in a woven basket." Training adjusts the mapping so that the model's predictions closely align with the original captions.
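Here is a minimal PyTorch sketch of that mapping step. The module name, dimensions, and number of visual tokens are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

# Assumed sizes: a CLIP-style visual encoder output and an LLM input width.
VISUAL_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 1024, 4096, 4

class ImageToLLMAdapter(nn.Module):
    """Maps one image embedding to a few 'soft tokens' a frozen, text-only
    LLM can read: a sketch of the green transformation matrix above."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(VISUAL_DIM, LLM_DIM * NUM_VISUAL_TOKENS)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # (batch, VISUAL_DIM) -> (batch, NUM_VISUAL_TOKENS, LLM_DIM)
        return self.proj(image_emb).view(-1, NUM_VISUAL_TOKENS, LLM_DIM)

adapter = ImageToLLMAdapter()
fake_image_emb = torch.randn(1, VISUAL_DIM)  # stand-in for a real encoder output
soft_tokens = adapter(fake_image_emb)        # prepend these to the caption tokens

# Training: feed [soft tokens; caption tokens] to the frozen LLM and minimize
# the captioning loss, updating only the adapter's parameters.
```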
- Learning to Generate (right side): The goal is for GILL to not only produce a text response but also generate an image when one is needed. To achieve this, the LLM's vocabulary is extended with special image tokens that it can emit to signal an image belongs at that point in the output. When those tokens appear, a decision module assesses whether to retrieve the image from a database or generate it with Stable Diffusion.
In layman’s terms, if GILL is asked to produce an image of a "pig with wings," it recognizes that such an image does not exist and must be generated.
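The decision step can be sketched as a small learned scorer over the LLM's hidden state for its image token. The head, threshold, and dimensions below are illustrative assumptions rather than GILL's exact implementation:

```python
import torch
import torch.nn as nn

LLM_DIM = 4096  # assumed LLM hidden size

# A small learned scorer over the hidden state of the LLM's image token.
# In GILL the decision rule is also learned; this threshold is illustrative.
decision_head = nn.Sequential(nn.Linear(LLM_DIM, 1), nn.Sigmoid())

def decide(img_token_state: torch.Tensor, threshold: float = 0.5) -> str:
    """Return 'generate' for novel concepts, 'retrieve' for common ones."""
    p_generate = decision_head(img_token_state).item()
    return "generate" if p_generate > threshold else "retrieve"

# A "pig with wings" is unlikely to exist in any database, so a well-trained
# head should push this case toward "generate".
print(decide(torch.randn(LLM_DIM)))
```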
Subsequently, the GILLMapper neural network transforms text embeddings from the LLM into embeddings that a text-to-image model like Stable Diffusion can interpret.
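The sketch below is an illustrative stand-in for GILLMapper, with assumed dimensions and a simplified architecture. In the paper, the mapper is trained so its outputs match the embeddings Stable Diffusion's own text encoder would produce for the caption:

```python
import torch
import torch.nn as nn

LLM_DIM, SD_TOKENS, SD_DIM = 4096, 77, 768  # SD conditioning is 77 x 768

class GILLMapperSketch(nn.Module):
    """Simplified stand-in: translates the LLM's image-token hidden states
    into the sequence of embeddings Stable Diffusion expects."""
    def __init__(self) -> None:
        super().__init__()
        self.queries = nn.Parameter(torch.randn(SD_TOKENS, SD_DIM))
        self.in_proj = nn.Linear(LLM_DIM, SD_DIM)
        self.attn = nn.MultiheadAttention(SD_DIM, num_heads=8, batch_first=True)

    def forward(self, img_token_states: torch.Tensor) -> torch.Tensor:
        # (batch, n_img_tokens, LLM_DIM) -> (batch, 77, SD_DIM)
        keys = self.in_proj(img_token_states)
        queries = self.queries.expand(img_token_states.size(0), -1, -1)
        out, _ = self.attn(queries, keys, keys)
        return out

mapper = GILLMapperSketch()
states = torch.randn(1, 8, LLM_DIM)  # hidden states of 8 image tokens
conditioning = mapper(states)        # condition Stable Diffusion on this
print(conditioning.shape)            # torch.Size([1, 77, 768])
```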
And there you have it.
For a clearer understanding, let’s look at a real-world example:
Left side of the image:
- The model receives an image of several cookies along with the request, "How should we display them?"
- It processes the image, transforms it, and combines it with the text embedding of the request.
Right side:
- The LLM generates a response that includes both text and an image.
- GILL's decision module evaluates whether to retrieve an existing image or create a new one. In this case, it opts to generate a new image.
> Technical protip: When retrieving images from the database, GILL utilizes the relatedness concept previously discussed. It compares the vector embedding of the desired image with the image database using cosine similarity, identifying the closest vector to its target. The closer the vector, the more semantically similar the image it represents is to the desired one.
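That retrieval step looks roughly like the sketch below, assuming a database of precomputed image embeddings:

```python
import numpy as np

def retrieve(target: np.ndarray, database: np.ndarray) -> int:
    """Return the index of the database image whose embedding has the
    highest cosine similarity with the target embedding."""
    db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    t_norm = target / np.linalg.norm(target)
    return int(np.argmax(db_norm @ t_norm))

# Toy database: 1,000 precomputed 512-dimensional image embeddings.
rng = np.random.default_rng(42)
database = rng.normal(size=(1000, 512))
target = rng.normal(size=512)  # embedding of the image GILL is looking for

print(f"Closest image in the database: index {retrieve(target, database)}")
```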
The output image tokens generated by the LLM are then transformed into valid image embeddings for Stable Diffusion, resulting in a response like, "I think they look best... between them," paired with the generated image.
With this innovative approach, GILL has redefined the future of AI chatbots.
But that’s not all—GILL has even more impressive features.
Surpassing Expectations
One of GILL’s most remarkable aspects is that, despite employing a relatively small language model (only 6.7 billion parameters, significantly smaller than leading LLMs), it achieves superior image generation in long-context scenarios while requiring minimal training resources (just 2 GPUs over 2 days).
What does this mean? It indicates that when generating images within a lengthy chat, taking the entire conversation's context into account, GILL outperforms the standard Stable Diffusion model (even though GILL uses the same Stable Diffusion generator under the hood) in producing semantically meaningful images.
The next image clearly illustrates this point:
A New Dawn
To me, GILL signifies a new beginning for chatbots. The advent of GILL makes traditional text-only chatbots feel inadequate. The future lies in chatbots that facilitate seamless interactions by integrating multiple modalities into a cohesive experience. GILL has paved the way, and now others must follow.