ChatGPT has the potential to redefine the way we search the internet, but currently, it's limited to text. This ignores one of the most used search engine features: images.
To that end, Microsoft has now unveiled Visual ChatGPT, an upgrade to the chatbot that enables it to both produce images from text and process image prompts uploaded by users.
While OpenAI itself has already dabbled in AI image generation with the DALL-E 2 system, Microsoft has set its sights higher. Visual ChatGPT is a step toward the multimodal AI that Microsoft revealed it was aiming for with the GPT-4 upgrade coming soon to Bing with ChatGPT.
This means that image processing could soon be joined by AI-powered video and sound tools.
The science bit: How does Visual ChatGPT work?
Bing with ChatGPT runs on OpenAI’s GPT Large Language Model (LLM) and Microsoft’s own Prometheus model. Most AI art generators use a Visual Foundation Model (VFM) like Stable Diffusion to produce images. These are normally effective but rather limited in scope. Microsoft revealed that to create Visual ChatGPT, it bolted a plethora of different VFMs onto the flexible GPT model.
This was achieved via the creation of a “Prompt Manager,” which Microsoft describes as helping “to bridge the gap between ChatGPT and these VFMs.” The Prompt Manager enables ChatGPT to “leverage these VFMs and receives their feedback in an iterative manner until it meets the requirements of users or reaches the ending condition.”
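To make that iterative loop a little more concrete, here is a minimal Python sketch of the idea. It is not Microsoft's code: the `PromptManager` class, the `planner` callable standing in for the language model, the stand-in tool functions and the `max_steps` cap are all hypothetical names chosen for illustration, and real VFMs would take the place of the fake functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A "tool" stands in for a Visual Foundation Model: it takes a text
# argument and returns a text observation (here, a made-up file name).
Tool = Callable[[str], str]

# A "planner" stands in for the language model: given the user's request
# and the history so far, it returns (tool_name, argument), or None when
# it considers the request satisfied.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]


@dataclass
class PromptManager:
    """Hypothetical dispatcher that loops between a planner and its tools."""

    tools: Dict[str, Tool]
    planner: Planner
    max_steps: int = 5  # assumed "ending condition" so the loop always stops
    history: List[str] = field(default_factory=list)

    def run(self, request: str) -> List[str]:
        for _ in range(self.max_steps):
            decision = self.planner(request, self.history)
            if decision is None:  # planner says the request is met
                break
            name, argument = decision
            observation = self.tools[name](argument)
            # Feed the tool's result back so the next decision can use it.
            self.history.append(f"{name}({argument}) -> {observation}")
        return self.history


# Stand-in tools; a real system would call image models here.
def fake_generate(prompt: str) -> str:
    return "image_1.png"


def fake_recolor(argument: str) -> str:
    return "image_2.png"


# A scripted planner that requests two steps and then stops.
def scripted_planner(request: str, history: List[str]) -> Optional[Tuple[str, str]]:
    script = [("generate", "a motorbike"), ("recolor", "image_1.png, red")]
    return script[len(history)] if len(history) < len(script) else None


manager = PromptManager(
    tools={"generate": fake_generate, "recolor": fake_recolor},
    planner=scripted_planner,
)
steps = manager.run("Draw a motorbike, then make it red")
for step in steps:
    print(step)
```

The point of the sketch is the shape of the loop: each tool result is appended to the history that the planner sees on its next turn, and the run ends either when the planner has nothing more to ask for or when the step limit is hit.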
How does it differ from AI image generators?
This has created an AI tool that can generate images from text and image prompts, deal with complicated requests that span multiple processes, and even offer input and feedback on images uploaded or generated.
Microsoft included an example on its GitHub page of a user asking the AI what color a motorbike was or getting it to identify the contents of a picture. Asked “What is in this image?”, the AI responded, “The image contains a yard.” It is interactions like this, and the ability to tweak and edit an image multiple times within the same session, that separate it from standard AI image generators.
What could Visual ChatGPT be used for?
If a Google Image search has ever left you wanting, then Visual ChatGPT could be a great way to create and refine an image that may not exist online already.
Photo editing software like Photoshop can be expensive and complex to use. Asking Bing to remove an object from an image or change a background’s color is a much quicker and simpler method.
The specific uses of such a tool are countless. Professionals could find a lot of use for Visual ChatGPT. Architects and interior designers could show clients what painting that wall blue or removing it completely would look like. Visually impaired users, meanwhile, could receive accurate AI descriptions of uploaded images.
Reservations and concerns
Of course, AI tools are still in their relative infancy, and with the likes of Bing and Google Bard making high-profile errors and battling quirks (we miss you, Sydney), there will likely be similar issues with Visual ChatGPT.
Similarly, when it comes to the internet, there will always be safety concerns. Inappropriate content is bound to make its way to Visual ChatGPT, and it will be interesting to see how Microsoft handles explicit content with its image and video AI tools. Even with content filters, there may be ways to bypass them, similar to the jailbroken ChatGPT "alter-ego" DAN.
The rise of edits and tweaks to photos may also call into question the authenticity of any image or video we see online. Social media already often features heavily idealized snapshots of life, and it’s easy to see some people being deceptive with these tools. Video and audio deepfakes are already a problem when it comes to spreading disinformation, and this will need to be monitored carefully.