Show ChatGPT what you see: Voice and image features are live (for a price)
Have a riveting conversation with AI while it helps to fix your bike.
What you need to know
- OpenAI adds image and voice recognition functions to ChatGPT, with the latter exclusive to mobile devices alongside a new advanced text-to-speech engine.
- Both features require a subscription to either ChatGPT Plus or ChatGPT Enterprise.
- The update will gradually roll out to English-speaking users worldwide during the next two weeks.
OpenAI is working toward a more natural user experience for ChatGPT by implementing voice and image communication that works both ways. In theory, users can spend less time typing and pondering the most effective prompts and more time seeing answers. Detailing its plans to gradually roll out these new capabilities in a recent blog post, OpenAI explains who will have access and when.
Those subscribed to an individual $20-per-month ChatGPT Plus plan or a business-focused Enterprise plan will begin to see image-based prompts and responses within the next two weeks on all platforms. Meanwhile, voice conversations will be exclusive to iOS and Android devices, with a manual opt-in found in the app's 'Settings' menu under 'New Features.' OpenAI aims to mitigate errors by gradually deploying these new modes, so don't fret if you can't see them yet.
ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm pic.twitter.com/paG0hMshXb (September 25, 2023)
Doesn't this technology already exist?
Although OpenAI takes apparent pride in this announcement, speech recognition and text-to-speech technologies have existed for years. Almost any smartphone app can transcribe your voice into written prompts, although the quality of results can vary depending on the underlying code. ChatGPT now uses Whisper, an open-source speech recognition system written by in-house developers, alongside a partnership with professional voice actors to train more lifelike speech for its generative AI.
While AI assistants like Bing Chat for mobile already exist on smartphones, ChatGPT demonstrates its new back-and-forth voice conversations with a rapid response time. Anything that reduces the time between interpreting spoken prompts and hearing a natural-sounding reply will undoubtedly appeal to anyone who prefers not to type on smaller screens.
An interesting tidbit from the announcement details how the new text-to-speech model can generate 'human-like audio from just text and a few seconds of sample speech,' which could be an exciting way for users to digitize custom-made voices for their AI assistants.
How can ChatGPT understand what it sees?
The most thrilling part of this update relates to ChatGPT's new ability to infer details from any image you provide. After opening your smartphone's camera for a quick snap, you can optionally highlight specific areas of inquiry. A demo video shows a user asking for help with lowering a bicycle seat, and sure enough, the app gives detailed answers with follow-up questions about the required tools. Naturally, the risk of mistaken identities and hallucinations immediately comes to mind, and OpenAI acknowledges the challenges ahead.
OpenAI already has experience with 'Be My Eyes,' an AI-powered mobile app that connects the sight-impaired community with volunteers who can help describe whatever the camera is pointed toward. Between this experience and the ChatGPT neural network, the accuracy of object and scene identification should improve over time as more information accumulates. However, restricting the AI from making statements about the appearance of individuals is part of the balance between ethical guidelines and technical limitations.
The image-recognition code harnesses a combination of GPT-3.5 and GPT-4, capable of recognizing anything from real-world photographs to digital screenshots and text documents. As with anything else related to the almost limitless potential of ChatGPT, OpenAI explains that this emerging technology is focused foremost on the English language. However, that may change in the future, which seems likely enough given the recent (and rapid) history of generative AI.
Ben is a Senior Editor at Windows Central, covering everything related to technology hardware and software. He regularly goes hands-on with the latest Windows laptops, components inside custom gaming desktops, and any accessory compatible with PC and Xbox. His lifelong obsession with dismantling gadgets to see how they work led him to pursue a career in tech-centric journalism after a decade of experience in electronics retail and tech support.