Anthropic's upgraded Claude AI model outperforms OpenAI-o1 in coding and can use a Windows 11 PC like humans — potentially backing NVIDIA CEO's claim about software development being dead in the water

Anthropic's upgraded Claude 3.5 Sonnet AI model (Image credit: Anthropic)

What you need to know

  • Anthropic recently shipped an upgraded version of Claude 3.5 Sonnet alongside a new model dubbed Claude 3.5 Haiku, with enhanced coding capabilities and more.
  • The AI firm also unveiled computer use, a new capability that allows users to prompt Claude to use computers as people do.
  • The company admits that shipping the capability to the public poses great risks, but it plans to use the release to observe how people are using the tool. It has measures in place to prevent misuse, such as restricting the model's access to the web during training.

The generative AI landscape is seemingly transitioning to the next phase beyond AI-generated images and text. Anthropic recently unveiled an upgraded version of Claude 3.5 Sonnet and a new model dubbed Claude 3.5 Haiku. According to the company, the upgraded Claude 3.5 Sonnet ships with enhanced coding capabilities, while Claude 3.5 Haiku matches the performance of Anthropic's Claude 3 Opus LLM on many evaluations.

More interesting is a new capability dubbed “Computer Use,” which is available in public beta. Through the API, developers "can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text." According to Anthropic, this makes Claude 3.5 Sonnet the first frontier AI model to offer computer use in public beta.
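The interaction pattern described above can be pictured as a loop: the model is shown the screen, picks an action, and the host application carries it out. The Python sketch below is purely illustrative. `FakeScreen`, `scripted_model`, and the action dictionaries are hypothetical stand-ins invented for this example; they are not Anthropic's API, and the "model" here simply replays a fixed plan.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a desktop: a cursor position and a text box."""
    cursor: tuple = (0, 0)
    text: str = ""
    clicks: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real system would return pixels; this returns a plain summary.
        return {"cursor": self.cursor, "text": self.text}

    def apply(self, action: dict) -> None:
        kind = action["type"]
        if kind == "move":
            self.cursor = (action["x"], action["y"])
        elif kind == "click":
            self.clicks.append(self.cursor)
        elif kind == "type":
            self.text += action["text"]
        else:
            raise ValueError(f"unknown action: {kind}")


def scripted_model(observation: dict, step: int) -> dict:
    """Hypothetical 'model': replays a fixed plan instead of reasoning."""
    plan = [
        {"type": "move", "x": 120, "y": 40},
        {"type": "click"},
        {"type": "type", "text": "hello"},
        {"type": "done"},
    ]
    return plan[step]


def run_agent(screen: FakeScreen, max_steps: int = 10) -> int:
    """Observe, decide, act, and repeat until the model reports it is done."""
    for step in range(max_steps):
        action = scripted_model(screen.screenshot(), step)
        if action["type"] == "done":
            return step
        screen.apply(action)
    return max_steps


screen = FakeScreen()
steps_taken = run_agent(screen)
print(steps_taken, screen.cursor, screen.clicks, screen.text)
```

In a real deployment the scripted plan would be replaced by calls to the model, and the fake screen by actual screenshot capture and input control.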

Anthropic admits that users could encounter several setbacks while interacting with the model, including errors and a not-so-seamless user experience. The company hopes to use feedback to enhance and improve the model's performance and efficiency.

Companies like Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have joined the fold to simplify processes that often require dozens of steps. For instance, "Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product."

The upgraded version of Claude 3.5 Sonnet is available on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. Anthropic is expected to ship Claude 3.5 Haiku later this month.

Per benchmarks shared by the company, Anthropic's updated Claude 3.5 Sonnet shows a significant performance boost, especially in coding. For instance, its score on SWE-bench Verified has improved from 33.4% to 49.0%, which puts it ahead of publicly available models, including OpenAI's o1 reasoning models (code-named Strawberry), while maintaining the same price and speed as its predecessor.
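To put the SWE-bench Verified figures in perspective, the short Python calculation below works out the absolute and relative improvement. The two scores come from the figures cited above; the variable names are just for illustration.

```python
before = 33.4  # SWE-bench Verified score (%) before the upgrade
after = 49.0   # SWE-bench Verified score (%) after the upgrade

absolute_gain = after - before                 # in percentage points
relative_gain = absolute_gain / before * 100   # improvement relative to the old score

print(f"Absolute gain: {absolute_gain:.1f} percentage points")  # 15.6
print(f"Relative gain: {relative_gain:.1f}%")                   # 46.7%
```

In other words, the upgraded model resolves roughly 47% more of the benchmark's tasks than its predecessor did, while still solving fewer than half of them overall.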

Related: NVIDIA CEO claims coding could be dead in the water with the prevalence of AI

The model corrects its mistakes by making another attempt at a task when it "realizes" it has hit an issue that is steering it away from the desired output. For comparison, OpenAI's o1 and o1-mini are also strong at coding and reportedly passed OpenAI's research engineer hiring interview for coding at a 90-100% rate.

AI agents are here but proceed with caution

While the highlighted improvements are impressive, the updated Claude 3.5 Sonnet AI model completed fewer than half of the tasks assigned in an evaluation designed to establish its proficiency in modifying flight reservations. It also failed approximately a third of the time when attempting to initiate a return.

Anthropic notes that the model struggles with zooming and scrolling, and that it can miss pop-up notifications because of how it processes screenshots. “Claude’s Computer Use remains slow and often error-prone,” the company added.
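One way to picture the pop-up limitation above is to assume the model only sees the screen at discrete moments, when a screenshot is taken. The Python sketch below is a toy model built on that assumption; the function name, the timings, and the capture intervals are made-up values for illustration, not figures from Anthropic.

```python
def popup_seen(popup_start: float, popup_duration: float,
               screenshot_interval: float, horizon: float) -> bool:
    """Return True if any screenshot is taken while the pop-up is visible."""
    n = 0
    # Screenshots are taken at t = 0, interval, 2 * interval, ... up to horizon.
    while n * screenshot_interval <= horizon:
        t = n * screenshot_interval
        if popup_start <= t < popup_start + popup_duration:
            return True
        n += 1
    return False


# A pop-up visible from t = 2.5s to t = 3.5s (hypothetical timings).
slow_capture_sees_popup = popup_seen(2.5, 1.0, screenshot_interval=4.0, horizon=20.0)
fast_capture_sees_popup = popup_seen(2.5, 1.0, screenshot_interval=0.5, horizon=20.0)

print(slow_capture_sees_popup)  # False: no screenshot falls inside the window
print(fast_capture_sees_popup)  # True: a screenshot at t = 2.5s catches it
```

Under this simplified view, anything that appears and disappears between two captures is invisible to the model, which is consistent with the kind of misses the company describes.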

The company admits that releasing the model to the public poses significant risks but argues that the benefits of observing how the model is used outweigh the dangers.

According to Anthropic:

"We think it’s far better to give access to computers to today’s more limited, relatively safer models. This means we can begin to observe and learn from any potential issues that arise at this lower level, building up computer use and safety mitigations gradually and simultaneously."

To keep bad actors from using the tool's sophisticated capabilities to cause harm, the new Claude 3.5 Sonnet isn't trained on users’ screenshots and prompts. It was also restricted from accessing the web during training. In addition, Anthropic developed classifiers that steer the model away from high-risk actions like creating accounts and posting on social media.

Kevin Okemwa
Contributor

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.