"We automated 150 tasks with AI agents, just copy us": Microsoft's Windows Agent Arena brings AI assistants keyboard-deep to Windows PCs but there are critical security and performance concerns

What you need to know

Earlier this month, Microsoft unveiled a new benchmark called Windows Agent Arena, designed to provide a platform for testing AI agents in realistic Windows operating system environments.
Early benchmarks show that multi-modal AI agents have an average performance success rate of 19.5% compared to the coveted average human performance rating of 74.5%.
The benchmark is open-source and provides an avenue for deep research that could significantly enhance the development of AI agents. However, critical security and performance concerns remain.

With the emergence of generative AI and its broad adoption, the technology is rapidly transitioning from simple text and image-based prompts. NVIDIA CEO Jensen Huang predicted that the next phase of AI would be dominated by self-driving cars and humanoid robots, and we've seen major tech corporations like Tesla make significant leaps on that front.

Over the past few weeks, we've seen Salesforce CEO Marc Benioff throw lethal jabs at Microsoft over claims that it has done a major disservice to the AI industry. "Copilot is just the new Microsoft Clippy," added Benioff. "It doesn't work or deliver value."

The Salesforce CEO also used the opportunity to tout the company as "the largest AI supplier in the world" with the capability of doing "a couple of trillion AI transactions per week." In case you missed it, Microsoft recently announced Copilot Studio will soon support the creation of autonomous agents. Like Salesforce's Agentforce offering, Microsoft's Copilot agents will help automate tasks across IT, marketing, sales, customer service, and finance.

Benioff viewed Microsoft's announcement as a sign that the company is panicking. "Copilot is a flop because Microsoft lacks the data and enterprise security models to create real corporate intelligence," added the Salesforce CEO. "Clippy 2.0, anyone?"

More interestingly, Microsoft unveiled a new benchmark called Windows Agent Arena earlier this month. For context, the benchmark is designed to promote the testing of AI agents in Windows operating system environments. As such, the benchmark could potentially expedite the development of AI assistants with advanced and sophisticated capabilities to handle complex tasks across various applications.

According to the researchers behind the benchmark:

“Large language models show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge.”

What is Windows Agent Arena, and how is it important in the AI revolution?

As highlighted above, Windows Agent Arena provides a platform for testing AI agents in realistic Windows operating system environments, including apps like Microsoft Edge, Microsoft Paint, Clock, VLC media player, and more.

According to Microsoft:

"We adapt the OSWorld framework to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is also scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes."
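The parallelization Microsoft describes works because each benchmark task is independent of the others. The following is a minimal local sketch of that idea in Python, not the actual Windows Agent Arena harness or its Azure setup: `run_task` and `evaluate` are hypothetical names, and the task outcome is a placeholder rather than a real agent run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: int) -> bool:
    """Hypothetical stand-in for one benchmark episode (not the real harness)."""
    # Placeholder outcome: pretend every fifth task succeeds.
    return task_id % 5 == 0


def evaluate(task_ids, workers: int = 8) -> float:
    """Run independent tasks concurrently and return the fraction that succeeded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(run_task, task_ids))
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # 150 is an assumed task count; the quote above says "150+".
    rate = evaluate(range(150))
    print(f"success rate: {rate:.1%}")
```

Because no task depends on another, adding workers shortens the wall-clock time of a full run without changing the measured success rate.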

Microsoft Research developed a multi-modal agent called Navi to explore the framework's capabilities. The AI model was asked to perform several tasks in the Windows Agent Arena benchmark, including turning a website into a PDF file and placing it on the main screen. Benchmarks shared indicate that the multi-modal agent had an average performance success rate of 19.5%, in contrast to the average human performance rating of 74.5%.
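To put those percentages in perspective, the short Python sketch below converts them into approximate task counts. It assumes a round figure of 150 tasks (Microsoft says "150+"), so the counts are illustrative only; `summarize` is a hypothetical helper, not part of the benchmark.

```python
AGENT_RATE = 0.195   # Navi's reported average success rate
HUMAN_RATE = 0.745   # reported average human success rate
ASSUMED_TASKS = 150  # assumption: the benchmark lists "150+" tasks


def summarize(agent: float, human: float, tasks: int) -> dict:
    """Turn success rates into rough task counts and a gap summary."""
    return {
        "agent_tasks": round(agent * tasks),
        "human_tasks": round(human * tasks),
        "gap_points": round((human - agent) * 100, 1),
        "human_to_agent_ratio": round(human / agent, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(AGENT_RATE, HUMAN_RATE, ASSUMED_TASKS).items():
        print(f"{key}: {value}")
```

Under that assumption, the agent completes roughly 29 of 150 tasks while a human completes about 112, a gap of 55 percentage points.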

While the benchmark shows automating certain tasks using AI could be a stretch at this point, it provides a reliable platform for the improvement of AI agents.

Privacy and security continue to concern most users. For instance, Microsoft's controversial Windows Recall feature sparked concern among Windows users and prompted scrutiny by regulators. The tech giant abruptly pulled the feature to make it more secure. It should ship soon, but users can uninstall it.

Similarly, AI agents like Navi continue to spark concern among users as they become more sophisticated. As the tools become more advanced, they'll have more access to applications that often hold the user's personal credentials. That access could pose a significant threat, especially since hackers are embracing sophisticated ploys, including AI, that make their attacks less obvious.

Windows Agent Arena is open-source and presents more research opportunities, which could speed up the development of reliable and capable models. Responding to the security and performance concerns, the Microsoft researchers behind the platform told Windows Central:

“Our computer-controlling agent, named 'Navi,' is open-source, and our research project leverages models from OpenAI, such as GPT-4V, along with Microsoft’s Phi3. While both Windows Agent Arena and Navi are open-source, the specific models in use are separate and maintained by their respective providers.

The disparity between AI system performance and human-level intelligence remains a substantial, industry-wide challenge. We’re working to address this through continuous data curation, fine-tuning, and optimization, making steady progress toward narrowing this gap.

Our approach to responsible AI prioritizes ethical guidelines, with privacy and safety at the forefront. We ensure that AI agents avoid unauthorized access or information leaks, and users retain control to understand, direct, or override AI actions. As we advance in this field, our commitment remains firm: building AI that respects privacy, promotes fairness, and contributes positively to society.”

Elsewhere, Anthropic recently unveiled a new API dubbed “Computer Use” in open beta. Through the API, developers can "direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text."


Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.