AgentFloor Benchmark: Small Open-Weight Models Match GPT-5
Discover how the AgentFloor benchmark shows that small open-weight models can match GPT-5 on routine tasks, enabling cost-effective AI agent architectures.
TL;DR: The AgentFloor benchmark evaluated 16 open-weight models against GPT-5 across 16,542 runs. Results show small models match frontier performance on routine tasks, while large models are still needed for complex long-horizon planning. This suggests a hybrid architecture where small models handle most work to cut costs.
Key facts
- AgentFloor is a deterministic 30-task benchmark organized into a six-tier capability ladder [1, 6].
- The study evaluated 16 open-weight models (0.27B to 32B parameters) and GPT-5 across 16,542 scored runs [1, 6].
- Small and mid-sized open-weight models are sufficient for most short-horizon, structured tool-use tasks [1, 5].
- The strongest open-weight model matched GPT-5’s aggregate performance on the benchmark [1, 6].
- Open-weight models were found to be substantially cheaper and faster to execute than GPT-5 [1, 6].
- Frontier models retain an advantage in long-horizon planning requiring sustained coordination [1, 5].
- Neither model category achieved strong reliability on long-horizon planning tasks [1, 6].
Small Models, Big Impact
For years, the assumption in the AI community has been that building reliable agent workflows requires massive, frontier-grade language models. These large models are expensive to run and slow to respond, yet they are often used for simple, repetitive tasks that do not require deep reasoning. A new study challenges this assumption, showing that smaller, open-weight models are already capable of handling the bulk of agentic work.
The research, titled ‘AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go?’, introduces a new benchmark designed to test the practical limits of smaller models in agent pipelines [1]. The findings suggest a shift in how we should design AI systems: instead of using one giant model for everything, we can use a mix of small and large models to save money and time without losing capability.
The AgentFloor Benchmark
The core of the study is AgentFloor, a deterministic benchmark consisting of 30 distinct tasks [1, 6]. These tasks are organized into a six-tier capability ladder that ranges from simple instruction following to complex, long-horizon planning [1, 6]. The ladder tests various skills, including basic tool use, multi-step coordination, and the ability to track constraints over many steps [1, 6].
Researchers evaluated 16 open-weight models, ranging from 0.27 billion to 32 billion parameters, alongside GPT-5 [1, 6]. The evaluation involved 16,542 scored runs, providing a robust dataset for comparison [1, 6]. The goal was to answer a practical question: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller, cheaper models?
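As a rough mental model of how such a tiered, deterministic harness could be organized, the sketch below labels tasks by tier and aggregates pass rates per tier. The tier names, task fields, and run counts are illustrative assumptions for this article, not the paper's actual schema or scoring rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tier labels; the paper's actual six-tier names may differ.
TIERS = [
    "instruction_following",
    "single_tool_call",
    "multi_step_tool_use",
    "state_tracking",
    "constraint_tracking",
    "long_horizon_planning",
]

@dataclass
class Task:
    task_id: str
    tier: str                       # one of TIERS
    run_task: Callable[[str], str]  # deterministic runner: model name -> transcript
    check: Callable[[str], bool]    # deterministic pass/fail check on the transcript

def score_model(model: str, tasks: List[Task], runs_per_task: int = 3) -> Dict[str, float]:
    """Aggregate pass rate per tier for one model (illustrative harness loop only)."""
    passes: Dict[str, int] = {t: 0 for t in TIERS}
    totals: Dict[str, int] = {t: 0 for t in TIERS}
    for task in tasks:
        for _ in range(runs_per_task):
            transcript = task.run_task(model)
            totals[task.tier] += 1
            passes[task.tier] += int(task.check(transcript))
    # Report per-tier pass rates, so capability can be read off the ladder.
    return {t: passes[t] / totals[t] for t in TIERS if totals[t]}
```

A per-tier breakdown like this is what lets a benchmark separate "sufficient for routine tool use" from "needed for long-horizon planning", rather than reporting a single aggregate score.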
Routine Tasks: A Tie
The results drew a clear line between tasks that need a frontier model and tasks that do not. Small and mid-sized open-weight models were found to be sufficient for most short-horizon, structured tool-use tasks [1, 5]. These tasks dominate real-world agent pipelines, where systems make many model calls per user request, and most of those calls are short and routine [6].
In aggregate, the strongest open-weight model matched GPT-5’s performance on the benchmark [1, 6]. This is a significant finding because it suggests that for the majority of daily agentic work, users do not need to pay for the most powerful models available. Furthermore, the open-weight models were found to be substantially cheaper and faster to execute than GPT-5 [1, 6]. This cost and latency advantage makes them highly attractive for production systems where efficiency is critical.
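To make the economics concrete, here is a back-of-the-envelope sketch. The per-call prices and the 90/10 traffic split are purely illustrative assumptions, not figures from the paper; they only show how routing the routine majority of calls to a cheap model compounds into large savings.

```python
# Hypothetical per-call costs (USD) and traffic mix -- illustrative only.
frontier_cost_per_call = 0.020
small_model_cost_per_call = 0.001
calls_per_request = 20     # agent pipelines often make many model calls per user request
routine_fraction = 0.9     # assume 90% of calls are short, routine tool use

# Cost if every call goes to the frontier model.
frontier_only = calls_per_request * frontier_cost_per_call

# Cost if routine calls go to the small model and only the rest escalate.
hybrid = calls_per_request * (
    routine_fraction * small_model_cost_per_call
    + (1 - routine_fraction) * frontier_cost_per_call
)

# With these made-up numbers, the hybrid mix is roughly $0.06 vs $0.40 per
# request, i.e. about 85% cheaper, before counting the latency benefit.
print(f"frontier-only: ${frontier_only:.3f} per request")
print(f"hybrid:        ${hybrid:.3f} per request")
print(f"savings:       {1 - hybrid / frontier_only:.0%}")
```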
The Planning Gap
However, the study also identified areas where larger models still hold an advantage. Frontier models like GPT-5 performed better on long-horizon planning tasks [1, 5]. These tasks require sustained coordination and reliable constraint tracking over many steps, which is more challenging for smaller models [1, 5].
Despite this advantage, neither open-weight nor frontier models reached strong reliability on long-horizon planning tasks [1, 6]. This indicates that while large models are better at complex planning, they are not yet dependable at it. The gap between small and large models is most pronounced in these complex scenarios, but both categories still have room to improve in reliability.
A Hybrid Architecture
Based on these findings, the study suggests a hybrid design principle for building AI agents [1, 6]. The authors recommend deploying smaller open-weight models for the broad base of routine actions, such as file operations and simple tool use [1, 5]. These models can handle the high-volume, low-complexity tasks efficiently.
For the narrower class of tasks that demand deeper planning, such as long-horizon coordination, larger frontier models should be reserved [1, 5]. This approach allows developers to balance cost and performance, using small models for speed and economy, and large models for complexity and reliability.
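A minimal routing sketch of this hybrid principle is shown below. The model identifiers, the horizon threshold, and the routing signals are assumptions made for illustration; the paper does not prescribe a specific router.

```python
from dataclasses import dataclass

# Illustrative model identifiers; substitute whatever small open-weight and
# frontier endpoints your stack actually uses.
SMALL_MODEL = "open-weight-7b"
FRONTIER_MODEL = "gpt-5"

@dataclass
class AgentStep:
    tool_name: str
    expected_horizon: int    # estimated number of coordinated steps remaining
    requires_planning: bool  # does this step need sustained constraint tracking?

def pick_model(step: AgentStep, horizon_threshold: int = 5) -> str:
    """Route routine, short-horizon tool calls to the small model and reserve
    the frontier model for long-horizon planning (sketch only)."""
    if step.requires_planning or step.expected_horizon > horizon_threshold:
        return FRONTIER_MODEL
    return SMALL_MODEL

# A simple file operation stays on the cheap model, while a multi-step
# coordination task escalates to the frontier model.
print(pick_model(AgentStep("read_file", expected_horizon=1, requires_planning=False)))
print(pick_model(AgentStep("plan_migration", expected_horizon=12, requires_planning=True)))
```

In practice the routing signal might come from task metadata, a lightweight classifier, or escalation after a failed attempt; the point is simply that the expensive model sits behind a narrow gate rather than handling every call.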
Industry Context
This finding aligns with broader industry observations. Other evaluations have noted that open models are crossing a threshold in core agent tasks [8]. Models like GLM-5 and MiniMax M2.7 have been shown to match closed frontier models on core agent tasks at a fraction of the cost [8]. The AgentFloor benchmark provides concrete evidence for this trend, showing that the performance gap between open and closed models is narrowing for routine tasks.
Open Source and Reproducibility
The researchers have released the benchmark, harness, sweep configurations, and full run corpus to the public [1, 6]. This commitment to open science allows other researchers and developers to reproduce the results and build upon them. By providing these resources, the study encourages further exploration of hybrid agent architectures and the capabilities of smaller models.
Conclusion
The AgentFloor benchmark challenges the prevailing assumption that large frontier models are universally necessary for agentic workflows. By showing that small open-weight models can match GPT-5 on routine tasks, the study opens the door to more cost-effective and efficient AI systems. As the industry continues to evolve, a hybrid approach that leverages the strengths of both small and large models may become the standard for building reliable and affordable AI agents.
The release of the benchmark and its associated resources marks a significant step forward in our understanding of model capabilities. It provides a clear roadmap for developers looking to optimize their agent pipelines, balancing performance with cost and speed. As small models continue to improve, the role of large frontier models may shift from handling all tasks to focusing on the most complex and challenging ones.
This research highlights the importance of evaluating models in realistic, task-specific contexts rather than relying on general benchmarks. By focusing on the practical needs of agent workflows, the AgentFloor benchmark offers valuable insights for the future of AI development.
Sources
- AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go? (arxiv.org) — 2026-05-01
- Open Models have crossed a threshold (www.langchain.com) — 2026-04-02