Benchmark for evaluating multi-step and constrained function calling in long-context scenarios. Tests LLM tool-use capabilities on complex, real-world API interactions.

Dataset

GitHub Repository

benchmark, agentic, evaluation