Tool calling quality is noisy in a way LLM text generation isn't. The difference between "works" and "explodes" is tiny, and traditional benchmarks miss it. We need tool-specific evaluation frameworks. It would almost immediately become one of the most sought-after metrics. #AgenticAI #ToolCalling #LLM #MLevaluation #AIinfra #machineLearning #hermesAgent #openclaw #claudecode