llama.cpp b9754: A Tiny Parser Fix That Makes Agent Tool Calling Reliable

By 武子康 · Jun 28, 2026

Original article: juejin.cn

As llama.cpp evolves from a local inference toy into a full Agent runtime, the reliability of structured output becomes as important as raw token throughput. This fix addresses a class of intermittent failures that are notoriously hard to debug, the kind that break production Agent pipelines without an obvious cause. For any team building local or private-deployment Agents on llama.cpp, this patch is a silent stability upgrade.

Summary

llama.cpp build b9754 fixes a subtle but critical bug in the peg-native parser that caused occasional tool calling failures in Agent scenarios. The problem: when models generate XML-style tool calls, the grammar generation phase (GBNF) and the parsing phase (PEG) disagreed on where parameter values ended. The generated grammar therefore sometimes let the model emit a duplicate closing tag such as `</parameter>`, which the parser then rejected, breaking the tool call.
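The mismatch described above can be reproduced in a few lines. The following is a simplified Python model, not llama.cpp code: `LENIENT_VALUE`, `generation_allows`, `parse_value`, and `parses_cleanly` are hypothetical names, and the lenient rule is an assumption about the general shape of the bug ("any characters, then the delimiter") rather than the grammar llama.cpp actually generated.

```python
import re

DELIM = "</parameter>"

# Hypothetical lenient generation-side rule: any characters, then the delimiter.
# Because ".*" may itself contain the delimiter, doubled closing tags are accepted.
LENIENT_VALUE = re.compile(r"(?s).*" + re.escape(DELIM))


def generation_allows(text: str) -> bool:
    """True if the lenient generation-side rule accepts the whole text."""
    return LENIENT_VALUE.fullmatch(text) is not None


def parse_value(text: str):
    """Parser-side until(delim) + literal(delim): stop at the FIRST delimiter.

    Returns (value, rest), or None when no delimiter is present.
    """
    end = text.find(DELIM)
    if end < 0:
        return None
    return text[:end], text[end + len(DELIM):]


def parses_cleanly(text: str) -> bool:
    """True if the parser consumes the whole text with nothing left over."""
    result = parse_value(text)
    return result is not None and result[1] == ""


good = "Berlin</parameter>"
bad = "Berlin</parameter></parameter>"
```

Both phases accept `good`. For `bad`, `generation_allows` returns `True` while `parses_cleanly` returns `False`: the parser stops at the first delimiter and is left holding a stray `</parameter>`, which is the kind of disagreement the fix closes on the generation side.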

The fix introduces an Aho-Corasick (ac) parser into `common/peg` that makes grammar generation stricter. Instead of allowing the model to sample tokens that might later prove unparseable, the new approach explicitly models delimiter boundaries as a state machine. As a result, the generation phase can only produce structures the parsing phase can handle.

For developers running llama.cpp as a local Agent runtime — using `llama-server` with tool calling, streaming, custom chat templates, or XML-style tool formats — this update directly improves reliability. The fix doesn't affect normal chat, but it patches a foundational layer of the Agent stack: the pipeline that turns raw model tokens into structured, executable actions.

Takeaways

— llama.cpp b9754 introduces an Aho-Corasick (ac) parser in the `common/peg` module to fix delimiter boundary inconsistencies.

— The root cause was a semantic mismatch: GBNF grammar generation allowed token sequences that the PEG parser could not parse, specifically around `until(delim)` and `literal(delim)` combinations.

— The bug manifested as occasional duplicate `</parameter>` tags in XML-style tool calls, causing parsing failures and interrupted streaming.

— The fix makes grammar generation stricter, preventing the model from sampling illegal structures at the source rather than relying on post-hoc parsing tolerance.

— Normal chat and simple Chat Completion are unaffected; the fix targets tool calling, streaming, and custom chat template scenarios.

— Recommended regression tests include multi-parameter tool calls, string parameters with special characters, streaming stability, and custom template compatibility.

— The relevant PR is #24869 and the original issue is #24863 in the ggml-org/llama.cpp repository.
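The regression checks listed above can be backed by a strict client-side validator that fails loudly on stray tags. The sketch below is an illustration under stated assumptions: the `<parameter=name>value</parameter>` layout is one XML-style format and may not match your chat template, and `extract_parameters` is a hypothetical helper, not part of llama.cpp.

```python
import re

# Assumed tag layout: <parameter=name>value</parameter>
OPEN_TAG = re.compile(r"<parameter=([A-Za-z_][A-Za-z0-9_]*)>")
CLOSE_TAG = "</parameter>"


def extract_parameters(tool_call_text: str) -> dict[str, str]:
    """Parse parameters strictly; raise ValueError on stray or missing tags."""
    params: dict[str, str] = {}
    pos = 0
    while True:
        match = OPEN_TAG.search(tool_call_text, pos)
        if match is None:
            break
        # Only whitespace may sit between one parameter and the next.
        if tool_call_text[pos:match.start()].strip():
            raise ValueError("unexpected text between parameters")
        end = tool_call_text.find(CLOSE_TAG, match.end())
        if end < 0:
            raise ValueError("missing closing tag")
        params[match.group(1)] = tool_call_text[match.end():end]
        pos = end + len(CLOSE_TAG)
    # Anything left over, such as a duplicate closing tag, is an error.
    if tool_call_text[pos:].strip():
        raise ValueError("unexpected text after last parameter")
    return params
```

Running this over multi-parameter calls, string values containing characters such as `<`, `&`, and quotes, and the reassembled text of streamed responses covers most of the cases in the list. A duplicate `</parameter>` raises `ValueError` instead of being silently tolerated.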

Conclusions

The shift from 'compute tokens fast' to 'produce reliably parseable tokens' marks llama.cpp's transition from a hobbyist inference tool to a production-grade Agent runtime.

Fixing the generation phase rather than making the parser more lenient is the correct engineering choice for Agent safety — tool calls can trigger real-world actions, so 'guessing' the model's intent is dangerous.

This fix highlights a growing class of problems in LLM infrastructure: the complexity is moving from matrix math to protocol, template, and state machine engineering.

The intermittent nature of this bug — dependent on sampling path, context, and parameter content — makes it a classic 'hard to reproduce, hard to debug' production issue that silently erodes user trust.

Teams using llama.cpp for Agent services should treat structured output reliability as a fourth key metric alongside throughput, latency, and memory usage.

Concepts & terms

GBNF (GGML BNF)

A grammar format used in llama.cpp to constrain token generation during inference. It restricts which tokens the sampler may pick next, which steers the model toward structured output such as JSON or XML tool calls.

PEG (Parsing Expression Grammar) parser

A type of formal grammar parser used in llama.cpp to parse model output into structured data, such as extracting tool call parameters from generated text. It defines unambiguous rules for how text should be interpreted.

Aho-Corasick (ac) parser

A parser built on the Aho-Corasick string-searching algorithm, which efficiently matches multiple patterns simultaneously. In this context, it is used to model delimiter boundaries as a state machine, ensuring the generation phase only produces sequences the parser can handle.
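To make the state-machine idea concrete, here is a minimal textbook Aho-Corasick automaton in Python. It is a sketch of the general algorithm, not the implementation in `common/peg`. The names `build_automaton`, `step`, and `value_is_safe` are hypothetical, and `</function>` as a second delimiter is an assumption made only to show multi-pattern matching.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton as (goto, fail, accepting) tables."""
    goto = [{}]          # goto[state][char] -> next state (trie edges)
    accepting = [False]  # accepting[state] -> some pattern ends in this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                accepting.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepting[state] = True

    # Failure links, computed breadth-first; depth-1 states fail to the root.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            accepting[nxt] = accepting[nxt] or accepting[fail[nxt]]
    return goto, fail, accepting


def step(automaton, state, ch):
    """Advance the automaton by one character, following failure links."""
    goto, fail, _ = automaton
    while state and ch not in goto[state]:
        state = fail[state]
    return goto[state].get(ch, 0)


def value_is_safe(automaton, text):
    """True if `text` never completes a delimiter, i.e. a legal until() body."""
    _, _, accepting = automaton
    state = 0
    for ch in text:
        state = step(automaton, state, ch)
        if accepting[state]:
            return False
    return True


automaton = build_automaton(["</parameter>", "</function>"])
```

With this automaton, `value_is_safe(automaton, "a < b")` is `True` and `value_is_safe(automaton, "Berlin</parameter>")` is `False`. The set of strings that never reach an accepting state is exactly what a value may contain before its delimiter, and that language is something a grammar generator can express rule by rule.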

Constrained decoding

A technique that restricts the model's token sampling to only those tokens that conform to a predefined grammar or schema, ensuring the output is valid structured data (e.g., valid JSON or XML) from the start.
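A toy illustration of the masking step, assuming a string-level check in place of a real grammar engine: `allowed_tokens` and `constrained_pick` are hypothetical helpers, and the vocabulary and scores are made up for the example. Real constrained decoding in llama.cpp works on token IDs and GBNF state, which this sketch does not model.

```python
DELIM = "</parameter>"


def allowed_tokens(generated_value: str, vocab: list[str]) -> list[str]:
    """Return the tokens that keep the parameter value legal.

    A token is allowed if, after appending it, the text either contains no
    delimiter at all, or ends with its first and only delimiter.
    """
    allowed = []
    for token in vocab:
        candidate = generated_value + token
        first = candidate.find(DELIM)
        if first < 0 or first + len(DELIM) == len(candidate):
            allowed.append(token)
    return allowed


def constrained_pick(generated_value: str, scores: dict[str, float]) -> str:
    """Pick the highest-scoring token among those the constraint allows."""
    legal = allowed_tokens(generated_value, list(scores))
    if not legal:
        raise ValueError("no legal token available")
    return max(legal, key=lambda token: scores[token])


vocab = ["Ber", "lin", "</parameter>", "</parameter></parameter>", "lin</parameter>x"]
scores = {"</parameter></parameter>": 0.9, "</parameter>": 0.1}
```

Here `allowed_tokens("Ber", vocab)` keeps `"Ber"`, `"lin"`, and `"</parameter>"` and drops the two tokens that would run past the first delimiter. `constrained_pick("Berlin", scores)` returns `"</parameter>"` even though the doubled tag has the higher score, which is the sense in which illegal structures are blocked at the source rather than repaired afterwards.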
