llama.cpp b9754: A Tiny Parser Fix That Makes Agent Tool Calling Reliable

llama.cpp: A Small but Critical Fix for Agent Tool Calling

![llama.cpp b9754 Tool Calling Fix: Hand-drawn style cover showing model output Tokens → parsed into tool_call → XML parameter boundary fix flowchart, emphasizing "Not a performance upgrade, but a boundary fix"]

Version Matrix

| Feature | Status | Description |
| --- | --- | --- |
| common/peg ac parser implementation | ✅ Verified | PR #24869 introduces ac parser in common/peg to handle delimiter boundary consistency issues |
| Stricter peg-native grammar generation | ✅ Verified | Fixes semantic loophole where illegal prefixes could be swallowed when combining until(delim) and literal(delim) |
| XML-style tool calling stability improvement | ⚠️ Pending verification | Needs regression testing with multi-parameter, long string parameters, and streaming scenarios |
| Compatibility with --jinja and custom chat templates | ⚠️ Pending verification | Custom template and peg-native parser compatibility needs business-side verification |
| llama-cli local chat experience change | ❌ Not applicable | No noticeable change in normal chat scenarios |
| Normal Chat Completion behavior change | ❌ Not applicable | No noticeable change when not using peg-native tool calling path |

TL;DR

llama.cpp b9754 is neither a "major version upgrade" nor a dramatic performance jump. What deserves attention is that it fixes a very specific but critical issue: it makes grammar generation in peg-native tool calling stricter, which prevents the model from generating structures in XML-style tool calls that look legal but cannot be parsed.

More specifically, the related PR implements an ac parser in common/peg to handle the "read until a certain delimiter" scenario. It solves the semantic inconsistency between the generation phase and the parsing phase.

This type of fix seems low-level on the surface, but its actual impact is on the reliability of Agent tool calling.

If you only use llama.cpp for normal chat, you may not notice this version at all; if you use llama-server as an OpenAI-compatible API, or for tool calling, structured output, or an Agent Runtime, this kind of fix is well worth your attention.

1. llama.cpp is no longer just for running models locally

llama.cpp was first known as "a C/C++ project for running LLaMA locally." But it is no longer just a command-line inference tool; it is a highly complete local inference stack.

It covers many capabilities:

Model loading and GGUF format
Multi-backend inference: CPU, Metal, CUDA, ROCm, Vulkan, SYCL, OpenVINO
Quantized model inference
llama-cli local command-line entry
llama-server OpenAI-compatible service entry
Grammar constrained decoding
Tool calling, chat template, parser, server streaming

So now when looking at llama.cpp, you can't just look at "how fast it runs." It is gradually becoming an edge-side, local-side, private deployment-side LLM Runtime.

This is also why a parser/grammar-level fix is worth a dedicated article.

Because in the Agent era, an inference framework must not only compute tokens but also ensure that tokens can stably become structured actions consumable by upper-layer applications.

![From inference tool to Runtime: Hand-drawn infographic, left side shows early "local inference tool" notebook, right side shows today's Agent Runtime stack from bottom to top: hardware backend, GGUF, llama-server, grammar/parser, tool_calls, Agent application]

2. What exactly did b9754 change?

The core change can be summarized in one sentence:

Implement ac parser in common/peg to make grammar generation stricter, preventing tool calling output from escaping syntax constraints.

There are several keywords in this sentence.

common/peg: It means this is not a change to the model inference kernel, nor to CUDA, Metal, or ROCm backends, but to the common parsing logic.

ac parser: This can be understood as an approach based on the Aho-Corasick automaton, which is well suited to "multiple delimiter matching" and "read until a certain delimiter" problems.

stricter grammar generation: The focus is not on fallback during parsing, but on preventing the model from sampling erroneous structures during the generation phase as much as possible.

This direction is very important.

The stability of structured output and tool calling cannot rely solely on post-output parsing. A better approach is to restrict illegal paths through grammar during the token generation process.

3. Why does tool calling break?

Many models now support tool calling, but different inference frameworks implement tool calling in different ways.

In the OpenAI-style interface, tool calling is usually expressed as structured JSON:

{
  "tool_calls": [
    {
      "function": {
        "name": "read_file",
        "arguments": "{\"filePath\":\"/tmp/a.txt\"}"
      }
    }
  ]
}
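One detail of this shape is easy to miss: arguments is itself a JSON-encoded string, so a consumer has to decode twice. The following is a minimal Python sketch of that, assuming the response fragment looks exactly like the example above; extract_calls is a hypothetical helper name, not part of any llama.cpp or OpenAI client API.

```python
import json

# The response fragment from the example above, kept as raw text.
raw = r'''
{
  "tool_calls": [
    {
      "function": {
        "name": "read_file",
        "arguments": "{\"filePath\":\"/tmp/a.txt\"}"
      }
    }
  ]
}
'''

def extract_calls(message_text):
    """Return (name, args_dict) pairs. Hypothetical helper for illustration."""
    message = json.loads(message_text)        # first decode: the message itself
    calls = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])    # second decode: the arguments string
        calls.append((fn["name"], args))
    return calls

calls = extract_calls(raw)
```

If the inner string is not valid JSON, json.loads raises at the second decode, which is the point where a client would usually report a malformed tool call.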

But in some chat templates, tool calling may be organized by the model into XML style:

<parameter=filePath>
/Users/demo/file.txt
</parameter>
<parameter=startLine>
1
</parameter>

This format itself is fine, but it is extremely sensitive to boundaries.

For example, when does a parameter value end?

Is the value of filePath:

/Users/demo/file.txt

Or:

/Users/demo/file.txt
</parameter>

If the generation phase and the parsing phase have inconsistent understandings of this boundary, problems will occur.

A typical problem that appeared in the related issue this time is: the model occasionally generates duplicate </parameter>:

<parameter=filePath>
/Users/.../file.story
</parameter>
</parameter>
<parameter=startLine>
1
</parameter>

A human can see at a glance that there is an extra closing tag. But for the inference system, the problem is not "can a human understand it?", but:

Why was this structure allowed to be sampled during the generation phase?

Why can't the parsing phase accept it?

This is the core contradiction that fixes like b9754 aim to resolve.
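To make the failure concrete, here is a small Python sketch of a strict parser for the XML-style format shown above. It is an illustration under assumptions, not llama.cpp's PEG parser: the tag layout (name on the opening line, value, then a closing tag on its own line) is inferred from the examples in this article, and parse_parameters is a hypothetical name.

```python
import re

# Assumed layout: "<parameter=name>\n" + value + "\n</parameter>\n"
OPEN = re.compile(r"<parameter=([A-Za-z_][A-Za-z0-9_]*)>\n")
CLOSE = "\n</parameter>\n"

def parse_parameters(text):
    """Parse consecutive parameter blocks; raise ValueError on any leftover text."""
    params = {}
    pos = 0
    while pos < len(text):
        m = OPEN.match(text, pos)
        if m is None:
            # Anything that is not an opening tag here is unexplained residue.
            raise ValueError(f"unparsed output at offset {pos}: {text[pos:pos + 20]!r}")
        end = text.find(CLOSE, m.end())       # the first complete closing delimiter
        if end == -1:
            raise ValueError(f"parameter {m.group(1)!r} is never closed")
        params[m.group(1)] = text[m.end():end]
        pos = end + len(CLOSE)
    return params

good = "<parameter=filePath>\n/Users/demo/file.txt\n</parameter>\n<parameter=startLine>\n1\n</parameter>\n"
bad = "<parameter=filePath>\n/Users/demo/file.txt\n</parameter>\n</parameter>\n<parameter=startLine>\n1\n</parameter>\n"
```

parse_parameters(good) returns both parameters, while parse_parameters(bad) raises as soon as it reaches the stray closing tag, which mirrors the "parsing phase cannot accept it" half of the contradiction.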

4. The real problem: GBNF and PEG have inconsistent understandings of boundaries

Two concepts need to be distinguished here.

GBNF is mainly used in the generation phase. It tells the model: which tokens are allowed to be generated next, and which tokens should not be generated. Its role is to constrain the sampling path.

PEG parser is mainly used in the parsing phase. After the model output is complete, the parser attempts to parse the text into a tool calling structure.

Ideally, these two should have the same set of semantics.

That is, what GBNF allows to be generated should be parseable by the PEG parser; what the PEG parser considers illegal should not be allowed by GBNF to be generated.

But the problem this time lies precisely here.

In logic like Until('\n</parameter>\n'), the GBNF generation grammar might allow certain delimiter prefixes to be swallowed by the parameter value, and then match the complete delimiter later. As a result, the model might generate:

value
</parameter>
</parameter>

From the perspective of the generation grammar, it might be considered a legal path.

But the PEG parser uses another boundary during parsing: it considers the parameter ended when it encounters the first complete </parameter>. So the remaining second </parameter> becomes residual text that cannot be explained.
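The mismatch can be reproduced in miniature with two regular expressions. This is a deliberately simplified Python stand-in, assuming the delimiter from the example above; LOOSE and STRICT are illustrative names and are not llama.cpp's GBNF rules. LOOSE lets the value contain anything, including delimiter text, as long as the string ends with a delimiter. STRICT forbids the delimiter inside the value.

```python
import re

DELIM = "\n</parameter>\n"
D = re.escape(DELIM)
OPEN = r"<parameter=\w+>\n"

# Loose "until": the value may swallow delimiter text before the final delimiter.
LOOSE = re.compile(OPEN + r"(.*)" + D, re.DOTALL)
# Strict "until": no position inside the value may start a complete delimiter.
STRICT = re.compile(OPEN + r"((?:(?!" + D + r").)*)" + D, re.DOTALL)

good = "<parameter=filePath>\n/tmp/a.txt\n</parameter>\n"
bad = "<parameter=filePath>\n/tmp/a.txt\n</parameter>\n</parameter>\n"

loose_ok = LOOSE.fullmatch(bad) is not None    # the loose grammar accepts the duplicate
strict_ok = STRICT.fullmatch(bad) is not None  # the strict grammar rejects it

# Parser view: the value ends at the FIRST complete delimiter; the rest is residue.
first = bad.find(DELIM)
residual = bad[first + len(DELIM):]
```

Under the loose rule the captured value is "/tmp/a.txt\n</parameter>", that is, a delimiter prefix swallowed into the value. The parser view leaves "</parameter>\n" unexplained, so the two phases disagree about the same text.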

The final manifestation is:

The model generates a tool call
The server attempts to parse it
Parsing fails
tool_calls is not correctly returned
The streaming task is interrupted

The most troublesome aspect of this type of bug is that it does not occur every time. It depends on the model output, context, sampling path, template format, and tool parameter content, so it manifests online as "occasional tool calling failure."

Occasional problems are harder to debug than problems that can be stably reproduced.

![Inconsistent boundary understanding: Left side shows GBNF generation gradually generating tokens and treating content after the first as legal output; right side shows PEG parser stopping at the delimiter and reporting "duplicate closing tag + parsing failure"; red dashed line in the middle marks the delimiter boundary conflict]

5. What does the ac parser fix?

The core goal of the ac parser introduced this time is not to make the parser more lenient, but to make grammar generation stricter.

The original problem was:

When combining until(delim) and the subsequent literal(delim), the generation grammar might allow part of the delimiter to be consumed by the preceding value.

The new approach is closer to:

Match and consume until the first occurrence of the delimiter, and incorporate the delimiter boundary into a consistent automaton process.

It can be understood as a state machine problem.

Suppose the delimiter is:

\n</parameter>\n

When the generator sees:

\n</parameter>

This is a prefix of the complete delimiter. The generator cannot arbitrarily let the value end here, nor can it let this prefix be swallowed as ordinary value content, only for a complete delimiter to appear again later.

Aho-Corasick-like automata are well suited to this kind of "scan text until encountering one or more delimiters" problem. They do not rely on simple string lookup, but explicitly model the prefix, suffix, and state transition relationships of the delimiter.

This makes it easier for the grammar to know:

Is the current state already part of the delimiter?
If a certain character continues to be generated, will it form a complete delimiter?
If a complete delimiter is formed, where should it stop?
If it is only a prefix of the delimiter but the subsequent part doesn't match, how should it backtrack?
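The questions above map directly onto a small automaton. The Python sketch below uses the single-delimiter case (a KMP failure table, which is what Aho-Corasick reduces to for one pattern); it is an illustration of the idea, not the code from the PR, and all function names are hypothetical. The state is the number of delimiter characters matched so far.

```python
def build_failure(delim):
    """fail[i] = length of the longest proper prefix of delim[:i+1] that is also its suffix."""
    fail = [0] * len(delim)
    k = 0
    for i in range(1, len(delim)):
        while k and delim[i] != delim[k]:
            k = fail[k - 1]
        if delim[i] == delim[k]:
            k += 1
        fail[i] = k
    return fail

def step(delim, fail, state, ch):
    """Advance one character; on a mismatch, fall back instead of rescanning the text."""
    while state and ch != delim[state]:
        state = fail[state - 1]
    if ch == delim[state]:
        state += 1
    return state

def scan_until(text, delim):
    """Split at the first complete delimiter: (value, rest), or None if it never occurs."""
    fail = build_failure(delim)
    state = 0
    for i, ch in enumerate(text):
        state = step(delim, fail, state, ch)
        if state == len(delim):               # complete delimiter: the value stops here
            start = i + 1 - len(delim)
            return text[:start], text[i + 1:]
    return None

DELIM = "\n</parameter>\n"
false_start = scan_until("a\n</param\n</parameter>\nrest", DELIM)
duplicate = scan_until("value\n</parameter>\n</parameter>\n", DELIM)
```

In the first call the text begins a delimiter, breaks off, and the automaton falls back without losing the newline that starts the real delimiter. In the second call the scan stops at the first complete delimiter, so the extra closing tag is left in rest rather than being absorbed into the value.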

These details rarely appear in ordinary business development, but they are very critical in constrained decoding.

6. Why is this important for Agents?

Many people look at inference frameworks focusing only on three metrics:

Throughput
Time to first token
Memory usage

These three are certainly important, but the Agent scenario has a fourth metric:

Structured output reliability.

Agents are not normal chat. In normal chat, if the model outputs an extra tag or misses a bracket, the user might still understand. But Agent tool calling is different.

Tool calling failure means:

File reading fails
API call fails
Database query fails
Robot control command fails
Business process is interrupted

The user sees "why is the model broken again," but the root cause may not be the model's capability, but the instability of the Runtime's structured output pipeline.

Especially in local inference and private deployment, many teams use llama.cpp for these scenarios:

Local code assistant
Offline Agent
Enterprise intranet knowledge base
Robot voice control
Low-cost edge inference

Once tool calling fails occasionally, the user experience will be very poor.

So although changes like b9754 look like just a patch at the parser level, they are actually patching the foundation of the Agent Runtime.

![Don't just rely on parsing fallback: Hand-drawn five-step pipeline showing generation constraints → clear parsing → parameter validation → permission control → safe execution; the bottom three layers of foundation are generation guarantee layer, parsing verification layer, and execution protection layer]

7. Why not simply tolerate errors during parsing?

Some might ask: since there is an extra </parameter>, why not just have the parser ignore it?

This is one approach, but not the highest priority fix.

The reason is simple: tool calling is a high-risk boundary.

If the parser is too lenient, it might swallow originally erroneous structures, or even misparse them as another tool parameter.

For example:

<parameter=filePath>
/tmp/a.txt
</parameter>
</parameter>
<parameter=delete>
true
</parameter>

In this case, which tag should be ignored? Should parsing continue? Is the subsequent parameter trustworthy?

Tool calling is not ordinary text; it may trigger real operations. The closer to the execution layer, the less you can casually "guess the model's intent."

So a more reasonable strategy is:

Be as strict as possible during the generation phase, not allowing the model to generate illegal structures
Maintain clear boundaries during the parsing phase, not blindly swallowing errors
Perform parameter validation and permission control during the execution phase

b9754 fixes the first layer: generation constraints.
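The third layer is outside llama.cpp and belongs to the application. As a rough Python sketch of what "parameter validation and permission control" could look like, assume a hypothetical registry with a single read_file tool restricted to /tmp/; REGISTRY and check_call are illustrative names only.

```python
import posixpath

# Hypothetical registry: tool name -> required parameter names and allowed path roots.
REGISTRY = {
    "read_file": {"required": {"filePath"}, "roots": ("/tmp/",)},
}

def check_call(name, args):
    """Validate and authorize an already-parsed tool call before executing it."""
    spec = REGISTRY.get(name)
    if spec is None:
        raise PermissionError(f"tool not allowed: {name}")
    if set(args) != spec["required"]:
        # Reject extra or missing parameters instead of guessing the model's intent.
        raise ValueError(f"unexpected parameter set: {sorted(args)}")
    # Path check specific to this hypothetical tool: collapse '..' before comparing.
    path = posixpath.normpath(args["filePath"])
    if not any(path.startswith(root) for root in spec["roots"]):
        raise PermissionError(f"path outside allowed roots: {path}")
    return {"filePath": path}

ok = check_call("read_file", {"filePath": "/tmp/a.txt"})
```

A call that carries an unexpected delete parameter, or a path such as /tmp/../etc/passwd, is refused here even if the earlier layers let it through.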

8. Do ordinary users need to upgrade?

If you only use llama.cpp for local chat, for example:

llama-cli -m model.gguf

This update will most likely not bring noticeable changes.

If you use llama-server, but only for normal Chat Completion, you may not notice it either.

The real beneficiaries are the following scenarios:

Using --jinja
Using custom chat templates
Using XML-style tool calling
Using peg-native chat format
Using tool calling and streaming
Model output needs to be strictly parsed into tool_calls
Local inference service needs to interface with upper-layer business systems

Especially when using complex templates, multi-parameter tool calling, reasoning models, and multi-turn Agents, these problems are more likely to surface.

If you have seen logs or phenomena like the following, you should upgrade and verify:

common_chat_peg_parse: unparsed peg-native output
srv stop: cancel task
tool_calls not returned
stream aborted
duplicate </parameter>
malformed tool-call XML
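To check whether a deployment is affected, one option is to count these markers in existing server logs. The Python sketch below assumes the first two strings in the list above appear in log lines roughly as written; the exact spacing and surrounding fields in real llama-server logs may differ, which is why whitespace is normalized before matching. The sample lines are synthetic.

```python
# Marker strings taken from the list above; treat them as assumptions about the log text.
MARKERS = (
    "common_chat_peg_parse: unparsed peg-native output",
    "srv stop: cancel task",
)

def count_markers(lines):
    """Count how many lines contain each marker, ignoring runs of whitespace."""
    counts = {marker: 0 for marker in MARKERS}
    for line in lines:
        normalized = " ".join(line.split())
        for marker in MARKERS:
            if marker in normalized:
                counts[marker] += 1
    return counts

# Synthetic sample, not real server output.
sample = [
    "srv   stop: cancel task",
    "common_chat_peg_parse: unparsed peg-native output",
    "an unrelated line",
]
counts = count_markers(sample)
```

In practice the lines would come from an open log file; comparing counts before and after the upgrade gives a rough signal of whether the failure rate changed.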

9. How should you test after upgrading?

If you are using llama.cpp for Agent services, it is not recommended to only test "whether a tool can be called once."

It is recommended to do several types of regression testing:

First, multi-parameter tool calling.

Second, string parameters containing newlines, paths, XML-like text, and JSON fragments.

Third, whether tool calling output is stable in streaming mode.

Fourth, whether the model can still generate normal answers.

Fifth, whether custom chat templates are compatible with the peg-native parser.

Sixth, whether erroneous tool calls are explicitly rejected rather than misparsed.

Enterprise-grade Agents need to test stability under high frequency, multi-turn, complex parameters, abnormal parameters, and long contexts.
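Several of these checks can be prototyped offline before touching a server. The Python sketch below builds awkward parameter values, renders them in the XML-style format used in this article, and checks that a strict parser either round-trips them or rejects them outright. The render and parse functions are hypothetical stand-ins for the real template and parser, and the format is assumed from the earlier examples.

```python
DELIM = "\n</parameter>\n"

def render(params):
    """Render parameters in the assumed XML-style layout."""
    return "".join(f"<parameter={k}>\n{v}{DELIM}" for k, v in params.items())

def parse(text):
    """Strict parse: any text that is not a well-formed block raises ValueError."""
    params, pos = {}, 0
    while pos < len(text):
        if not text.startswith("<parameter=", pos):
            raise ValueError("unparsed output")
        name_end = text.index(">\n", pos)          # raises ValueError if missing
        value_start = name_end + 2
        end = text.find(DELIM, value_start)
        if end == -1:
            raise ValueError("unterminated parameter")
        params[text[pos + len("<parameter="):name_end]] = text[value_start:end]
        pos = end + len(DELIM)
    return params

# Values that should survive a round trip: newlines, paths, XML-like text, JSON fragments.
cases = {
    "multiline": "line1\nline2",
    "path": "/Users/demo/file.txt",
    "xmlish": "<b>bold</b> </parameter> inline",
    "jsonish": '{"a": [1, 2]}',
}
round_trip_ok = parse(render(cases)) == cases

# A value containing the full delimiter cannot be represented; it must be rejected.
try:
    parse(render({"p": "x\n</parameter>\ny"}))
    rejected = False
except ValueError:
    rejected = True
```

The same case list can then be reused as tool arguments in prompts against a running server, in both streaming and non-streaming mode.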

10. From an engineering perspective, what does this fix illustrate?

Behind this fix lies a very important engineering reality:

The complexity of LLM Runtimes is expanding from "matrix computation" to "protocols, templates, syntax, parsing, state machines."

Early inference frameworks competed on:

Who supports more quantization formats
Who is faster
Who can run on more hardware

Now, moving towards Agent scenarios, the competition points will become:

Whose tool calling is more stable
Whose structured output is more reliable
Whose server streaming is more robust
Whose compatibility with chat templates is better
Who can handle multi-model, multi-tool, multi-turn calls
Who can provide controllable behavior when model output is imperfect

This is also why llama.cpp is increasingly resembling a complete inference stack.

It is not only responsible for computing tokens, but also for turning tokens into structures that upper-layer applications can safely consume.

11. Conclusion

The core significance of llama.cpp b9754 is not that "a new parser was added," but that it exposed a key fact about Agent Runtimes:

Model output is not the final result; being able to be stably, strictly, and controllably parsed and executed is the engineering result.

In the era of normal chat, inference frameworks only needed to compute tokens.

In the Agent era, inference frameworks must also ensure that tokens can become reliable structured actions.

b9754 fixes a boundary problem in this structured action pipeline.

It is small, but the direction is very correct.

For those using llama.cpp for local Agents, tool calling, robot control, and private LLM services, this type of update is worth continuous tracking.

![Agent Runtime Foundation: Hand-drawn horizontal pipeline showing user request → model generation → tool_call → execute tool → return result → continue conversation → reliable execution; the bottom four colored foundations are grammar, parser, template, state machine]

Error Quick Reference Card

| Symptom | Root Cause | Diagnosis | Fix |
| --- | --- | --- | --- |
| llama-server occasional tool call failure, log shows duplicate </parameter> | Inconsistent understanding of until(delim) boundary between peg-native grammar generation and PEG parser | Check common_chat_peg_parse: unparsed peg-native output and model output XML fragment | Upgrade to version containing PR #24869 (b9754) to let grammar reject illegal paths during generation |
| tool_calls field missing in streaming response, srv stop: cancel task appears | Parser stops at first complete </parameter>, remaining closing tags become residual causing parse failure | Compare generated content with parser truncation point in stream logs | Upgrade to b9754 and retest streaming; maintain clear parser boundaries instead of swallowing errors |
| Multi-parameter tool calling occasionally reports malformed tool-call XML | String parameters containing \n, </parameter> prefix, or XML-like text trigger delimiter prefix being swallowed by value | Reproduce with string parameters containing paths, newlines, JSON-like text | Upgrade to ac parser version; perform the six types of regression testing listed in section 9 |
| Custom chat template occasionally fails when used with peg-native parser | Template output uses peg-native chat format, but grammar does not restrict generation path | Under --jinja + custom template, verify behavior against PR #24869 | Upgrade to b9754 and perform template/parser compatibility testing |
| Occasional stream aborted with occasional success on retry | Illegal structure depends on sampling path, not 100% reproducible | Reproduce with multiple sampling runs or higher temperature; record generated token sequence | After upgrade, eliminate illegal paths at the source through strict grammar generation |