LLM

Every LLM Call Is a Stateless HTTP Request—and That's the Whole Game

By 凌涘 · Jul 5, 2026

Read original on juejin.cn ↗ Google Translate ↗ Alt translation

Statelessness is the invisible constraint that shapes every LLM product decision—from token budgets and context-window management to agent architectures. Understanding it as a plain HTTP loop demystifies why RAG, MCP, and agent frameworks exist and where their real costs and failure modes come from.

Summary

An LLM API call is just an HTTPS POST that sends a JSON payload and gets tokens back. The server holds no session state, so every request must be self-contained. That constraint is not a bug—it is the architectural property that makes horizontal scaling, fault tolerance, and fair resource allocation possible. The client maintains the conversation history and ships the entire thing with each call, which is why ChatGPT seems to remember you: the frontend silently re-sends the whole transcript every time. This stateless foundation forces a hard trade-off. As conversations grow, the messages array balloons, and token costs rise super-linearly. Production systems cap context windows and apply LRU eviction, but naive eviction can drop critical constraints from early in a session. That tension drives the evolution from prompt engineering to context engineering (RAG, MCP, Skills) and onward to loop engineering, where autonomous feedback cycles built from stateless calls produce emergent stateful behavior.

Takeaways

— An LLM API call is a standard HTTPS request; the server processes the JSON payload, returns tokens, and forgets everything.

— Statelessness enables linear horizontal scaling, automatic failover, and fair scheduling because any request can land on any server.

— The client, not the server, maintains conversation state and must resubmit the full message history with every call.

— Token costs grow super-linearly with input length, so production systems enforce a context-window cap and evict older messages.

— Naive LRU eviction can delete early constraints that later replies depend on, creating a tension between cost control and correctness.

— Context engineering (RAG, MCP, Skills) shifts the problem from writing better prompts to curating what the model sees.

— Loop engineering wraps stateless calls in execute-observe-reflect cycles, producing emergent stateful behavior from stateless primitives.

Conclusions

The framing of LLM interaction as a plain HTTP loop is a useful corrective to the mystique around AI; it recasts agents, RAG, and tool-use as strategies for packing the right JSON into a POST body.

Calling the three stages Prompt Engineering → Context Engineering → Loop Engineering gives a clean mental model for why the industry moved from better prompts to retrieval and tool-use, and now to autonomous feedback loops.

The observation that the model doesn't need memory—you do—flips the typical narrative that statelessness is a limitation; it's the engineering property that makes reliable, scalable AI systems possible.

Concepts & terms

Stateless LLM architecture

A design where the inference server retains no client context between requests; every call must carry the full message history and parameters in a self-contained payload, typically over HTTPS.

LRU eviction with capacity limit

A strategy for keeping context windows within a token budget by discarding the least recently used messages when the history exceeds a set threshold, analogous to a CPU cache eviction policy.

Context Engineering

The practice of managing what information reaches the model at inference time—via RAG, tool calls, or skill composition—rather than relying solely on prompt wording.

Loop Engineering

An emerging paradigm that wraps individual stateless LLM calls inside autonomous feedback loops (execute, observe, reflect, re-execute) to produce persistent, goal-directed behavior.

Source: juejin.cn ↗ Google Translate ↗ Backup ↗