Every LLM Call Is a Stateless HTTP Request—and That's the Whole Game
Statelessness is the invisible constraint that shapes every LLM product decision—from token budgets and context-window management to agent architectures. Understanding it as a plain HTTP loop demystifies why RAG, MCP, and agent frameworks exist and where their real costs and failure modes come from.
An LLM API call is just an HTTPS POST that sends a JSON payload and gets tokens back. The server holds no session state, so every request must be self-contained. That constraint is not a bug—it is the architectural property that makes horizontal scaling, fault tolerance, and fair resource allocation possible. The client maintains the conversation history and ships the entire thing with each call, which is why ChatGPT seems to remember you: the frontend silently re-sends the whole transcript every time. This stateless foundation forces a hard trade-off. As conversations grow, the messages array balloons, and token costs rise super-linearly. Production systems cap context windows and apply LRU eviction, but naive eviction can drop critical constraints from early in a session. That tension drives the evolution from prompt engineering to context engineering (RAG, MCP, Skills) and onward to loop engineering, where autonomous feedback cycles built from stateless calls produce emergent stateful behavior.
The framing of LLM interaction as a plain HTTP loop is a useful corrective to the mystique around AI; it recasts agents, RAG, and tool-use as strategies for packing the right JSON into a POST body.
Calling the three stages Prompt Engineering → Context Engineering → Loop Engineering gives a clean mental model for why the industry moved from better prompts to retrieval and tool-use, and now to autonomous feedback loops.
The observation that the model doesn't need memory—you do—flips the typical narrative that statelessness is a limitation; it's the engineering property that makes reliable, scalable AI systems possible.