Every LLM Call Is a Stateless HTTP Request—and That's the Whole Game

Stateless LLM Architecture Deep Dive: From HTTP Protocol to Context Engineering

I. Introduction: Uncovering the Essence of LLM Calls

What really happens behind the scenes when our code runs await client.chat.completions.create(...)?

The answer is simpler than most people think: it's just an HTTP request. Your code sends a JSON payload (model name, message list, and parameters) over HTTPS to a remote inference server. The server spends GPU compute on forward inference, serializes the generated tokens into JSON, and returns them. A short reply typically comes back within a few hundred milliseconds to a few seconds. Once the response is sent, the server keeps no conversational state from the request and moves on to the next one.

This is the essence of an LLM call: a standard HTTP request-response interaction that consumes compute power, generates a result, and then forgets everything.
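
To make the "just an HTTP request" point concrete, here is a minimal sketch of the request an SDK call boils down to. It assumes an OpenAI-compatible chat completions endpoint; buildChatRequest is a hypothetical helper, and the base URL and API key are placeholders rather than real values.

```javascript
// Hypothetical helper: builds the raw HTTP request an SDK call would send.
// The path and field names follow the OpenAI-compatible chat completions
// convention; baseUrl and apiKey are placeholders.
function buildChatRequest({ baseUrl, apiKey, model, messages }) {
  return {
    url: `${baseUrl}/chat/completions`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`, // identity travels in a header
      },
      // Everything the server needs is serialized into the body.
      body: JSON.stringify({ model, messages }),
    },
  };
}

const request = buildChatRequest({
  baseUrl: 'https://api.example.com/v1', // placeholder, not a real endpoint
  apiKey: 'sk-placeholder',
  model: 'deepseek-v4-flash',
  messages: [{ role: 'user', content: 'Hello' }],
});

console.log(request.url);
console.log(request.options.body);

// Sending it would be a single fetch call (not executed here):
// const res = await fetch(request.url, request.options);
// const data = await res.json();
```

Nothing in the request refers to an earlier call: the URL, the headers, and the body are the whole interaction.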

Understanding this is the foundation for understanding all the superstructure of modern AI engineering.

II. Stateful vs. Stateless: An Inevitable Architectural Choice

To understand the value of "statelessness," we first need to understand its opposite—"statefulness."

Imagine the simplest stateful LLM service: User A initiates a conversation, and the server allocates a block of memory to store A's conversation history. When User A sends another message, the request must be routed to the same block of memory on the same server to retrieve the history, append it to the new prompt, and then perform inference. If User A's second request unfortunately gets assigned by the load balancer to a different server, that server will be completely baffled—"Who are you? What did we talk about?"

This architecture might barely function in low-concurrency scenarios, but once it faces a production environment with millions of daily requests, problems immediately surface:

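Massive Memory Pressure: For every active session, the server must hold that session's state in memory. The raw text is the small part: 100,000 concurrent users averaging 8K tokens of history comes to only a few gigabytes. The heavy part is any per-session inference state, such as the attention key-value cache, which for a large model can run from hundreds of megabytes to gigabytes per 8K-token session. Multiplied across 100,000 sessions, that is far more than any single machine can hold.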
Inability to Scale Horizontally: State is bound to specific nodes. Newly added servers cannot share the burden of stateful requests; scaling up becomes a nightmare of "state migration."
Single Point of Failure Risk: If the server holding user states crashes, all bound sessions are lost instantly, and the user experience drops to zero.
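
The memory pressure can be estimated with back-of-the-envelope arithmetic. The sketch below is only illustrative: the model dimensions (32 layers, hidden size 4096, 16-bit values, no grouped-query attention) and the 4-bytes-per-token text estimate are assumptions, not figures from any particular service.

```javascript
// Rough memory estimate for a stateful design. All model dimensions are
// hypothetical (a 7B-class dense model without grouped-query attention).
const GIB = 1024 ** 3;

// Plain-text history: assume roughly 4 bytes of UTF-8 text per token.
function textHistoryBytes(tokens, bytesPerToken = 4) {
  return tokens * bytesPerToken;
}

// KV cache: one key and one value vector per layer for every token.
function kvCacheBytes(tokens, { layers, hiddenSize, bytesPerValue }) {
  return 2 * layers * hiddenSize * bytesPerValue * tokens;
}

const sessions = 100_000;
const tokensPerSession = 8 * 1024;
const modelDims = { layers: 32, hiddenSize: 4096, bytesPerValue: 2 }; // fp16

const textTotalGiB = (sessions * textHistoryBytes(tokensPerSession)) / GIB;
const kvPerSessionGiB = kvCacheBytes(tokensPerSession, modelDims) / GIB;
const kvTotalGiB = sessions * kvPerSessionGiB;

console.log(`text history, all sessions: ${textTotalGiB.toFixed(2)} GiB`);
console.log(`KV cache, one session:      ${kvPerSessionGiB.toFixed(2)} GiB`);
console.log(`KV cache, all sessions:     ${kvTotalGiB.toFixed(0)} GiB`);
```

Under these assumptions the text itself is about 3 GiB in total, while pinned inference state is about 4 GiB per session, which is where a stateful design becomes untenable.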

The stateless architecture takes a fundamentally different approach: The server retains no client context; every request must be entirely self-contained. This aligns perfectly with the native design philosophy of the HTTP protocol—each GET/POST request is independent and dependency-free; the server forgets it after processing. The design principles of RESTful APIs, such as using Headers for identity authentication (Authorization) and the Body for the request payload, are the engineering expression of this philosophy.

This brings three decisive advantages:

Freedom of Horizontal Scaling: Because each request is self-contained, it can be processed correctly no matter which server in the cluster it lands on. Adding machines adds capacity roughly linearly, with no state synchronization required.
Strong Fault Tolerance: If one server crashes, requests are automatically transferred to another, completely transparent to the user. Without state, there is no "state loss."
Fairness: All requests are equal in the eyes of the server. No request is pinned to a particular node, so load is not skewed by "sticky sessions" or by giving long-running sessions priority.
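
The difference between the two designs can be sketched with a toy round-robin load balancer. Everything here is hypothetical and in-memory; the "servers" simply report how many messages they can see.

```javascript
// A stateful server keeps history in its own memory, keyed by user.
function createStatefulServer() {
  const sessions = new Map();
  return {
    handle(userId, text) {
      const history = sessions.get(userId) ?? [];
      history.push(text);
      sessions.set(userId, history);
      return history.length; // messages this particular server has seen
    },
  };
}

// A stateless server derives its answer only from the request itself.
function createStatelessServer() {
  return {
    handle(request) {
      return request.messages.length;
    },
  };
}

// Round-robin load balancer over any list of servers.
function createRoundRobin(servers) {
  let next = 0;
  return () => servers[next++ % servers.length];
}

const pickStateful = createRoundRobin([createStatefulServer(), createStatefulServer()]);
const statefulSeen = [
  pickStateful().handle('A', 'My name is Byte Dai'),
  pickStateful().handle('A', 'What is my name?'), // lands on the other server
];

const pickStateless = createRoundRobin([createStatelessServer(), createStatelessServer()]);
const outgoing = ['My name is Byte Dai'];
const statelessSeen = [pickStateless().handle({ messages: [...outgoing] })];
outgoing.push('What is my name?');
statelessSeen.push(pickStateless().handle({ messages: [...outgoing] }));

console.log(statefulSeen);  // each stateful server saw only one message
console.log(statelessSeen); // the second stateless request carries both
```

With stateful servers the second request reaches a node that has never heard of User A. With stateless servers the node does not matter, because the request itself carries both messages.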

III. Stateless LLM Practice: Bringing the Entire Conversation Every Time

The principle of statelessness is clear, but implementing it in an LLM scenario presents an unavoidable challenge: Large models themselves have no memory.

When you ask ChatGPT "What's my name?" and it can answer, it's not because it remembers you. It's because the client, with every request, silently stuffs the entire conversation history—"User said: Please remember my name is Byte Dai -> Assistant said: Okay, I've noted that -> User said: What is my name?"—into the messages array.

This is the core logic demonstrated in demo/index.mjs. Look at this code:

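import OpenAI from 'openai';

// Assumed setup (the original demo's client configuration is not shown):
// an OpenAI-compatible SDK client pointed at DeepSeek. The environment
// variable name is a placeholder.
const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: 'https://api.deepseek.com',
});

// The client, not the server, owns the conversation state.
const chatHistory = [
  { role: 'system', content: 'You are a rigorous assistant' }
];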

// First request
chatHistory.push({ role: 'user', content: 'Please remember my name is Byte Dai' });
const response = await client.chat.completions.create({
  model: 'deepseek-v4-flash',
  messages: chatHistory  // Bring the current history
});
chatHistory.push({ role: 'assistant', content: response.choices[0].message.content });

// Second request
chatHistory.push({ role: 'user', content: 'What is my name?' });
const response2 = await client.chat.completions.create({
  model: 'deepseek-v4-flash',
  messages: chatHistory  // Bring the complete history
});

This simple demo reveals the core engineering paradigm of stateless LLMs: State is not stored on the server side but is maintained by the client and submitted in full with every request. Each time the model performs inference, it's like reading a brand new script—it reads from the first word to the last, then improvises the next segment. It doesn't know what happened in the previous act because it never saw that script. It's the client that stitches the scripts of the two acts together, making it appear to "remember."
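
The same pattern can be wrapped in a small helper so that the bookkeeping happens in one place. This is a sketch, not the demo's actual code: createChatSession and the send callback are hypothetical, and fakeSend stands in for the real API call so that the example runs offline.

```javascript
// Hypothetical helper: `send` is any async function that takes a messages
// array and resolves to the assistant's reply text (for example, a thin
// wrapper around client.chat.completions.create).
function createChatSession(send, systemPrompt) {
  const history = [{ role: 'system', content: systemPrompt }];
  return {
    async ask(userText) {
      history.push({ role: 'user', content: userText });
      // Pass a copy so the transport cannot mutate our history.
      const reply = await send(history.map((m) => ({ ...m })));
      history.push({ role: 'assistant', content: reply });
      return reply;
    },
    get history() {
      return history;
    },
  };
}

// Stand-in for the model: it "remembers" only what the request contains.
async function fakeSend(messages) {
  const intro = messages.find(
    (m) => m.role === 'user' && m.content.includes('my name is')
  );
  const last = messages[messages.length - 1].content;
  if (last === 'What is my name?') {
    return intro
      ? `Your name is ${intro.content.split('my name is ')[1]}`
      : 'I do not know';
  }
  return 'Okay, noted.';
}

async function demoChat() {
  const chat = createChatSession(fakeSend, 'You are a rigorous assistant');
  await chat.ask('Please remember my name is Byte Dai');
  const answer = await chat.ask('What is my name?');
  return { answer, turns: chat.history.length };
}

demoChat().then((result) => console.log(result));
```

A fresh session that asks "What is my name?" without the earlier turn gets "I do not know" from the same stand-in, which is the point: the memory lives entirely in the array the client sends.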

IV. The Cost of chatHistory: From Convenience to Bottleneck

But this paradigm has its own cost. As the conversation continues, the chatHistory array grows with every round: the history sent at round 10 might be 10 times the size of round 1. Because every request resends everything that came before, the input tokens billed per request grow roughly linearly with the number of rounds, and the cumulative total across the conversation grows roughly quadratically. A longer context also means more attention computation, higher GPU memory usage, slower responses, and, most importantly, a higher API bill.
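
How quickly the history adds up can be checked with a little arithmetic. The sketch below assumes every round adds the same number of tokens (200 here, an arbitrary figure), which is a simplification.

```javascript
// Size of the history sent at a given round, assuming each round (user turn
// plus reply) contributes the same number of tokens.
function inputTokensAtRound(round, tokensPerRound) {
  return round * tokensPerRound;
}

// Total input tokens sent across the whole conversation so far.
function cumulativeInputTokens(rounds, tokensPerRound) {
  let total = 0;
  for (let round = 1; round <= rounds; round++) {
    total += inputTokensAtRound(round, tokensPerRound);
  }
  return total; // equals tokensPerRound * rounds * (rounds + 1) / 2
}

for (const rounds of [1, 10, 50]) {
  console.log(
    `round ${rounds}: request = ${inputTokensAtRound(rounds, 200)} tokens, ` +
      `cumulative = ${cumulativeInputTokens(rounds, 200)} tokens`
  );
}
```

At 200 tokens per round, the request at round 50 carries 10,000 input tokens, but the conversation as a whole has sent 255,000: the per-request size grows linearly, the running total quadratically.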

This raises a question: If token costs keep growing as the conversation grows, why can we chat for dozens of rounds in ChatGPT or Claude without noticing a significant slowdown?

A large part of the answer lies in a capacity limit combined with a recency-based eviction strategy, similar in spirit to an LRU (Least Recently Used) cache. The context sent to the model has a "capacity ceiling": either the model's maximum context length (e.g., 128K tokens) or a more conservative threshold set at the application layer (e.g., 32K tokens). When the message history approaches this limit, the oldest messages are summarized, truncated, or discarded outright, leaving only the "working set" of recent interactions. It's like your brain: you don't remember what you had for lunch three days ago, but you do remember what someone said to you five minutes ago, because recent information is usually what matters for the current task.
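
A minimal sketch of such a capacity limit follows. It keeps the system message and then keeps messages from newest to oldest until a token budget is used up. The 4-characters-per-token estimate is a crude assumption (real tokenizers differ), and trimToBudget is a hypothetical helper rather than the strategy any particular product uses.

```javascript
// Crude token estimate: about 4 characters per token. Real tokenizers differ.
function estimateTokens(message) {
  return Math.ceil(message.content.length / 4);
}

// Keep system messages, then add messages from newest to oldest until the
// budget is exhausted. Older messages that do not fit are dropped.
function trimToBudget(messages, maxTokens) {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');

  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (used + cost > maxTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}

const fullHistory = [
  { role: 'system', content: 'You are a rigorous assistant' },
  { role: 'user', content: 'Please remember my name is Byte Dai' },
  { role: 'assistant', content: 'Okay, I have noted that' },
  { role: 'user', content: 'What is my name?' },
];

const trimmed = trimToBudget(fullHistory, 18);
console.log(trimmed.map((m) => `${m.role}: ${m.content}`));
```

With a budget of 18 estimated tokens, the oldest user message no longer fits and is dropped, so the request that asks "What is my name?" goes out without the message that contained the name.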

But this introduces a new dilemma: which history can safely be discarded? Deleting a key constraint the user stated in round 2 could cause the reply in round 15 to miss the mark completely. A purely recency-based strategy such as LRU cannot distinguish between "unnecessary chit-chat" and "core requirements that span the entire conversation."
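
One possible way around this, sketched below, is to let the application mark certain messages as must-keep and evict only from the rest. The pinned flag is a hypothetical application-level convention, not part of any chat API, and the token estimate is the same rough 4-characters-per-token assumption.

```javascript
// Crude token estimate: about 4 characters per token. Real tokenizers differ.
function estimateTokens(message) {
  return Math.ceil(message.content.length / 4);
}

// `pinned` is a hypothetical flag the application sets on messages that must
// survive trimming (for example, a hard constraint stated early on).
function trimWithPins(messages, maxTokens) {
  const mustKeep = (m) => m.role === 'system' || m.pinned === true;

  // System and pinned messages are always kept and charged first, even if
  // they alone exceed the budget.
  const keep = new Set(messages.filter(mustKeep));
  let used = [...keep].reduce((sum, m) => sum + estimateTokens(m), 0);

  // Walk the remaining messages from newest to oldest, keeping what fits.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (mustKeep(m)) continue;
    const cost = estimateTokens(m);
    if (used + cost > maxTokens) break;
    keep.add(m);
    used += cost;
  }

  // Preserve the original order of the surviving messages.
  return messages.filter((m) => keep.has(m));
}

const conversation = [
  { role: 'system', content: 'You are a rigorous assistant' },
  { role: 'user', content: 'Budget must stay under 500 USD', pinned: true },
  { role: 'assistant', content: 'Understood' },
  { role: 'user', content: 'Tell me a joke first' },
  { role: 'assistant', content: 'Why did the token cross the context window?' },
  { role: 'user', content: 'Now suggest a laptop' },
];

const pinnedTrim = trimWithPins(conversation, 22);
console.log(pinnedTrim.map((m) => m.content));
```

With a budget of 22 estimated tokens, the chit-chat in the middle is evicted while the early budget constraint and the latest question both survive. Deciding what to pin is the hard part and is left to the application.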

V. From Prompt Engineering to Loop Engineering: A Three-Stage Cognitive Upgrade

It is precisely this dilemma that drives the continuous evolution of LLM interaction methods. The readme outlines a clear three-stage upgrade path:

Level 1: Prompt Engineering. This is the most primitive and intuitive approach: carefully designing system prompts and crafting every instruction so that the model understands all requirements from a single prompt. It's like drawing a card: a good prompt raises the odds of drawing a gold card, but the outcome is still not precisely controllable. Its ceiling is that, no matter how well the prompt is written, the model can only see the information available at "this current moment."

Level 2: Context Engineering. Here the focus shifts from "writing good prompts" to "managing context well." RAG (Retrieval-Augmented Generation) injects results retrieved dynamically from external knowledge bases into the context. MCP (Model Context Protocol) standardizes how applications expose external tools and data sources, so the model can call them on demand to obtain real-time information. Skills encapsulate domain capabilities in a composable way. The core idea: rather than hoping the model "knows," ensure the model "sees."
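
As an illustration of "ensure the model sees," here is a toy retrieval step that selects relevant snippets and injects them into the messages array. It is a sketch only: the keyword-overlap retriever stands in for what would typically be an embedding search, and retrieve, buildMessages, and the knowledge base contents are all made up for the example.

```javascript
// Toy retriever: scores documents by how many query words they contain.
// A real system would typically use embeddings and a vector index instead.
function retrieve(query, documents, topK = 2) {
  const words = query.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return documents
    .map((text) => ({
      text,
      score: words.filter((w) => text.toLowerCase().includes(w)).length,
    }))
    .filter((d) => d.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((d) => d.text);
}

// Build a self-contained request: retrieved snippets go into the system
// message, the question goes into the user message.
function buildMessages(question, documents) {
  const snippets = retrieve(question, documents);
  const context = snippets.map((s, i) => `[${i + 1}] ${s}`).join('\n');
  return [
    { role: 'system', content: `Answer using only the context below.\n${context}` },
    { role: 'user', content: question },
  ];
}

const knowledgeBase = [
  'Refunds are accepted within 30 days of purchase.',
  'Shipping takes 3 to 5 business days.',
  'Support is available on weekdays.',
];

const ragMessages = buildMessages('How long does shipping take?', knowledgeBase);
console.log(ragMessages);
```

The model call that follows is exactly as stateless as before. The only thing that changed is what was put into the messages array.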

Level 3: Loop Engineering. This is the current frontier: adding a feedback loop on top of context engineering. Harness engineering wraps the model in a harness that extends single-shot inference into an autonomous cycle of "Execute -> Observe -> Reflect -> Re-execute." Each iteration is an independent, stateless LLM call, but the loop itself produces an emergent "sense of state," just as a movie creates the illusion of motion from a sequence of still frames.
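
The loop can be sketched in a few lines. The interfaces here are assumptions made for the example, not a real SDK: callModel resolves to either a tool request or a final answer, tools maps names to async functions, and fakeModel is a scripted stand-in so that the example runs without a network.

```javascript
// Hypothetical interfaces: `callModel(messages)` resolves to either
// { type: 'tool', name, input } or { type: 'final', content };
// `tools` maps tool names to async functions. Neither is a real SDK API.
async function runLoop(task, callModel, tools, maxSteps = 5) {
  const messages = [{ role: 'user', content: task }];
  for (let step = 1; step <= maxSteps; step++) {
    // Each iteration is an independent call that carries the full transcript.
    const decision = await callModel(messages);
    if (decision.type === 'final') {
      return { answer: decision.content, steps: step };
    }
    // Execute the requested tool and feed the observation back in.
    const observation = await tools[decision.name](decision.input);
    messages.push({
      role: 'assistant',
      content: `call ${decision.name}(${decision.input})`,
    });
    messages.push({ role: 'user', content: `observation: ${observation}` });
  }
  throw new Error(`No final answer after ${maxSteps} steps`);
}

// Scripted stand-in for the model: it decides purely from the transcript.
async function fakeModel(messages) {
  const last = messages[messages.length - 1].content;
  if (last.startsWith('observation: ')) {
    const value = last.slice('observation: '.length);
    return { type: 'final', content: `The result is ${value}` };
  }
  return { type: 'tool', name: 'add', input: '2,3' };
}

const fakeTools = {
  add: async (input) =>
    String(input.split(',').map(Number).reduce((a, b) => a + b, 0)),
};

runLoop('What is 2 + 3?', fakeModel, fakeTools).then((result) =>
  console.log(result)
);
```

The apparent continuity between steps comes entirely from the messages array that the harness keeps appending to. The maxSteps cap is a simple guard against a loop that never reaches a final answer.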

VI. Conclusion

Looking back, statelessness is not a flaw of LLMs but the cornerstone of LLM engineering. Precisely because each request is self-contained, dependency-free, and independently processable, we can build elastically scalable clusters on top of it, achieve fault-tolerant transfer, control token budgets, and stack increasingly intelligent system behaviors layer by layer on top of stateless primitives through context engineering and loop engineering.

Once you understand this, when you look at those dazzling AI frameworks and products, you'll find their underlying logic is actually the same—they are all doing the same thing: On top of stateless HTTP requests, precisely managing the matter of "what information to bring to the model." Whether it's a Chatbot's conversation history, RAG's retrieved snippets, an Agent's tool call results, or a Skill's domain instructions—essentially, they are all constructing a messages array that is more "complete" than the last one, and then initiating a new, independent, stateless HTTP request.

This is the ultimate secret of LLM engineering: The model doesn't need memory; you do.