Enterprise RAG Is a Retrieval Problem, Not an LLM Problem


If you've browsed technical articles about RAG, you've likely seen this formula:

RAG = Vector Database + LLM API

This formula itself isn't wrong—but it describes a Demo, not a product.

When you actually need to implement RAG in an enterprise knowledge base scenario, you'll find that the things never appearing in Demos are the real engineering effort: How do documents get ingested? How do you split long documents? How do you extract keywords for Chinese retrieval? If vector search returns 20 items, which 5 are the most suitable to feed into the context? What happens if the rerank service goes down? How does the user know which original documents the answer references? This is just the "usable" level. To reach the "good" level, you also need to answer: What should the retrieval parameters be? How do you know if your parameter tuning is correct or making things worse?

This article is the first in a series. It doesn't focus on a single technical point but lays out the skeleton of the entire project—explaining what an enterprise-grade RAG pipeline that can go from 0 to 1 actually looks like, what problems each module solves, and how I connected them together.

Define the Boundaries First: What This Project Does and Doesn't Do

The scariest thing in a project isn't technical difficulty; it's scope creep. So before writing the first line of code, I drew three lines:

What it does:

Complete RAG main pipeline: Document Upload → Parsing → Chunking → Vectorization → Hybrid Retrieval → Optional Rerank → Context Assembly → LLM Q&A
Supporting management console: Knowledge base management, document lifecycle, chunk-level observability, retrieval parameter tuning
Retrieval evaluation closed loop: Can run evaluation sets, see what each retrieval hit, compare actual effects of different parameters

What it doesn't do (at least not in this version):

Multi-tenancy and permission systems—In an SME scenario, one knowledge base is usually used by one team. Abstracting tenants too early only adds unnecessary complexity.
Complex document preprocessing pipelines—First, use Apache Tika for general parsing; it's sufficient. Things like table extraction, image OCR, hierarchical structure preservation are second-phase tasks.
Self-developed Embedding model—Directly interface with OpenAI-compatible APIs. You can connect DeepSeek, Tongyi Qianwen, or a locally deployed BGE model; I won't lock you in.

Once this boundary is drawn, the scope is clear: One pipeline, one console, one evaluation closed loop. Let's break it down layer by layer.

Overall Architecture: One RAG Pipeline Through 8 Modules

First, a look at the project homepage. On the left is the navigation menu, on the right is the workspace panel, and at the top is a green health check light—proving the service is indeed running.

[Figure: System Homepage]

The menu items you see map directly onto the stages of the RAG main pipeline. Laid out end to end, the pipeline looks roughly like this:

User uploads document → Tika parses plain text → Paragraph-prioritized chunking → Embedding vectorization
                                                 ↓
User asks question → Vector recall + Keyword recall → Optional Rerank → Neighbor Chunk completion
                                                 ↓
            Assemble context → LLM generates answer → Final response with citations

Each block in this pipeline corresponds to an independent Java Package in the backend:

Package	Responsibility	Core Class
`document`	Document upload, async indexing, state machine management	`DocumentIndexService`
`parser`	File format parsing (Tika)	`TikaDocumentParserService`
`chunk`	Paragraph-prioritized splitting, supports sliding window for long paragraphs	`ParagraphTextChunker`
`embedding`	Calls OpenAI-compatible Embedding API	`OpenAiCompatibleEmbeddingService`
`vector`	Qdrant vector store read/write	`QdrantVectorStoreService`
`retrieval`	Hybrid retrieval orchestration: Vector + Keyword + Fusion	`RetrievalService`
`rerank`	Rerank service call, automatic degradation on failure	`HttpRerankService`
`chat`	Q&A orchestration: Retrieval → Context → LLM → Citations	`ChatService`
`evaluation`	Retrieval evaluation: Test case management + scoring	`RagEvaluationService`
`knowledgebase`	Knowledge base and its parameter configuration	`KnowledgeBaseService`
`audit`	Q&A log persistence	`RagQaLog`

Each Package is split by domain, not by layer—the document package contains Controller, Service, and Entity, not the horizontal layering of "all Controllers in one package, all Services in another." The benefit is that when you modify a feature, you only need to jump within one package, not across six or seven packages.

The frontend routes correspond one-to-one with the backend Packages—not a coincidence, but a deliberate design:

// frontend/src/router/index.ts
{ path: "/workspace", component: () => import("@/views/WorkspaceView.vue"), meta: { title: "Workspace" } },
{ path: "/knowledge-bases", component: () => import("@/views/KnowledgeBaseView.vue"), meta: { title: "Knowledge Base" } },
{ path: "/documents", component: () => import("@/views/DocumentView.vue"), meta: { title: "Documents" } },
{ path: "/chunks", component: () => import("@/views/ChunkInspectorView.vue"), meta: { title: "Chunks" } },
{ path: "/settings", component: () => import("@/views/SettingsView.vue"), meta: { title: "Parameter Config" } },
{ path: "/chat", component: () => import("@/views/ChatView.vue"), meta: { title: "Q&A" } },
{ path: "/evaluation", component: () => import("@/views/EvaluationView.vue"), meta: { title: "Evaluation" } },

These 7 pages correspond to the 7 things an operator truly cares about in the RAG workflow: the workspace overview, knowledge base management, document ingestion, chunk observability, parameter tuning, Q&A interaction, and effectiveness evaluation. Not "built because we can," but "the operator genuinely needs to see or adjust something at this stage."

Infrastructure: 6 Containers + 1 Spring Boot

To run an RAG system, Spring Boot alone isn't enough. These 6 components are indispensable:

Service	Purpose	Why It's Not Optional
MySQL 8.4	Document metadata, Chunks, Q&A logs, evaluation test cases	Structured data needs a home
Qdrant	Vector storage and retrieval	Nearest neighbor search for Dense Vectors; MySQL can't do this
MinIO	Original file storage	Files shouldn't be stuffed into databases; this is common sense
Embedding Model	Text-to-vector	External APIs work too, but a local model avoids the network round trip to an external provider and has no per-call fee
Reranker Model	Retrieval result refinement	Runs on CPU, speed is sufficient, accuracy improvement is noticeable
Nginx (Embedding/Reranker Proxy)	API Authentication	Model containers have no built-in auth mechanism; wrap an Nginx layer outside for Bearer Token verification

For the deployment side, I'll just paste one startup command:

docker compose --env-file .env up -d

This compose file does several things that are "productization necessities but never appear in Demos":

Non-standard ports: MySQL doesn't use 3306, changed to 23306; Qdrant doesn't use 6333, changed to 26333. Reduces port conflicts and scan probability.
Model containers not directly exposed: The reranker-model and embedding-model containers have no network exposure themselves; an Nginx layer wraps them for Authorization: Bearer <token> verification.
Data persisted to the compose file directory: ./data/mysql, ./data/qdrant, ./data/minio—you can delete containers, recreate containers, and data won't be lost.
All sensitive info via env: ${MINIO_SECRET_KEY}, ${QDRANT_API_KEY}, ${RERANK_API_KEY}—the env file goes into .gitignore.

Main Pipeline Breakdown: From Document Upload to Q&A Response

Let's walk through the main pipeline in order. No large code blocks, just expanding on key decision points.

1. Document Ingestion: A 5-State Async Pipeline

Document upload is not synchronous: parsing large files can take tens of seconds, and making the HTTP request wait that long is unrealistic. So the upload request does only the minimum: the file lands in MinIO, a document record is created in the database (status = UPLOADED), and the indexing work is handed to a thread pool for async processing.

// DocumentIndexService.java —— Core Entry Point
public DocumentResponse uploadAndIndex(DocumentUploadRequest request) {
    StoredFile storedFile = fileStorageService.upload(file);
    RagDocument document = createDocument(file, storedFile, ...);
    submitIndexingTask(document.getId());  // Async
    return DocumentResponse.from(document);
}

The async pipeline has 5 state nodes:

UPLOADED → PARSING → CHUNKING → INDEXING → INDEXED
   ↓ Any node fails
FAILED (records failureReason)

A small but important design choice here: every state transition is written to the database as it happens. The benefit is that when the frontend polls the document status, it can see which step the document is at. The cost is one extra UPDATE per transition, and at this throughput that UPDATE is not worth optimizing away.
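
The state flow above can be sketched as an enum that knows its legal transitions. This is a minimal illustration, not the project's actual code: the name `DocumentStatus` and the method `canTransitionTo` are hypothetical, and treating FAILED as terminal (no retry) is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical status enum mirroring the five pipeline states plus FAILED. */
enum DocumentStatus {
    UPLOADED, PARSING, CHUNKING, INDEXING, INDEXED, FAILED;

    /** States reachable from this one: the next pipeline step, or FAILED. */
    Set<DocumentStatus> allowedNext() {
        return switch (this) {
            case UPLOADED -> EnumSet.of(PARSING, FAILED);
            case PARSING -> EnumSet.of(CHUNKING, FAILED);
            case CHUNKING -> EnumSet.of(INDEXING, FAILED);
            case INDEXING -> EnumSet.of(INDEXED, FAILED);
            // Assumption: both end states are terminal in this sketch.
            case INDEXED, FAILED -> EnumSet.noneOf(DocumentStatus.class);
        };
    }

    /** True if moving from this state to {@code next} follows the pipeline order. */
    boolean canTransitionTo(DocumentStatus next) {
        return allowedNext().contains(next);
    }
}
```

A service would call `canTransitionTo` before each status UPDATE, so an out-of-order write (for example UPLOADED straight to INDEXED) is rejected instead of silently persisted.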

The Tika parsing stage uses tika-parsers-standard-package, capable of handling common formats like PDF, Word, PPT, HTML, and plain text. After parsing, a plain text string is obtained and enters the chunking phase.

2. Chunking: Paragraph-First, Not Blind Slicing

The chunking strategy determines the upper limit of retrieval quality. This project uses paragraph-prioritized chunking:

// ParagraphTextChunker.java —— Core Logic
for (String paragraph : normalizedText.split("\\n\\s*\\n")) {
    if (paragraph.length() > maxChars) {
        splitLongParagraph(chunks, paragraph, maxChars, overlapChars);  // Sliding window for long paragraphs
    } else if (current.length() + paragraph.length() + 2 > maxChars) {
        flushCurrent(chunks, current);  // Current chunk is full, archive it
        current.append(paragraph);      // Start a new chunk with the new paragraph
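    } else {
        if (current.length() > 0) {
            current.append("\n\n");     // Paragraph separator; this is the "+ 2" budgeted above
        }
        current.append(paragraph);      // Add paragraph to the current chunk
    }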
}

The logic is straightforward: use paragraphs as the minimum unit to assemble chunks; if it doesn't fit, start a new chunk. Only when a single paragraph exceeds maxChars (default 1000 characters) is sliding window splitting applied, with a window overlap of 150 characters.

Why do this? Because paragraphs in knowledge base documents naturally have semantic boundaries. If you forcibly splice the last two sentences of one paragraph with the first two sentences of the next, the generated vector is neither "like" the previous paragraph nor "like" the next, pleasing no one during retrieval.
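
For the oversized-paragraph case, the sliding window can be sketched as follows. This is an illustrative guess at what a helper like splitLongParagraph might do, not the project's implementation; the class name `SlidingWindowSplitter` and its signature are hypothetical, and it cuts purely by character count without looking for sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSplitter {
    /**
     * Splits text into windows of at most maxChars characters.
     * Each window starts (maxChars - overlapChars) characters after the previous one,
     * so consecutive windows share overlapChars characters.
     */
    static List<String> split(String text, int maxChars, int overlapChars) {
        if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
            throw new IllegalArgumentException("need 0 <= overlapChars < maxChars");
        }
        List<String> windows = new ArrayList<>();
        int step = maxChars - overlapChars;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(text.length(), start + maxChars);
            windows.add(text.substring(start, end));
            if (end == text.length()) {
                break; // The last window reached the end of the text.
            }
        }
        return windows;
    }
}
```

With the defaults mentioned above (1000 characters, 150 overlap), a 2500-character paragraph yields three windows starting at offsets 0, 850, and 1700.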

3. Embedding: OpenAI-Compatible Protocol, Model Swappable at Will

The Embedding layer has no self-developed logic; it's just an HTTP call. The interface follows OpenAI's /v1/embeddings format, so MODEL_EMBEDDING_BASE_URL can point at any provider that exposes a compatible embeddings endpoint, or at http://embedding:80 for a local BGE model.

rag:
  model:
    embedding:
      base-url: ${MODEL_EMBEDDING_BASE_URL:}       # Supports any compatible service
      model: ${MODEL_EMBEDDING_MODEL:text-embedding-3-large}

Vector dimensions follow the model: using BGE base gives 768 dimensions; using text-embedding-3-large gives 3072 dimensions. The dimension must match when creating the Qdrant collection, so changing the model = changing the collection = rebuilding the index. This constraint is inherent to vector retrieval and unrelated to the architecture.
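
Because a dimension mismatch only surfaces when a write or search hits Qdrant, it can help to fail early on the application side. The guard below is a hypothetical sketch (the class `EmbeddingDimensionGuard` is not part of the project as described); it assumes the collection's dimension is known to the caller, for example from configuration.

```java
final class EmbeddingDimensionGuard {
    /** Throws if the vector's length differs from the dimension the collection was created with. */
    static void requireMatch(float[] vector, int collectionDimension) {
        if (vector.length != collectionDimension) {
            throw new IllegalStateException(
                "Embedding has " + vector.length + " dimensions but the collection expects "
                    + collectionDimension
                    + "; recreate the collection and re-index after changing models");
        }
    }
}
```

Calling `requireMatch(vector, 768)` with a 3072-dimension vector fails immediately with a message that points at the real cause, a model change without a rebuilt index.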

4. Hybrid Retrieval: Vector + Keyword + Fusion Sorting + Optional Rerank

This is the part of the project most worth expanding on. Pure vector retrieval has a serious weakness: it's insensitive to proper nouns, abbreviations, and codes. For example, if you search "HR-2024-003", the cosine similarity between the chunk containing that number and your query vector might not be high, because the Embedding model doesn't know your company's internal document numbering rules.

So the retrieval walks on two legs:

Left Leg—Vector Recall: Use question to generate query vector → Qdrant searches topK (default 20). This is semantic-level recall, broad coverage.

Right Leg—Keyword Recall: Extract keywords from question → MySQL full-text index search → If full-text index misses, degrade to LIKE. This is literal-level supplementary recall, specifically targeting proper nouns.

Keyword extraction includes Chinese adaptation:

// For tokens containing Chinese, do 4-character window splitting
// "核心存储有哪些" → "核心存储", "心存储有", "存储有哪", "储有哪些"
if (containsCjk(token) && token.length() > CJK_NGRAM_LENGTH) {
    for (int i = 0; i <= token.length() - CJK_NGRAM_LENGTH; i++) {
        keywords.add(token.substring(i, i + CJK_NGRAM_LENGTH));
    }
}

The motivation for this approach: Chinese doesn't have natural space-delimited word boundaries like English. If a user inputs "核心存储有哪些功能" and you don't n-gram it, the full-text index might not match a single word.
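
A self-contained version of the n-gram idea might look like this. It is a sketch under assumptions: the class name `CjkKeywordExtractor`, the tokenizing regex, and the Han-script check inside `containsCjk` are illustrative choices, and the window size of 4 is taken from the comment in the excerpt above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CjkKeywordExtractor {
    static final int CJK_NGRAM_LENGTH = 4; // Assumed window size.

    /** True if any code point in the token belongs to the Han script. */
    static boolean containsCjk(String token) {
        return token.codePoints()
            .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Splits on whitespace and sentence punctuation, then n-grams long CJK tokens. */
    static Set<String> extract(String question) {
        Set<String> keywords = new LinkedHashSet<>(); // Keeps first-seen order, drops duplicates.
        // Hyphens are deliberately not delimiters, so codes like HR-2024-003 stay whole.
        for (String token : question.split("[\\s,;!?,。;!?、]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (containsCjk(token) && token.length() > CJK_NGRAM_LENGTH) {
                for (int i = 0; i <= token.length() - CJK_NGRAM_LENGTH; i++) {
                    keywords.add(token.substring(i, i + CJK_NGRAM_LENGTH));
                }
            } else {
                keywords.add(token); // Short or non-CJK tokens are kept as-is.
            }
        }
        return keywords;
    }
}
```

Note that `length()` and `substring` count UTF-16 units, so this sketch does not handle rare Han characters outside the Basic Multilingual Plane.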

After merging the two result sets, each path has its own score: vector results carry a cosine score, keyword results get a fixed weak score (0.2). Fusion sorting takes the top K by total score descending.
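
The fusion step can be sketched as below. The names `ScoreFusion`, `Scored`, and `fuse` are hypothetical, and two details are assumptions rather than facts from the project: a chunk hit by both paths has its scores added together, and ties are broken by chunk id.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ScoreFusion {
    static final double KEYWORD_SCORE = 0.2; // Fixed weak score for keyword hits.

    record Scored(long chunkId, double score) {}

    /** Merges vector hits (chunkId to cosine score) with keyword hits and returns the top K. */
    static List<Scored> fuse(Map<Long, Double> vectorHits, Set<Long> keywordHits, int topK) {
        Map<Long, Double> total = new HashMap<>(vectorHits);
        for (Long chunkId : keywordHits) {
            // Assumption: a chunk found by both paths gets the sum of both scores.
            total.merge(chunkId, KEYWORD_SCORE, Double::sum);
        }
        return total.entrySet().stream()
            .map(e -> new Scored(e.getKey(), e.getValue()))
            .sorted(Comparator.<Scored>comparingDouble(Scored::score).reversed()
                .thenComparingLong(Scored::chunkId))
            .limit(topK)
            .toList();
    }
}
```

With this scheme a keyword-only hit (0.2) outranks a weak vector-only hit (say 0.10), which is how a literal match on a document number can surface even when its embedding similarity is low.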

If Rerank is enabled (RAG_RERANK_ENABLED=true), the fused candidate set is sent to the Reranker for refinement. The Reranker is a Cross-Encoder: it scores the question and each candidate chunk together rather than comparing two independently computed vectors, which is typically more accurate but costs one model pass per candidate. It also adds a dependency on another service, so what happens if that call fails?

if (Boolean.TRUE.equals(options.rerankEnabled()) && rerankService.available()) {
    try {
        return rerank(question, candidates, options);      // Refined ranking
    } catch (RuntimeException ex) {
        log.warn("Rerank failed, fallback to fused retrieval score", ex);  // Auto-degrade to fusion score, keep the stack trace
    }
}
return sortByFusedScore(candidates, options);  // Degradation path

Key design: Rerank is an enhancement path, not the main pipeline. If it fails, the system falls back to fusion sorting and continues working, without affecting Q&A availability.

5. Neighbor Chunk Completion and Context Assembly

A piece of knowledge likely spans two chunks. For example, "The API key acquisition method is as follows:" is in the 5th chunk, and the actual steps are in the 6th chunk. If only the 5th is fed into the context, the LLM can only fabricate an answer.

So after retrieval results come out, for each hit chunk, expand N neighbors forward and backward (default 1 each):

int startIndex = Math.max(0, chunk.getChunkIndex() - options.neighborBefore());
int endIndex = chunk.getChunkIndex() + options.neighborAfter();
List<RagChunk> neighborChunks = chunkRepository
    .findByDocumentIdAndChunkIndexBetweenOrderByChunkIndexAsc(
        chunk.getDocument().getId(), startIndex, endIndex);

The completed chunk list is then truncated by context-max-chars (default 8000)—first come, first served; excess is discarded. This prevents the context from exceeding the LLM window while ensuring the most relevant chunks are definitely at the front.
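
Two details of this step are easy to get wrong: overlapping neighbor ranges (hits on chunks 5 and 6 both pull in each other) and the character budget. The sketch below illustrates both for chunks of a single document. The class `ContextWindow` and its methods are hypothetical, and stopping at the first chunk that does not fit, rather than cutting it mid-text, is an assumption about what "excess is discarded" means.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ContextWindow {
    /** Expands each hit index by its neighbors, dropping duplicates but keeping rank order. */
    static List<Integer> expandWithNeighbors(List<Integer> hitIndexes, int before, int after) {
        Set<Integer> ordered = new LinkedHashSet<>();
        for (int hit : hitIndexes) {
            // Indexes past the last chunk are harmless: they simply match no stored chunk.
            for (int i = Math.max(0, hit - before); i <= hit + after; i++) {
                ordered.add(i);
            }
        }
        return new ArrayList<>(ordered);
    }

    /** Keeps chunks in order until adding the next one would exceed maxChars. */
    static List<String> truncate(List<String> rankedChunks, int maxChars) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String chunk : rankedChunks) {
            if (used + chunk.length() > maxChars) {
                break; // Assumption: whole chunks only, nothing is cut mid-text.
            }
            kept.add(chunk);
            used += chunk.length();
        }
        return kept;
    }
}
```

For example, hits on chunks 5 and 6 with one neighbor on each side expand to 4, 5, 6, 7 rather than six entries with duplicates.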

The context assembly format is like this:

Citation 1, Document: "Employee Handbook v3", Fragment: 5
(Full content of the 5th chunk)

Citation 2, Document: "Employee Handbook v3", Fragment: 6
(Full content of the 6th chunk)

The prompt given to the LLM explicitly tells it "how to annotate citation sources," so the answer can include citation numbers.
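
Rendering that layout is a simple string-building exercise. The sketch below reproduces the format shown above; the names `ContextAssembler` and `Citation` are hypothetical, and the exact separators the project uses are assumed to be a newline after the header line and a blank line between citations.

```java
import java.util.List;

final class ContextAssembler {
    record Citation(String documentName, int chunkIndex, String content) {}

    /** Renders numbered citation blocks separated by a blank line. */
    static String assemble(List<Citation> citations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < citations.size(); i++) {
            Citation c = citations.get(i);
            if (i > 0) {
                sb.append("\n\n");
            }
            sb.append("Citation ").append(i + 1)
                .append(", Document: \"").append(c.documentName())
                .append("\", Fragment: ").append(c.chunkIndex())
                .append('\n')
                .append(c.content());
        }
        return sb.toString();
    }
}
```

Because the citation number is just the position in the list, the same list can be returned to the frontend so that a "[1]" in the answer maps back to a concrete document and chunk.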

6. Q&A Orchestration: Two Modes + Audit Log

ChatService exposes three endpoints: the two Q&A modes plus a retrieval-only call:

Synchronous Q&A (POST /api/chat): Retrieval → LLM Generation → One-time return of answer + citations
Streaming Q&A (POST /api/chat/stream): SSE pushes tokens, runs on Java 21 virtual threads, transactions manually managed with TransactionTemplate to avoid long SSE transactions
Retrieval Only (POST /api/chat/retrieve): Only runs retrieval, doesn't call LLM, returns which chunks were hit

Every Q&A session records an audit log (RagQaLog): question, answer, retrievedChunkIds, modelName, latencyMs. Doesn't record IP, doesn't record user—the first version for SMEs doesn't need these things.

Configuration Design: Global Defaults + Knowledge Base Override + Request-Level Override

Retrieval parameters are not hardcoded. Each knowledge base can have its own default parameters, and each request can temporarily override them. The priority is:

Request Parameters > Knowledge Base Config > application.yml Global Defaults

# application.yml —— Global Defaults
rag:
  retrieval:
    final-top-k: 5
    vector-top-k: 20
    keyword-top-k: 20
    rerank-enabled: false        # Default off, enable after configuring rerank service
    neighbor-enabled: true
    neighbor-before: 1
    neighbor-after: 1
    context-max-chars: 8000
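
The three-level priority can be sketched as a null-coalescing lookup. This is an illustration only: `RetrievalOptionResolver` is a hypothetical name, and it assumes that an unset value at the request or knowledge-base level is represented as null.

```java
import java.util.Objects;

final class RetrievalOptionResolver {
    /** Returns the first non-null value: request, then knowledge base, then global default. */
    static <T> T resolve(T requestValue, T knowledgeBaseValue, T globalDefault) {
        if (requestValue != null) {
            return requestValue;
        }
        if (knowledgeBaseValue != null) {
            return knowledgeBaseValue;
        }
        return Objects.requireNonNull(globalDefault, "global default must be set");
    }
}
```

For instance, `resolve(null, 8, 5)` returns 8: the request did not specify final-top-k, so the knowledge base's value wins over the global default of 5.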

This design changes "parameter tuning" from "change code → redeploy" to "adjust sliders on the frontend Settings page → click save → directly test the effect on the Chat page." The evaluation closed loop is thus made possible—you can run the same evaluation set under different parameter combinations and let the data speak, rather than relying on gut feeling.
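
To make "let the data speak" concrete, here is one simple metric such an evaluation run could compute. Hit rate at K is my choice for illustration, not necessarily what the project's scoring uses, and the names `RetrievalEvaluator` and `TestCase` are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class RetrievalEvaluator {
    /** One evaluation case: the chunks that should be found and what retrieval returned, in rank order. */
    record TestCase(String question, Set<Long> expectedChunkIds, List<Long> retrievedChunkIds) {}

    /** Fraction of cases where at least one expected chunk appears in the top k results. */
    static double hitRateAtK(List<TestCase> cases, int k) {
        if (cases.isEmpty()) {
            return 0.0;
        }
        long hits = cases.stream()
            .filter(tc -> tc.retrievedChunkIds().stream()
                .limit(k)
                .anyMatch(tc.expectedChunkIds()::contains))
            .count();
        return (double) hits / cases.size();
    }
}
```

Running the same cases with two parameter sets and comparing the resulting numbers turns a tuning decision into a measurable difference.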

Frontend: Not Just a "Shell"

The frontend uses Vue 3 + Naive UI + Pinia. After Vite builds, it's directly placed under src/main/resources/static/. A single Spring Boot JAR serves both frontend static resources and backend APIs. No need to deploy Nginx separately to host the frontend.

A few points worth mentioning:

ConsoleLayout is a single-page layout, not multi-page navigation. Switching the left menu only changes the content in router-view; the knowledge base selection and health status are always visible.
The ChunkInspector page can view all chunk content and index positions by document—if something goes wrong during the indexing phase, you don't guess, you look directly.
The Settings page turns all retrieval parameters into visual controls (sliders, switches, dropdowns). Parameter changes take effect immediately, no restart needed.

Current Status and Next Steps

The main pipeline is fully operational:

✅ Document Upload → Parsing → Chunking → Embedding → Indexing
✅ Hybrid Retrieval (Vector + Keyword)
✅ Rerank Refinement (Optional, degradation available)
✅ Neighbor Chunk Context Completion
✅ Synchronous + Streaming LLM Q&A
✅ Retrieval Evaluation Framework
✅ Knowledge Base-Level Parameter Configuration

The Q&A page and evaluation page still need product-level rework—the current features can run, but the interaction experience and data visualization are only at a "usable" level. The evaluation page needs more visual presentations (like nDCG curves, parameter comparison views), and the Q&A page needs better citation display and follow-up question experience.

The subsequent 11 articles in the series will unfold one by one: deployment, document ingestion, chunking strategy, Embedding technology selection, vector database practice, hybrid retrieval details, Rerank deployment and tuning, context assembly techniques, Q&A product design, evaluation system construction, and finally, an engineering retrospective.

Final Words

The core point this article wants to convey is really just one thing: The engineering complexity of enterprise RAG lies not in the LLM, but in retrieval. The large model API is the last mile of the RAG pipeline, but you have to get the first ninety-nine miles running smoothly for that last mile to be meaningful. The first ninety-nine miles include: how documents come in, how they are split, how they are retrieved, how retrieval results are refined and completed, and how the entire pipeline is observed and optimized.

This project is prepared for SMEs to "use the right architecture from day one"—not a Demo, but an engineering baseline from 0 to 1. You can swap models, vector databases, or frontend frameworks on this baseline, but the pipeline skeleton and module splitting logic are reusable.