
Enterprise RAG Is a Retrieval Problem, Not an LLM Problem



If you've read technical articles about RAG, you've probably seen this formula:

RAG = Vector Database + LLM API

This formula itself isn't wrong—but it describes a Demo, not a product.

When you actually implement RAG for an enterprise knowledge base, you'll find that the things Demos never show are where the real engineering effort goes: How do documents get ingested? How do you split long documents? How do you extract keywords for Chinese retrieval? If vector search returns 20 items, which 5 should go into the context? What happens if the rerank service goes down? How does the user know which source documents the answer cites? And that is just the "usable" level. To reach the "good" level, you also need to answer: What should the retrieval parameters be? How do you know whether your parameter tuning is helping or making things worse?

This article is the first in a series. It doesn't focus on a single technical point but lays out the skeleton of the entire project—explaining what an enterprise-grade RAG pipeline that can go from 0 to 1 actually looks like, what problems each module solves, and how I connected them together.

Define the Boundaries First: What This Project Does and Doesn't Do

The scariest thing in a project isn't technical difficulty; it's scope creep. So before writing the first line of code, I drew three lines:

What it does:

What it doesn't do (at least not in this version):

Once this boundary is drawn, the scope is clear: One pipeline, one console, one evaluation closed loop. Let's break it down layer by layer.

Overall Architecture: One RAG Pipeline Through 8 Modules

First, a look at the project homepage. On the left is the navigation menu, on the right is the workspace panel, and at the top is a green health check light—proving the service is indeed running.

Figure: System Homepage

The menu bar you see corresponds exactly to the stages of the main RAG pipeline. Laid out end to end, that pipeline looks roughly like this:

User uploads document → Tika parses plain text → Paragraph-prioritized chunking → Embedding vectorization
                                                 ↓
User asks question → Vector recall + Keyword recall → Optional Rerank → Neighbor Chunk completion
                                                 ↓
            Assemble context → LLM generates answer → Final response with citations

Each block in this pipeline corresponds to an independent Java Package in the backend:

| Package | Responsibility | Core Class |
|---|---|---|
| document | Document upload, async indexing, state machine management | DocumentIndexService |
| parser | File format parsing (Tika) | TikaDocumentParserService |
| chunk | Paragraph-prioritized splitting, supports sliding window for long paragraphs | ParagraphTextChunker |
| embedding | Calls OpenAI-compatible Embedding API | OpenAiCompatibleEmbeddingService |
| vector | Qdrant vector store read/write | QdrantVectorStoreService |
| retrieval | Hybrid retrieval orchestration: Vector + Keyword + Fusion | RetrievalService |
| rerank | Rerank service call, automatic degradation on failure | HttpRerankService |
| chat | Q&A orchestration: Retrieval → Context → LLM → Citations | ChatService |
| evaluation | Retrieval evaluation: Test case management + scoring | RagEvaluationService |
| knowledgebase | Knowledge base and its parameter configuration | KnowledgeBaseService |
| audit | Q&A log persistence | RagQaLog |

Each Package is split by domain, not by layer—the document package contains Controller, Service, and Entity, not the horizontal layering of "all Controllers in one package, all Services in another." The benefit is that when you modify a feature, you only need to jump within one package, not across six or seven packages.

The frontend routes correspond one-to-one with the backend Packages—not a coincidence, but a deliberate design:

// frontend/src/router/index.ts
{ path: "/workspace", component: () => import("@/views/WorkspaceView.vue"), meta: { title: "Workspace" } },
{ path: "/knowledge-bases", component: () => import("@/views/KnowledgeBaseView.vue"), meta: { title: "Knowledge Base" } },
{ path: "/documents", component: () => import("@/views/DocumentView.vue"), meta: { title: "Documents" } },
{ path: "/chunks", component: () => import("@/views/ChunkInspectorView.vue"), meta: { title: "Chunks" } },
{ path: "/settings", component: () => import("@/views/SettingsView.vue"), meta: { title: "Parameter Config" } },
{ path: "/chat", component: () => import("@/views/ChatView.vue"), meta: { title: "Q&A" } },
{ path: "/evaluation", component: () => import("@/views/EvaluationView.vue"), meta: { title: "Evaluation" } },

These 7 pages correspond to the 7 things an operator truly cares about in the RAG workflow: the workspace overview, knowledge base management, document ingestion, chunk observability, parameter tuning, Q&A interaction, and effectiveness evaluation. They exist not because "we can build them," but because the operator genuinely needs to see or adjust something at that stage.

Infrastructure: 6 Containers + 1 Spring Boot

To run an RAG system, Spring Boot alone isn't enough. These 6 components are indispensable:

| Service | Purpose | Why It's Not Optional |
|---|---|---|
| MySQL 8.4 | Document metadata, Chunks, Q&A logs, evaluation test cases | Structured data needs a home |
| Qdrant | Vector storage and retrieval | Nearest neighbor search for Dense Vectors; MySQL can't do this |
| MinIO | Original file storage | Files shouldn't be stuffed into databases; this is common sense |
| Embedding Model | Text-to-vector | External APIs work too, but a local model avoids per-call fees and the network round trip to an external service |
| Reranker Model | Retrieval result refinement | Runs on CPU, speed is sufficient, accuracy improvement is noticeable |
| Nginx (Embedding/Reranker Proxy) | API Authentication | Model containers have no built-in auth mechanism; wrap an Nginx layer outside for Bearer Token verification |

For the deployment side, I'll just paste one startup command:

docker compose --env-file .env up -d

This compose file does several things that are "productization necessities but never appear in Demos":

  1. Non-standard ports: MySQL doesn't use 3306, changed to 23306; Qdrant doesn't use 6333, changed to 26333. Reduces port conflicts and scan probability.
  2. Model containers not directly exposed: The reranker-model and embedding-model containers have no network exposure themselves; an Nginx layer wraps them for Authorization: Bearer <token> verification.
  3. Data persisted to the compose file directory: ./data/mysql, ./data/qdrant, ./data/minio—you can delete containers, recreate containers, and data won't be lost.
  4. All sensitive info via env: ${MINIO_SECRET_KEY}, ${QDRANT_API_KEY}, ${RERANK_API_KEY}—the env file goes into .gitignore.

Main Pipeline Breakdown: From Document Upload to Q&A Response

Let's walk through the main pipeline in order. No large code blocks, just expanding on key decision points.

1. Document Ingestion: A 5-State Async Pipeline

Document upload is not synchronous: parsing a large file can take tens of seconds, and making the HTTP request wait that long is unrealistic. So the upload request itself does only two things: the file lands in MinIO, and a document record is created in the database (status = UPLOADED). Indexing is then handed to a thread pool for async processing.

// DocumentIndexService.java —— Core Entry Point
public DocumentResponse uploadAndIndex(DocumentUploadRequest request) {
    StoredFile storedFile = fileStorageService.upload(file);
    RagDocument document = createDocument(file, storedFile, ...);
    submitIndexingTask(document.getId());  // Async
    return DocumentResponse.from(document);
}

The async pipeline has 5 state nodes:

UPLOADED → PARSING → CHUNKING → INDEXING → INDEXED
   ↓ Any node fails
FAILED (records failureReason)
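
The state flow above can be written down as an enum that knows its legal successors. This is a minimal sketch rather than the project's actual code: the enum name `DocumentStatus` and the methods `allowedNext` and `canTransitionTo` are hypothetical, and treating FAILED as terminal (no automatic retry) is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the indexing states; names are illustrative. */
enum DocumentStatus {
    UPLOADED, PARSING, CHUNKING, INDEXING, INDEXED, FAILED;

    /** States that may legally follow this one. */
    Set<DocumentStatus> allowedNext() {
        switch (this) {
            case UPLOADED: return EnumSet.of(PARSING, FAILED);
            case PARSING:  return EnumSet.of(CHUNKING, FAILED);
            case CHUNKING: return EnumSet.of(INDEXING, FAILED);
            case INDEXING: return EnumSet.of(INDEXED, FAILED);
            default:       // INDEXED and FAILED are assumed to be terminal
                return EnumSet.noneOf(DocumentStatus.class);
        }
    }

    /** Guard to call before persisting a status change. */
    boolean canTransitionTo(DocumentStatus next) {
        return allowedNext().contains(next);
    }
}
```

Checking `canTransitionTo` before each UPDATE would keep a buggy worker from skipping a stage or reviving a finished document.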

A small but important design choice here: every state transition writes to the database in real-time. The benefit is that when the frontend polls the document status, it can see "which step it's at." The downside is one extra UPDATE. At this throughput level, saving one UPDATE is absolutely not worth it.

The Tika parsing stage uses tika-parsers-standard-package, capable of handling common formats like PDF, Word, PPT, HTML, and plain text. After parsing, a plain text string is obtained and enters the chunking phase.

2. Chunking: Paragraph-First, Not Blind Slicing

The chunking strategy determines the upper limit of retrieval quality. This project uses paragraph-prioritized chunking:

// ParagraphTextChunker.java —— Core Logic
for (String paragraph : normalizedText.split("\\n\\s*\\n")) {  // Blank line(s) mark a paragraph boundary
    if (paragraph.length() > maxChars) {
        splitLongParagraph(chunks, paragraph, maxChars, overlapChars);  // Sliding window for long paragraphs
    } else if (current.length() + paragraph.length() + 2 > maxChars) {
        flushCurrent(chunks, current);  // Current chunk is full, archive it
        current.append(paragraph);      // Start a new chunk with the new paragraph
    } else {
        current.append(paragraph);      // Add paragraph to the current chunk
    }
}

The logic is straightforward: use paragraphs as the minimum unit to assemble chunks; if it doesn't fit, start a new chunk. Only when a single paragraph exceeds maxChars (default 1000 characters) is sliding window splitting applied, with a window overlap of 150 characters.

Why do this? Because paragraphs in knowledge base documents naturally have semantic boundaries. If you forcibly splice the last two sentences of one paragraph with the first two sentences of the next, the generated vector is neither "like" the previous paragraph nor "like" the next, pleasing no one during retrieval.
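
For the oversized-paragraph case, the sliding window itself can be sketched as follows. This is an illustration under assumptions, not the project's `splitLongParagraph`: the class name `SlidingWindowSketch` is hypothetical, and this version returns a list instead of appending to a shared `chunks` collection.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    /**
     * Splits one paragraph into windows of at most maxChars characters,
     * where each window starts overlapChars before the previous one ended.
     */
    static List<String> splitLongParagraph(String paragraph, int maxChars, int overlapChars) {
        if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
            throw new IllegalArgumentException("need 0 <= overlapChars < maxChars");
        }
        List<String> windows = new ArrayList<>();
        int step = maxChars - overlapChars; // how far each window advances
        for (int start = 0; start < paragraph.length(); start += step) {
            int end = Math.min(start + maxChars, paragraph.length());
            windows.add(paragraph.substring(start, end));
            if (end == paragraph.length()) {
                break; // the last window already reached the end
            }
        }
        return windows;
    }
}
```

With the defaults quoted above (1000 characters, 150 overlap) each window advances by 850 characters, so a 2000-character paragraph yields three windows.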

3. Embedding: OpenAI-Compatible Protocol, Model Swappable at Will

The Embedding layer has no self-developed logic; it's just an HTTP call. The interface is compatible with OpenAI's /v1/embeddings format. You can fill MODEL_EMBEDDING_BASE_URL with DeepSeek's API, or http://embedding:80 for a local BGE model.

rag:
  model:
    embedding:
      base-url: ${MODEL_EMBEDDING_BASE_URL:}       # Supports any compatible service
      model: ${MODEL_EMBEDDING_MODEL:text-embedding-3-large}

Vector dimensions follow the model: using BGE base gives 768 dimensions; using text-embedding-3-large gives 3072 dimensions. The dimension must match when creating the Qdrant collection, so changing the model = changing the collection = rebuilding the index. This constraint is inherent to vector retrieval and unrelated to the architecture.

4. Hybrid Retrieval: Vector + Keyword + Fusion Sorting + Optional Rerank

This is the most worthwhile part of the entire project to expand upon. Pure vector retrieval has a fatal problem: it's insensitive to proper nouns, abbreviations, and codes. For example, if you search "HR-2024-003", the cosine similarity between this chunk and your query vector in vector space might not be high, because the Embedding model doesn't know your company's internal document numbering rules.

So the retrieval walks on two legs:

Left Leg—Vector Recall: Use question to generate query vector → Qdrant searches topK (default 20). This is semantic-level recall, broad coverage.

Right Leg—Keyword Recall: Extract keywords from question → MySQL full-text index search → If full-text index misses, degrade to LIKE. This is literal-level supplementary recall, specifically targeting proper nouns.

Keyword extraction includes Chinese adaptation:

// For tokens containing Chinese, do 4-character window splitting
// "核心存储有哪些" → "核心存储", "心存储有", "存储有哪", "储有哪些"
if (containsCjk(token) && token.length() > CJK_NGRAM_LENGTH) {
    for (int i = 0; i <= token.length() - CJK_NGRAM_LENGTH; i++) {
        keywords.add(token.substring(i, i + CJK_NGRAM_LENGTH));
    }
}

The motivation for this approach: Chinese doesn't have natural space-delimited word boundaries like English. If a user inputs "核心存储有哪些功能" and you don't n-gram it, the full-text index might not match a single word.
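
A self-contained version of the n-gram step might look like the sketch below. Several parts are assumptions: the class name `CjkKeywordSketch`, tokenizing on whitespace only, detecting CJK via the Han Unicode script, and keeping short or non-CJK tokens unchanged are not confirmed by the excerpt.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CjkKeywordSketch {
    static final int CJK_NGRAM_LENGTH = 4;

    /** True if the token contains at least one Han ideograph (assumed definition of "CJK"). */
    static boolean containsCjk(String token) {
        return token.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Whitespace tokens; long CJK tokens are expanded into overlapping 4-character windows. */
    static Set<String> extractKeywords(String question) {
        Set<String> keywords = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String token : question.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (containsCjk(token) && token.length() > CJK_NGRAM_LENGTH) {
                for (int i = 0; i <= token.length() - CJK_NGRAM_LENGTH; i++) {
                    keywords.add(token.substring(i, i + CJK_NGRAM_LENGTH));
                }
            } else {
                keywords.add(token); // e.g. "HR-2024-003" stays whole
            }
        }
        return keywords;
    }
}
```

A 7-character Chinese token produces 4 windows, while a code such as "HR-2024-003" is passed through intact so the literal match still works.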

After merging the two result sets, each path has its own score: vector results carry a cosine score, keyword results get a fixed weak score (0.2). Fusion sorting takes the top K by total score descending.
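
The fusion step described above can be sketched in a few lines. The assumptions here are that a chunk hit by both paths has its two scores added (one reading of "total score"), that keyword hit ids are distinct, and that ties are broken by chunk id; `FusionSketch` and `fuse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FusionSketch {
    static final double KEYWORD_SCORE = 0.2; // fixed weak score for a keyword hit

    /** Returns the ids of the topK chunks ordered by summed score, highest first. */
    static List<Long> fuse(Map<Long, Double> vectorScores, List<Long> keywordHitIds, int topK) {
        Map<Long, Double> total = new HashMap<>(vectorScores);
        for (Long chunkId : keywordHitIds) {
            total.merge(chunkId, KEYWORD_SCORE, Double::sum); // add to the cosine score if present
        }
        List<Map.Entry<Long, Double>> ranked = new ArrayList<>(total.entrySet());
        ranked.sort(Map.Entry.<Long, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<Long, Double>comparingByKey()));
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : ranked.subList(0, Math.min(topK, ranked.size()))) {
            ids.add(entry.getKey());
        }
        return ids;
    }
}
```

With vector scores {1: 0.80, 2: 0.65, 3: 0.70} and keyword hits on chunks 2 and 4, chunk 2 rises to about 0.85 and moves ahead of chunk 1, while chunk 4 enters the candidate set with only 0.2.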

If Rerank is enabled (RAG_RERANK_ENABLED=true), the fused candidate set is sent to the Reranker for refinement. The Reranker is a Cross-Encoder that scores the question and each candidate chunk together, with accuracy significantly higher than vector similarity. But it has one risk—what if the network fails?

if (Boolean.TRUE.equals(options.rerankEnabled()) && rerankService.available()) {
    try {
        return rerank(question, candidates, options);      // Refined ranking
    } catch (RuntimeException ex) {
        log.warn("Rerank failed, fallback to fused retrieval score");  // Auto-degrade to fusion score
    }
}
return sortByFusedScore(candidates, options);  // Degradation path

Key design: Rerank is an enhancement path, not the main pipeline. If it fails, the system falls back to fusion sorting and continues working, without affecting Q&A availability.

5. Neighbor Chunk Completion and Context Assembly

A piece of knowledge likely spans two chunks. For example, "The API key acquisition method is as follows:" is in the 5th chunk, and the actual steps are in the 6th chunk. If only the 5th is fed into the context, the LLM can only fabricate an answer.

So after retrieval results come out, for each hit chunk, expand N neighbors forward and backward (default 1 each):

int startIndex = Math.max(0, chunk.getChunkIndex() - options.neighborBefore());
int endIndex = chunk.getChunkIndex() + options.neighborAfter();
List<RagChunk> neighborChunks = chunkRepository
    .findByDocumentIdAndChunkIndexBetweenOrderByChunkIndexAsc(
        chunk.getDocument().getId(), startIndex, endIndex);

The completed chunk list is then truncated by context-max-chars (default 8000)—first come, first served; excess is discarded. This prevents the context from exceeding the LLM window while ensuring the most relevant chunks are definitely at the front.
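
The truncation rule can be sketched as a running character budget. This is a hedged illustration: `ContextBudgetSketch` and `fitToBudget` are hypothetical names, and it assumes whole chunks are dropped (never cut mid-chunk) and that filling stops at the first chunk that does not fit.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextBudgetSketch {
    /** Keeps chunks in ranked order until the next one would exceed maxChars. */
    static List<String> fitToBudget(List<String> rankedChunks, int maxChars) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String chunk : rankedChunks) {
            if (used + chunk.length() > maxChars) {
                break; // first overflow: this chunk and everything after it is discarded
            }
            kept.add(chunk);
            used += chunk.length();
        }
        return kept;
    }
}
```

Because the input is already sorted by relevance, whatever gets cut is always the lowest-ranked material.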

The context assembly format is like this:

Citation 1, Document: "Employee Handbook v3", Fragment: 5
(Full content of the 5th chunk)

Citation 2, Document: "Employee Handbook v3", Fragment: 6
(Full content of the 6th chunk)

The prompt given to the LLM explicitly tells it "how to annotate citation sources," so the answer can include citation numbers.
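
Rendering that context format is plain string assembly, roughly as below. The `Passage` type and `assemble` method are hypothetical stand-ins (the project's own entity is `RagChunk`), and the exact separators are inferred from the example above rather than taken from the source code.

```java
import java.util.List;

final class ContextAssemblySketch {
    /** Hypothetical value type holding what one citation block needs. */
    static final class Passage {
        final String documentTitle;
        final int chunkIndex;
        final String text;

        Passage(String documentTitle, int chunkIndex, String text) {
            this.documentTitle = documentTitle;
            this.chunkIndex = chunkIndex;
            this.text = text;
        }
    }

    /** Numbers passages from 1 and separates citation blocks with a blank line. */
    static String assemble(List<Passage> passages) {
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < passages.size(); i++) {
            Passage p = passages.get(i);
            if (i > 0) {
                context.append("\n\n");
            }
            context.append("Citation ").append(i + 1)
                    .append(", Document: \"").append(p.documentTitle)
                    .append("\", Fragment: ").append(p.chunkIndex)
                    .append('\n').append(p.text);
        }
        return context.toString();
    }
}
```

The citation number is the position in the assembled context, which is the same number the LLM is asked to echo in its answer.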

6. Q&A Orchestration: Two Modes + Audit Log

ChatService provides three invocation methods:

Every Q&A session records an audit log (RagQaLog): question, answer, retrievedChunkIds, modelName, latencyMs. Doesn't record IP, doesn't record user—the first version for SMEs doesn't need these things.

Configuration Design: Global Defaults + Knowledge Base Override + Request-Level Override

Retrieval parameters are not hardcoded. Each knowledge base can have its own default parameters, and each request can temporarily override them. The priority is:

Request Parameters > Knowledge Base Config > application.yml Global Defaults

# application.yml: Global Defaults
rag:
  retrieval:
    final-top-k: 5
    vector-top-k: 20
    keyword-top-k: 20
    rerank-enabled: false        # Default off, enable after configuring rerank service
    neighbor-enabled: true
    neighbor-before: 1
    neighbor-after: 1
    context-max-chars: 8000

This design changes "parameter tuning" from "change code → redeploy" to "adjust sliders on the frontend Settings page → click save → directly test the effect on the Chat page." The evaluation closed loop is thus made possible—you can run the same evaluation set under different parameter combinations and let the data speak, rather than relying on gut feeling.
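
The three-level priority boils down to "first non-null value wins". A minimal sketch, assuming unset overrides are represented as null; `RetrievalOptionSketch` and `resolve` are hypothetical names, not the project's API:

```java
import java.util.Objects;

final class RetrievalOptionSketch {
    /** Returns the highest-priority value that is set: request, then knowledge base, then global default. */
    static <T> T resolve(T requestValue, T knowledgeBaseValue, T globalDefault) {
        if (requestValue != null) {
            return requestValue;
        }
        if (knowledgeBaseValue != null) {
            return knowledgeBaseValue;
        }
        // The global default comes from application.yml and is expected to always exist.
        return Objects.requireNonNull(globalDefault, "global default must be configured");
    }
}
```

For example, `resolve(null, 8, 5)` returns 8 for `final-top-k` when the knowledge base overrides the global default and the request leaves it unset.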

Frontend: Not Just a "Shell"

The frontend uses Vue 3 + Naive UI + Pinia. After Vite builds, it's directly placed under src/main/resources/static/. A single Spring Boot JAR serves both frontend static resources and backend APIs. No need to deploy Nginx separately to host the frontend.

A few points worth mentioning:

  1. ConsoleLayout is a single-page layout, not multi-page navigation. Switching the left menu only changes the content in router-view; the knowledge base selection and health status are always visible.
  2. The ChunkInspector page can view all chunk content and index positions by document—if something goes wrong during the indexing phase, you don't guess, you look directly.
  3. The Settings page turns all retrieval parameters into visual controls (sliders, switches, dropdowns). Parameter changes take effect immediately, no restart needed.

Current Status and Next Steps

The main pipeline is fully operational:

The Q&A page and evaluation page still need product-level rework—the current features can run, but the interaction experience and data visualization are only at a "usable" level. The evaluation page needs more visual presentations (like nDCG curves, parameter comparison views), and the Q&A page needs better citation display and follow-up question experience.

The subsequent 11 articles in the series will unfold one by one: deployment, document ingestion, chunking strategy, Embedding technology selection, vector database practice, hybrid retrieval details, Rerank deployment and tuning, context assembly techniques, Q&A product design, evaluation system construction, and finally, an engineering retrospective.

Final Words

The core point this article wants to convey is really just one thing: The engineering complexity of enterprise RAG lies not in the LLM, but in retrieval. The large model API is the last mile of the RAG pipeline, but you have to get the first ninety-nine miles running smoothly for that last mile to be meaningful. The first ninety-nine miles include: how documents come in, how they are split, how they are retrieved, how retrieval results are refined and completed, and how the entire pipeline is observed and optimized.

This project is prepared for SMEs to "use the right architecture from day one"—not a Demo, but an engineering baseline from 0 to 1. You can swap models, vector databases, or frontend frameworks on this baseline, but the pipeline skeleton and module splitting logic are reusable.