Enterprise RAG Is a Retrieval Problem, Not an LLM Problem

By 一只牛博 · Jul 2, 2026

RAG systems that work in a demo collapse under real documents, Chinese text, and production traffic. This architecture isolates the retrieval pipeline as the thing to get right, with degradation paths and observability built in from day one, so a small team can ship something that doesn't silently fail.

Summary

Most RAG demos stop at vector database plus LLM API. A real enterprise knowledge-base system lives or dies on the retrieval pipeline: document parsing, chunking strategy, hybrid search, rerank fallback, and context assembly. This architecture walks through eight backend modules and seven frontend views that turn those hidden engineering problems into observable, tunable components.

The retrieval layer combines dense vector recall with Chinese-aware keyword extraction and n-gram tokenization, then fuses scores before an optional Cross-Encoder rerank step. If the reranker goes down, the system degrades to fusion sorting without breaking the Q&A flow. Every chunk hit also pulls in neighboring chunks so cross-boundary knowledge isn't lost before context truncation.

A parameter priority system (request > knowledge base > global defaults) plus a built-in evaluation framework means tuning is done with sliders and scored against a test suite, not by gut feeling. The whole stack ships as one Spring Boot JAR with six Dockerized infrastructure services, all configured through a single compose file with non-standard ports and Nginx-auth-wrapped model containers.

Takeaways

— Document ingestion runs as an async pipeline with five happy-path states (UPLOADED → PARSING → CHUNKING → INDEXING → INDEXED) plus a FAILED terminal state, with a status write at every transition so the frontend always knows where a document is stuck.

— Chunking uses paragraph-first splitting: paragraphs are the minimum unit, and only paragraphs exceeding 1000 characters get sliding-window cuts with 150-character overlap.

— Chinese keyword extraction applies 4-character n-gram tokenization because Chinese has no whitespace word boundaries; without it, the full-text index misses queries such as "核心存储有哪些" ("what are the core storage components?").

— Hybrid retrieval fuses vector cosine scores with a fixed weak score (0.2) for keyword hits, then sorts by total score before an optional Cross-Encoder rerank step.

— Rerank is an enhancement path, not the main pipeline; if the rerank service fails, the system falls back to fusion sorting automatically without affecting Q&A availability.

— Each retrieved chunk pulls in neighboring chunks (default ±1) before context assembly, so knowledge split across chunk boundaries isn't lost.

— Context truncation is first-come-first-served over the ranked chunks, up to 8000 characters, so the most relevant chunks go in first and the LLM window isn't exceeded.

— Retrieval parameters follow a three-tier override chain: request parameters beat knowledge-base config, which beats application.yml global defaults.

— All six infrastructure services (MySQL, Qdrant, MinIO, Embedding model, Reranker model, Nginx proxy) run from a single docker compose file with non-standard ports and Bearer-token auth on model containers.

— The frontend ships inside the Spring Boot JAR under static/ resources, with a single-page ConsoleLayout where knowledge-base selection and health status persist across all views.
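
The context-truncation takeaway above (first-come-first-served, capped at 8000 characters) can be sketched in a few lines of Python. This is only an illustration: the function name, the separator, and the choice to stop at the first chunk that does not fit (rather than cutting it mid-text) are assumptions, not details confirmed by the original article.

```python
def assemble_context(ranked_chunks, max_chars=8000, separator="\n\n"):
    """Join chunks in ranked order until the character budget is used up.

    ranked_chunks: chunk texts, most relevant first.
    Assumption: whole chunks only; stop at the first chunk that does not fit.
    """
    parts = []
    used = 0
    for chunk in ranked_chunks:
        # The separator only costs characters between chunks, not before the first.
        cost = len(chunk) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(chunk)
        used += cost
    return separator.join(parts)
```

Because the input is already sorted by relevance, stopping at the first overflow keeps the highest-ranked material and never lets a low-ranked chunk displace it.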

Conclusions

Enterprise RAG complexity concentrates in retrieval, not generation. The LLM API is the last mile; the first ninety-nine miles are document parsing, chunking, hybrid search, rerank, and context assembly.

Paragraph-first chunking preserves semantic boundaries that fixed-length splitting destroys. A chunk that mixes the tail of one paragraph with the head of another matches neither topic well in vector search.

Chinese text breaks naive keyword extraction. Without n-gram tokenization, full-text indexes miss internal document codes and compound phrases entirely, making hybrid search mandatory rather than optional.

Treating rerank as an enhancement path rather than a required step is a production-hardening move that most demos skip. A crashed reranker shouldn't take down Q&A.

Neighbor chunk completion is a cheap fix for the boundary problem that doesn't require smarter chunking. Pulling ±1 chunks around each hit recovers cross-boundary knowledge without complex overlap strategies.

Real-time state writes during async indexing cost one extra UPDATE per transition but eliminate the support burden of "is my document done yet?" questions.

Parameter priority chains turn tuning from a redeploy cycle into a slider-and-save workflow, which makes A/B testing retrieval configs against an evaluation set practical for a small team.
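
As a rough illustration of the three-tier override chain (request > knowledge base > global defaults), the resolution step might look like the Python sketch below. The parameter names and the `top_k` default are hypothetical; the 0.2 keyword score, the ±1 neighbor window, and the 8000-character limit are the values quoted in this summary. In the real system the global tier would come from application.yml rather than a dict.

```python
# Hypothetical names; stands in for the application.yml global defaults.
GLOBAL_DEFAULTS = {
    "top_k": 5,                 # assumed value, not from the article
    "keyword_score": 0.2,
    "neighbor_window": 1,
    "max_context_chars": 8000,
}


def resolve_params(request=None, knowledge_base=None, defaults=GLOBAL_DEFAULTS):
    """Merge the three tiers; later layers win, None means 'not set'."""
    resolved = dict(defaults)  # copy so the defaults are never mutated
    for layer in (knowledge_base or {}, request or {}):
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


params = resolve_params(
    request={"top_k": 10},
    knowledge_base={"top_k": 8, "neighbor_window": 2},
)
# params["top_k"] comes from the request, params["neighbor_window"] from the
# knowledge base, and everything else from the global defaults.
```

Treating `None` as "not set" lets a request omit a field without accidentally erasing the knowledge-base value.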

Shipping the frontend inside the JAR and wrapping model containers behind Nginx auth are productization details that cost little upfront but remove entire classes of deployment and security headaches.

Concepts & terms

Hybrid Retrieval

Combining dense vector search (semantic similarity) with sparse keyword search (exact term matching), then fusing the result sets by score. Vector search misses proper nouns and codes; keyword search misses paraphrases. Together they cover both.
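
A minimal Python sketch of the score fusion, assuming the rule described in this summary: each keyword hit adds a fixed weak score of 0.2 to whatever cosine score the chunk already has, and the merged set is sorted by total. The function name and the dict-based inputs are illustrative, and whether the original implementation adds the two scores exactly this way is an assumption.

```python
KEYWORD_HIT_SCORE = 0.2  # fixed weak score for a keyword hit, per the summary


def fuse(vector_hits, keyword_hit_ids, keyword_score=KEYWORD_HIT_SCORE):
    """Merge vector and keyword results into one ranked list.

    vector_hits: dict of chunk_id -> cosine similarity.
    keyword_hit_ids: iterable of chunk ids matched by the full-text index.
    Returns a list of (chunk_id, total_score), highest score first.
    """
    totals = dict(vector_hits)
    for chunk_id in set(keyword_hit_ids):
        # A chunk found only by keywords starts from 0.0.
        totals[chunk_id] = totals.get(chunk_id, 0.0) + keyword_score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = fuse({"a": 0.82, "b": 0.75, "c": 0.40}, ["b", "d"])
# "b" moves ahead of "a" because it was found by both retrievers;
# "d" enters the list with only the weak keyword score.
```

Keeping the keyword score small means a keyword match can break ties and rescue chunks the embedding missed, without overriding a strong semantic match.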

Cross-Encoder Reranker

A model that takes a query and a candidate document together as input and outputs a relevance score, rather than encoding them separately as a bi-encoder (the embedding model) does. It is much more accurate than a bi-encoder but too slow to run over the full corpus, so it is applied only to the top-K candidates from first-stage retrieval.
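
Because the reranker only reorders candidates that first-stage retrieval has already ranked, it can be wrapped as an optional step. The Python sketch below shows one way to express the fallback to fusion order; `rerank_fn` is a hypothetical callable standing in for the HTTP call to the reranker container, and `top_k=5` is an assumed value.

```python
def rank_candidates(query, fused, rerank_fn=None, top_k=5):
    """Rerank the top-K fused candidates, falling back to fusion order.

    fused: list of (chunk_text, fusion_score), already sorted best first.
    rerank_fn: hypothetical callable (query, texts) -> list of scores; may raise.
    Returns (ranked_list, mode) where mode is "rerank" or "fusion".
    """
    candidates = fused[:top_k]
    if rerank_fn is None:
        return candidates, "fusion"
    texts = [text for text, _ in candidates]
    try:
        scores = rerank_fn(query, texts)
        if len(scores) != len(texts):
            raise ValueError("reranker returned a wrong number of scores")
        reranked = sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)
        return reranked, "rerank"
    except Exception:
        # Any reranker failure degrades to fusion order; a real service
        # would also log the error and emit a metric here.
        return candidates, "fusion"
```

Returning the mode alongside the results makes the degradation observable, so a dashboard can show how often answers were produced without the reranker.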

Paragraph-First Chunking

A splitting strategy that treats paragraphs as atomic units. Chunks are built by appending whole paragraphs until a character limit is reached; only a single paragraph longer than the limit gets sliding-window cuts. This preserves the natural semantic boundaries that documents already have.
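
The strategy can be sketched in Python as follows, using the 1000-character limit and 150-character overlap quoted in this summary. Treating a blank line as the paragraph separator is an assumption, as are the function names; the original implementation may detect paragraphs differently.

```python
def split_long_paragraph(text, limit=1000, overlap=150):
    """Sliding-window cut for a single paragraph longer than `limit`."""
    step = limit - overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + limit])
        if start + limit >= len(text):
            break  # this window already reached the end of the paragraph
        start += step
    return pieces


def chunk_paragraphs(document, limit=1000, overlap=150):
    """Pack whole paragraphs into chunks; only oversized paragraphs are cut."""
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        if len(para) > limit:
            if current:  # flush what has been packed so far
                chunks.append(current)
                current = ""
            chunks.extend(split_long_paragraph(para, limit, overlap))
        elif current and len(current) + 2 + len(para) > limit:
            chunks.append(current)  # adding this paragraph would overflow
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Overlap is applied only inside an oversized paragraph; chunks built from whole paragraphs share no text, which is why the neighbor-chunk step matters later in the pipeline.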

N-gram Tokenization for Chinese

Since Chinese text has no spaces between words, a full-text index needs help. Sliding a window of N characters (e.g., 4) across the text generates overlapping substrings that act as indexable tokens, so phrases like "核心存储" can be matched even without a dedicated Chinese word segmenter.
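
A character n-gram tokenizer is only a few lines of Python. The sketch below assumes a fixed window of 4 and simply drops whitespace; how the original system handles punctuation or shorter windows is not stated, so treat those details as illustrative.

```python
def char_ngrams(text, n=4):
    """Slide a window of n characters across the text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) <= n:
        # Text shorter than the window becomes a single token (or none).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


tokens = char_ngrams("核心存储有哪些")
# tokens == ["核心存储", "心存储有", "存储有哪", "储有哪些"]
```

The query now shares the token "核心存储" with any document that contains that phrase, so the full-text index can match it without a word segmenter.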

Neighbor Chunk Completion

After retrieval identifies relevant chunks, the system also fetches the chunks immediately before and after each hit in the original document. This recovers context that spans chunk boundaries without requiring smarter splitting or larger chunk sizes.
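
In terms of chunk positions, the expansion can be sketched like this in Python. The sketch assumes all hits come from one document whose chunks are numbered from 0; a real system would key the lookup by document ID as well. The function name and return shape are illustrative.

```python
def expand_with_neighbors(hit_indexes, total_chunks, window=1):
    """Expand each hit to include its neighbors, without duplicates.

    hit_indexes: chunk positions in relevance order (best first).
    Returns one group per hit, each group in document order, so the
    best hit's surrounding text still comes first.
    """
    seen = set()
    groups = []
    for hit in hit_indexes:
        low = max(0, hit - window)                  # clamp at document start
        high = min(total_chunks - 1, hit + window)  # clamp at document end
        group = [i for i in range(low, high + 1) if i not in seen]
        seen.update(group)
        if group:
            groups.append(group)
    return groups


groups = expand_with_neighbors([5, 6, 0], total_chunks=10)
# groups == [[4, 5, 6], [7], [0, 1]]
```

Deduplicating across groups matters when two hits are adjacent: without it, the same chunk would be sent to the LLM twice and waste context budget.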

Async Indexing State Machine

Document processing runs through discrete states (UPLOADED, PARSING, CHUNKING, INDEXING, INDEXED, FAILED) on a background thread pool. Each state transition writes to the database so the UI can show progress without polling the file system or parsing service.
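
The state machine can be sketched in Python as below. The state names come from this summary; `workers` and `write_status` are hypothetical stand-ins for the parsing/chunking/indexing code and the database UPDATE. The sketch runs synchronously for clarity, whereas the described system would submit the function to a background thread pool.

```python
from enum import Enum


class DocState(Enum):
    UPLOADED = "UPLOADED"
    PARSING = "PARSING"
    CHUNKING = "CHUNKING"
    INDEXING = "INDEXING"
    INDEXED = "INDEXED"
    FAILED = "FAILED"


# States in which work happens, in order.
PIPELINE = [DocState.PARSING, DocState.CHUNKING, DocState.INDEXING]


def run_pipeline(doc_id, workers, write_status):
    """Drive one document through the pipeline, persisting every transition.

    workers: dict of DocState -> zero-argument callable doing that stage's work.
    write_status: callable (doc_id, state), e.g. a single-row UPDATE.
    """
    write_status(doc_id, DocState.UPLOADED)
    try:
        for state in PIPELINE:
            write_status(doc_id, state)  # persist before the work starts
            workers[state]()
        write_status(doc_id, DocState.INDEXED)
        return DocState.INDEXED
    except Exception:
        # The last persisted state shows which stage failed.
        write_status(doc_id, DocState.FAILED)
        return DocState.FAILED
```

Writing the state before each stage begins means a crash mid-stage leaves the row showing the stage that was in progress, which is what lets the UI report where a document is stuck.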

Source: juejin.cn