Enterprise RAG Is a Retrieval Problem, Not an LLM Problem
RAG systems that work in a demo collapse under real documents, Chinese text, and production traffic. This architecture isolates the retrieval pipeline as the thing to get right, with degradation paths and observability built in from day one, so a small team can ship something that doesn't silently fail.
Most RAG demos stop at a vector database plus an LLM API. A real enterprise knowledge-base system lives or dies on the retrieval pipeline: document parsing, chunking strategy, hybrid search, rerank fallback, and context assembly. The architecture described here comprises eight backend modules and seven frontend views that turn those hidden engineering problems into observable, tunable components.
The retrieval layer combines dense vector recall with Chinese-aware keyword extraction and n-gram tokenization, then fuses scores before an optional Cross-Encoder rerank step. If the reranker goes down, the system degrades to fusion sorting without breaking the Q&A flow. Every chunk hit also pulls in neighboring chunks so cross-boundary knowledge isn't lost before context truncation.
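The source does not say how the two score lists are fused, so the following is only a minimal sketch of one common approach: min-max normalize each list, then take a weighted sum. The function names, the 0.7 vector weight, and the chunk ids are illustrative assumptions, not the system's actual values.

```python
def min_max(scores):
    """Scale a {chunk_id: score} map into [0, 1]; a constant map becomes all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def fuse(vector_hits, keyword_hits, vector_weight=0.7):
    """Weighted sum of normalized scores; a chunk missing from one list scores 0 there.

    vector_weight is a hypothetical tuning parameter, not a value from the source.
    """
    v, k = min_max(vector_hits), min_max(keyword_hits)
    fused = {
        cid: vector_weight * v.get(cid, 0.0) + (1 - vector_weight) * k.get(cid, 0.0)
        for cid in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Cosine-style similarities and keyword scores live on different scales,
# which is why each list is normalized before the two are combined.
vector_hits = {"c1": 0.82, "c2": 0.40, "c3": 0.61}
keyword_hits = {"c2": 7.5, "c4": 3.0}
ranked = fuse(vector_hits, keyword_hits)
ranked_ids = [cid for cid, _ in ranked]
```

Normalizing first keeps a large raw keyword score from drowning out the vector side; the fused order is also what the system would fall back to when the reranker is unavailable.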
With a parameter priority system (request > knowledge base > global defaults) and a built-in evaluation framework, tuning is done with sliders and scored against a test suite, not by gut feeling. The whole stack ships as one Spring Boot JAR with six Dockerized infrastructure services, all configured through a single compose file that uses non-standard ports and puts the model containers behind Nginx authentication.
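The priority chain (request > knowledge base > global defaults) can be sketched as a layered merge. This is an illustration under assumptions: the parameter names (top_k, rerank, vector_weight) are hypothetical, and treating None as "not set at this level" is a design choice made for the sketch.

```python
def resolve_params(request, knowledge_base, global_defaults):
    """Merge three parameter layers; later layers win, and None means 'not set here'."""
    resolved = dict(global_defaults)
    for layer in (knowledge_base, request):  # lowest to highest priority
        resolved.update({key: value for key, value in layer.items() if value is not None})
    return resolved


global_defaults = {"top_k": 5, "rerank": True, "vector_weight": 0.7}
knowledge_base = {"top_k": 8, "rerank": None}   # this knowledge base overrides only top_k
request = {"vector_weight": 0.5}                # this request overrides only vector_weight

params = resolve_params(request, knowledge_base, global_defaults)
```

Because every level is just a partial map, saving a slider value at the knowledge-base level changes behavior on the next request without a redeploy.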
Enterprise RAG complexity concentrates in retrieval, not generation. The LLM API is the last mile; the first ninety-nine miles are document parsing, chunking, hybrid search, rerank, and context assembly.
Paragraph-first chunking preserves semantic boundaries that fixed-length splitting destroys. A chunk that mixes the tail of one paragraph with the head of another matches neither topic well during vector search.
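As a rough illustration of paragraph-first chunking, the sketch below packs whole paragraphs into a chunk until a size budget is reached and falls back to fixed-length slices only when a single paragraph exceeds the budget. The blank-line paragraph delimiter, the character-based budget, and the function name are assumptions; the source does not describe the system's actual rules.

```python
import re


def chunk_paragraph_first(text, max_chars=200):
    """Pack whole paragraphs into chunks; split a paragraph only if it alone exceeds max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(para) > max_chars:
            # Oversized paragraph: flush the pending chunk, then slice at fixed length.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate      # paragraph still fits in the current chunk
        else:
            chunks.append(current)   # close the chunk on a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 50
sample_chunks = chunk_paragraph_first(sample, max_chars=70)
```

In the sample, the first two paragraphs share a chunk and the third starts a new one, so no chunk ever ends in the middle of a paragraph that could have fit whole.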
Chinese text breaks naive keyword extraction. Without n-gram tokenization, full-text indexes miss internal document codes and compound phrases entirely, making hybrid search mandatory rather than optional.
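To make the n-gram idea concrete, here is a minimal sketch that emits overlapping character bigrams for runs of Chinese characters and keeps ASCII codes intact as single tokens. The regular expression, the choice of n = 2, and the sample document code HR-2024-001 are assumptions for illustration; a production system would more likely rely on the n-gram support of its search engine.

```python
import re

# A run of CJK Unified Ideographs, or an ASCII token that may contain '-' or '_'
# (for example an internal document code).
TOKEN = re.compile(r"(?P<cjk>[\u4e00-\u9fff]+)|(?P<ascii>[A-Za-z0-9][A-Za-z0-9_-]*)")


def ngram_tokens(text, n=2):
    """Return overlapping n-grams for Chinese runs and lowercased whole ASCII tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        run = match.group("cjk")
        if run is None:
            tokens.append(match.group("ascii").lower())
        elif len(run) < n:
            tokens.append(run)  # run too short for an n-gram: keep it whole
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


# "报销流程" means "reimbursement process"; HR-2024-001 is a made-up document code.
tokens = ngram_tokens("报销流程 HR-2024-001")
```

Because Chinese has no whitespace between words, the overlapping bigrams let a query for either "报销" or "流程" match the indexed text without a dictionary-based segmenter.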
Treating rerank as an enhancement path rather than a required step is a production-hardening move that most demos skip. A crashed reranker shouldn't take down Q&A.
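One way to express "rerank is an enhancement path" in code is to wrap the reranker call and return the fusion order whenever it fails. This is a sketch under assumptions: the reranker is modeled as a plain callable returning one score per passage, and the function and field names are hypothetical.

```python
import logging


def rank_with_fallback(query, fused_hits, reranker=None, top_k=3):
    """fused_hits: list of (chunk_id, text, fused_score). Returns (chunk_ids, strategy)."""
    by_fusion = sorted(fused_hits, key=lambda hit: hit[2], reverse=True)
    if reranker is not None:
        try:
            scores = reranker(query, [text for _, text, _ in by_fusion])
            if len(scores) != len(by_fusion):
                raise ValueError("reranker returned a wrong number of scores")
            reranked = sorted(zip(by_fusion, scores), key=lambda pair: pair[1], reverse=True)
            return [hit[0] for hit, _ in reranked][:top_k], "rerank"
        except Exception as exc:
            # Any reranker failure degrades to fusion order instead of failing the request.
            logging.warning("rerank unavailable, using fusion order: %s", exc)
    return [hit[0] for hit in by_fusion][:top_k], "fusion"


def broken_reranker(query, texts):
    raise ConnectionError("reranker container is down")


def length_reranker(query, texts):
    return [len(text) for text in texts]  # toy scorer standing in for a cross-encoder


hits = [("c1", "short", 0.9), ("c2", "a much longer passage", 0.5), ("c3", "medium text", 0.7)]
degraded = rank_with_fallback("q", hits, reranker=broken_reranker)
enhanced = rank_with_fallback("q", hits, reranker=length_reranker)
```

Returning the strategy alongside the results gives the observability layer something to record, so a silent degradation still shows up in metrics.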
Neighbor chunk completion is a cheap fix for the boundary problem that doesn't require smarter chunking. Pulling ±1 chunks around each hit recovers cross-boundary knowledge without complex overlap strategies.
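Neighbor completion reduces to index arithmetic if chunks are numbered sequentially within a document, which is an assumption of this sketch. The function name and the window parameter are illustrative.

```python
def expand_with_neighbors(hit_indices, chunk_count, window=1):
    """Return sorted, de-duplicated chunk indices covering each hit and its +/- window neighbors."""
    expanded = set()
    for idx in hit_indices:
        start = max(0, idx - window)                 # clamp at the first chunk
        end = min(chunk_count - 1, idx + window)     # clamp at the last chunk
        expanded.update(range(start, end + 1))
    return sorted(expanded)


# Hits on chunks 0, 4 and 5 of a 10-chunk document.
context_indices = expand_with_neighbors([0, 4, 5], chunk_count=10)
```

Using a set merges the overlapping neighborhoods of adjacent hits, so the same chunk is not sent to the context assembler twice.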
Real-time state writes during async indexing cost one extra UPDATE per transition but eliminate the support burden of 'is my document done yet?' questions.
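The "one extra UPDATE per transition" pattern might look like the sketch below, which uses an in-memory SQLite table as a stand-in for the real database. The table layout and the stage names (PARSING, CHUNKING, EMBEDDING, DONE, FAILED) are assumptions; the source does not list the actual states.

```python
import sqlite3


def set_status(conn, doc_id, status):
    """Persist one state transition and commit at once so status polls see it."""
    with conn:  # the connection context manager commits on success
        conn.execute("UPDATE documents SET status = ? WHERE id = ?", (status, doc_id))


def index_document(conn, doc_id, steps):
    """Run (stage_name, callable) steps in order, writing the stage before each one starts."""
    try:
        for stage, step in steps:
            set_status(conn, doc_id, stage)
            step()
        set_status(conn, doc_id, "DONE")
    except Exception:
        set_status(conn, doc_id, "FAILED")
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'PENDING')")
conn.commit()

seen = []  # what a status poll would have returned while each step was running


def probe():
    seen.append(conn.execute("SELECT status FROM documents WHERE id = 1").fetchone()[0])


index_document(conn, 1, [("PARSING", probe), ("CHUNKING", probe), ("EMBEDDING", probe)])
final_status = conn.execute("SELECT status FROM documents WHERE id = 1").fetchone()[0]
```

Writing the stage before the step runs means a document that hangs or fails is visibly stuck at a named stage rather than sitting in an ambiguous "processing" state.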
Parameter priority chains turn tuning from a redeploy cycle into a slider-and-save workflow, which makes A/B testing retrieval configs against an evaluation set practical for a small team.
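Comparing two retrieval configs against an evaluation set can be as simple as the sketch below. Hit rate at k is one possible metric and is chosen here only for illustration; the source does not say which metrics its evaluation framework computes. The retrievers are stubbed with fixed results and all ids are made up.

```python
def hit_rate(retrieve, test_cases, k=3):
    """Fraction of test cases whose expected chunk id appears in the top-k results."""
    if not test_cases:
        return 0.0
    hits = sum(1 for question, expected in test_cases if expected in retrieve(question)[:k])
    return hits / len(test_cases)


# Each test case pairs a question with the chunk id that should be retrieved.
test_cases = [("q1", "c1"), ("q2", "c5"), ("q3", "c9"), ("q4", "c2")]

# Stub retrievers standing in for the pipeline run under two parameter sets.
results_a = {"q1": ["c1", "c2", "c3"], "q2": ["c4", "c5", "c6"],
             "q3": ["c7", "c8", "c3"], "q4": ["c3", "c1", "c4"]}
results_b = {"q1": ["c2", "c1", "c3"], "q2": ["c5", "c4", "c6"],
             "q3": ["c9", "c8", "c7"], "q4": ["c3", "c1", "c4"]}

score_a = hit_rate(results_a.get, test_cases)
score_b = hit_rate(results_b.get, test_cases)
```

With a fixed test set, each saved parameter change produces a comparable number, which is what turns slider adjustments into an A/B comparison rather than a guess.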
Shipping the frontend inside the JAR and wrapping model containers behind Nginx auth are productization details that cost little upfront but remove entire classes of deployment and security headaches.