A 270-Line Node Server That Reads Your Private Codebase, Built in an Afternoon
2. The Reason: Building Pages All Day, Soon to Be Replaced?
Earlier this year, Cursor could already write components and GitHub Copilot could already autocomplete code. In a meeting, one of our company's leaders asked: "Can our GitLab MRs be automatically reviewed by AI?"
After the meeting, I sat at my desk and thought about two things:
- No matter how well you build pages, AI will eventually eat the growth in that market.
- But the last mile of putting AI into production (rendering UI, typewriter effects, streaming, hooking into DevOps pipelines) is all frontend work.
So I decided: instead of chasing the next frontend framework, I would use the tool I know best, JavaScript, to build the team an AI assistant that can read the private documentation of our four projects.
2. See the Effect First (Simpler Than You Think)
One afternoon, npm install + a few lines of code, and the result:
📚 Local Knowledge Base Mode:
👨💻 Q: What is the difference between the E1 and E2 APIs?
🤖 A (📚 Local Docs · 4 snippets):
| Dimension | E1 (legacy) | E2 (new) |
|---|---|---|
| Prefix | /api/e1 | /api/e2 |
| Success Indicator | status: 0 | code: 200 |
| Ajax Factory | Ajax() | createE2Ajax() |
This is a real convention from my project's AGENTS.md, something ChatGPT could never answer.
🌐 Web Search Mode (auto-switches when local docs miss):
👨💻 Q: What's the weather like in Shenzhen today?
🤖 A (🌐 Web Search · 4 results): Shenzhen today is cloudy turning sunny, temperature 28-34°C...
This step integrates the Bocha AI search engine and automatically falls back to the web when the local hits are insufficient.
One page of code just runs, with no dependency on any external database, vector service, or Python environment.
3. Minimalist Architecture: If It Can Be One File, Don't Open Another Port
Question →
│
├─ Identify if it's a local question? (Contains pdk/core/equipment/workflow keywords?)
│ ├─ Yes → RAG retrieves local docs/*.md → LLM judges if snippets are relevant?
│ │ ├─ Relevant → Answer using local docs
│ │ └─ Not relevant → Web search
│ └─ No → Web search
│
└─ Stream output + Typewriter cursor + Markdown rendering
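The routing decision in the diagram can be sketched in a few lines. This is a minimal illustration, not the actual server.js code: the keyword list comes from the diagram, while `isLocalQuestion`, `pickSource`, and the `judge` callback are hypothetical names standing in for the retrieval and LLM relevance check.

```javascript
// Keywords taken from the flow diagram; everything else here is illustrative.
const LOCAL_KEYWORDS = ['pdk', 'core', 'equipment', 'workflow'];

// Word boundaries keep "core" from matching inside words like "score".
const LOCAL_PATTERN = new RegExp(`\\b(${LOCAL_KEYWORDS.join('|')})\\b`, 'i');

function isLocalQuestion(question) {
  return LOCAL_PATTERN.test(question);
}

// judge: hypothetical async callback (question) => boolean that wraps
// "retrieve local snippets, then ask the LLM whether they are relevant".
async function pickSource(question, judge) {
  if (!isLocalQuestion(question)) return 'web';
  return (await judge(question)) ? 'local' : 'web';
}
```

The keyword check is only a cheap pre-filter: a question that mentions a project name can still end up on the web path if the judge rejects the retrieved snippets.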
Five layers all run in a single server.js (~270 lines):
| Layer | Tech | Role |
|---|---|---|
| Model | DeepSeek (via LangChain) | Understands questions, generates answers |
| Vectorization | transformers.js + BGE-small-zh | Converts docs to vectors locally, no API calls, no cost |
| RAG | MemoryVectorStore | In-memory vector store, retrieves the most relevant doc snippets |
| Web | Bocha Web Search API | Auto web search when local docs miss |
| Delivery | Node native http + stream chunked | Typewriter effect + Markdown rendering |
No Chroma/Pinecone, no Docker, no Python virtual environment.
4. Core Code Breakdown (Just the Essentials)
4.1 Local Embedding: $0, 0 API, Pure JS
This is the most hardcore and satisfying part of the whole system. Instead of the OpenAI Embedding API, I used an npm package called @huggingface/transformers to run the BGE-small-zh model directly in the local Node process:
import { pipeline } from '@huggingface/transformers';
const pipe = await pipeline('feature-extraction', 'Xenova/bge-small-zh-v1.5', { dtype: 'fp32' });
const vec = await pipe('The difference between E1 and E2 APIs', { pooling: 'mean', normalize: true });
// vec.data = Float32Array[512] ← This text became a 512-dimensional vector (bge-small-zh-v1.5 outputs 512 dimensions)
Then wrapped as a LangChain Embeddings interface:
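class LocalEmbeddings {
  // Uses the module-level `pipe` created above, with the same pooling options
  async embedQuery(text) {
    const out = await pipe(text, { pooling: 'mean', normalize: true });
    return Array.from(out.data);
  }
  async embedDocuments(texts) { return Promise.all(texts.map(t => this.embedQuery(t))); }
}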
The first run downloads the model (~100MB); after that it loads from the local cache and starts instantly. A hosted embedding API such as OpenAI's bills per token on every call, whereas this approach costs nothing per call.
⚠️ Pitfall: huggingface.co is blocked in China. I solved this with the hf-mirror.com mirror:
import { env } from '@huggingface/transformers';
env.remoteHost = 'https://hf-mirror.com'; // set this before calling pipeline()
4.2 Complete RAG Flow (Read Docs → Chunk → Vectorize → Retrieve → Feed LLM)
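// Imports (LangChain import paths vary by version; adjust to the one you have installed)
import { readdirSync, readFileSync } from 'node:fs';
import { extname, join } from 'node:path';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { Document } from '@langchain/core/documents';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
// 1. Read all .md files under the docs/ directory
const files = readdirSync('./docs').filter(f => extname(f) === '.md');
const raw = files.map(f => readFileSync(join('./docs', f), 'utf-8')).join('\n\n');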
// 2. Chunk (500 chars per chunk, 80 char overlap)
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 80 });
const docs = await splitter.splitDocuments([new Document({ pageContent: raw })]);
// 3. Vectorize + store in memory
const store = await MemoryVectorStore.fromDocuments(docs, new LocalEmbeddings());
// 4. Retrieve the 4 most relevant chunks when asking a question
const hits = await store.similaritySearchWithScore(question, 4);
// 5. Feed to DeepSeek
const context = hits.map(([doc]) => doc.pageContent).join('\n\n');
const answer = await model.invoke([
{ role: 'system', content: `Answer only using the following documents:\n\n${context}` },
{ role: 'user', content: question }
]);
Once these five steps run through, your AI assistant can answer questions using your private documentation. The knowledge base currently holds the AGENTS.md files from our team's four projects: pdk, core, equipment, and workflow, totaling about 30KB of documents.
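To make the "500 chars per chunk, 80 char overlap" step concrete, here is a simplified sliding-window chunker. It is a stand-in for illustration only: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and lines, which this sketch does not attempt. The function name `chunkText` is hypothetical.

```javascript
// Fixed-size sliding window: each chunk repeats the last `overlap`
// characters of the previous one, so a sentence cut at a boundary
// still appears whole in at least one chunk.
function chunkText(text, size = 500, overlap = 80) {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const chunks = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // the last window reached the end
  }
  return chunks;
}
```

With the defaults, the window advances 420 characters at a time, so roughly 30KB of documents produces on the order of 70 chunks to embed.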
4.3 Dual-Path Retrieval: Local First, LLM Judgment as Fallback, Web Search Only When Unreliable
Initially, I took a shortcut and used a vector similarity threshold: topScore > 0.5 counted as a hit. But a question like "today's international crude oil price," which has nothing to do with code, still got a similarity score of 0.4... from BGE-small, too close to the cutoff for a fixed threshold to separate hits from misses reliably.
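For reference, the threshold shortcut amounts to roughly the following. This is a sketch under assumptions: the helper names are hypothetical, the 0.5 cutoff is the one quoted above, and cosine similarity is used because, as far as I know, it is what MemoryVectorStore scores with by default.

```javascript
// Cosine similarity between two equal-length vectors.
// For vectors that are already normalized, this reduces to the dot product.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The shortcut: treat the question as "local" if the best score clears a fixed cutoff.
function isLocalHit(scores, threshold = 0.5) {
  return Math.max(...scores) > threshold;
}
```

The weakness is that the whole decision rests on one number whose scale depends on the embedding model, not on whether the snippet actually answers the question.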
So I switched to letting the LLM judge relevance itself:
const checkChunks = hits.map(([doc]) => doc.pageContent.slice(0, 250)).join('\n---\n');
const check = await model.invoke([
{ role: 'system', content: 'Strictly judge: Can these snippets practically answer the user\'s question? If unsure, answer "cannot". Only reply can or cannot.' },
{ role: 'user', content: `Question: ${q}\n\nSnippets:\n${checkChunks}` }
]);
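// Note: 'cannot' contains 'can', so a plain includes('can') would always be true
const verdict = String(check.content).trim().toLowerCase();
if (verdict.startsWith('can') && !verdict.startsWith('cannot')) {
  /* Use local docs */
} else {
  /* Use Bocha web search */
}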
This proved far more reliable than a vector threshold: the LLM itself knows that "international crude oil" has nothing to do with the code conventions in AGENTS.md.
4.4 Streaming Output + Typewriter Cursor + Markdown Rendering
Three details that elevate the experience from "usable" to "pleasant":
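// Backend: stream DeepSeek's output to the client as it arrives
const stream = await model.stream(messages);
// Declare the charset so the browser decodes the UTF-8 chunks correctly
res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8', 'Transfer-Encoding': 'chunked' });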
res.write(`SOURCE:${source}|${count}\n`);
for await (const chunk of stream) {
if (chunk.content) res.write(chunk.content);
}
res.end();
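// Frontend: read the response stream and append each chunk as it arrives
const reader = res.body.getReader();
const decoder = new TextDecoder(); // decodes UTF-8 bytes, including characters split across chunks
let raw = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  raw += decoder.decode(value, { stream: true });
  body.textContent = raw; // Characters pop out one by one
}
// marked does not sanitize its output; sanitize first if the content is untrusted
body.innerHTML = marked.parse(raw); // Render Markdown after completion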
Paired with a CSS blinking cursor:
@keyframes blink { 0%,100% { opacity: 1 } 50% { opacity: 0 } }
.streaming::after { content: "|"; animation: blink .8s infinite; color: #7ee787; }
Characters pop out one by one, a green cursor blinks at the end—this is the "human touch" frontend adds to AI.
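The backend excerpt writes a `SOURCE:` line before the answer, so the frontend has to peel it off before displaying the text. One possible way to do that is sketched below; `splitSourceHeader` is a hypothetical helper, and the example value `local` is an assumption about what `source` might contain.

```javascript
// Splits "SOURCE:<source>|<count>\n<answer...>" into its parts.
// Safe to call on every chunk: while the header line is still incomplete,
// the body stays empty so the header never flashes on screen.
function splitSourceHeader(raw) {
  const PREFIX = 'SOURCE:';
  if (!raw.startsWith(PREFIX)) return { source: null, count: null, body: raw };
  const newline = raw.indexOf('\n');
  if (newline === -1) return { source: null, count: null, body: '' };
  const [source, count] = raw.slice(PREFIX.length, newline).split('|');
  return { source, count: Number(count), body: raw.slice(newline + 1) };
}
```

Inside the read loop, the display step would then use `splitSourceHeader(raw).body` instead of `raw`, and the parsed source and count can drive the "Local Docs · 4 snippets" label.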
5. Pitfalls Encountered (This Saves Your Time)
| Pitfall | Symptom | Solution |
|---|---|---|
| huggingface blocked | transformers.js model download times out | Set env.remoteHost = 'https://hf-mirror.com' |
| Node v25 undici ignores HTTP_PROXY | Proxy set but fetch still connects directly | Use mirror instead of proxy, or setGlobalDispatcher |
| Vector threshold unreliable | Unrelated questions get high scores | Switch to LLM judging relevance |
| DeepSeek key with quotes | .env with KEY='sk-…' → Auth fails | No quotes: KEY=sk-… |
| Nested template strings | JS nested ${} inside HTML causes errors | Use concatenation instead of templates for the outer layer |
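For the quoted-key pitfall, removing the quotes from .env is the fix given in the table. A defensive alternative, sketched here as an assumption rather than what server.js does, is to normalize the value when reading it; `stripQuotes` is a hypothetical helper.

```javascript
// Removes one pair of matching surrounding quotes, if present.
// Values without quotes, or with mismatched quotes, are returned trimmed but otherwise unchanged.
function stripQuotes(value) {
  const v = value.trim();
  const first = v[0];
  const quoted = v.length >= 2 && (first === '"' || first === "'") && v[v.length - 1] === first;
  return quoted ? v.slice(1, -1) : v;
}
```

It would be applied once at startup, for example to the environment variable that holds the API key, before the value is handed to the model client.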
6. Why Frontend, Not Python?
Because the last mile of AI implementation is all frontend work:
- Typewriter effect, Markdown rendering → Frontend
- Streaming data transmission (SSE / chunked transfer) → Frontend
- Connecting AI into GitLab CI, Feishu Bot, DevOps → Frontend
Python can call APIs, write RAG, run models—but making AI "smoothly usable" is frontend.
My advice: stop grinding on the next frontend wheel. You are already a JS expert, LangChain has a complete JS SDK, DeepSeek has an OpenAI-compatible interface—you just need one weekend.
7. Next Steps
This setup is currently running in our internal demos. Next steps planned:
- Connect to GitLab API, auto-pull diff for each MR → RAG retrieves project conventions → AI generates review report
- Expand to Feishu Bot, team members can ask by @ mentioning it
- Feed in Swagger JSON too, "What fields does this interface return?" answered in seconds
Source Code
Generate it yourself using AI if needed.
Directory structure:
ai-share-demo/
├── server.js # Main program (~270 lines, contains all features)
├── step1-hello.js # Demo 1: 5 lines of JS to call DeepSeek
├── step2-structured.js # Demo 2: Zod structured JSON output
├── step3-rag.js # Demo 3: RAG standalone CLI version
├── docs/ # Knowledge base docs (add .md to expand knowledge)
│ ├── PDK.md
│ ├── Core.md
│ ├── Equipment.md
│ └── Workflow.md
├── package.json
└── .env.example
One-click start: npm run web, open browser at localhost:3456.
If this article makes you feel "I could probably do this too," that's the whole point of me writing it. Frontend has no ceiling, and JavaScript can do more than you think.
Written on the afternoon I got an AI assistant running on real company documents.