A 270-Line Node Server That Reads Your Private Codebase, Built in an Afternoon

1. The Reason: Drawing Pages All Day, Soon to Be Replaced?

Earlier this year, Cursor could write components, and GitHub Copilot could auto-complete. A company leader said in a meeting: "Can our GitLab MRs be automatically reviewed by AI?"

After the meeting, I sat at my desk and thought about two things:

No matter how well you build pages, AI will eventually eat most of the market for new page work.
But the last mile of putting AI into production (rendering UI, typewriter effects, streaming, hooking into DevOps pipelines) is all frontend work.

So I decided: instead of chasing the next frontend framework, I'll use my most familiar tool, JavaScript, to build an AI assistant that can read the private documentation of our four projects for the team.

2. See the Effect First (Simpler Than You Think)

One afternoon, npm install + a few lines of code, and the result:

📚 Local Knowledge Base Mode:

👨‍💻 Q: What is the difference between the E1 and E2 APIs?

🤖 A (📚 Local Docs · 4 snippets):

Dimension	E1 (legacy)	E2 (new)
Prefix	`/api/e1`	`/api/e2`
Success Indicator	`status: 0`	`code: 200`
Ajax Factory	`Ajax()`	`createE2Ajax()`

This is a real convention from my project's AGENTS.md, something ChatGPT could never answer.

🌐 Web Search Mode (auto-switches when local docs miss):

👨‍💻 Q: What's the weather like in Shenzhen today?

🤖 A (🌐 Web Search · 4 results): Shenzhen today is cloudy turning sunny, temperature 28-34°C...

This step integrates the Bocha AI search engine, which takes over automatically when local hits are insufficient.

All of this runs from a single file of code, with no external database, vector service, or Python environment.

3. Minimalist Architecture: If It Can Be One File, Don't Open Another Port

Question → 
  │
  ├─ Identify if it's a local question? (Contains pdk/core/equipment/workflow keywords?)
  │   ├─ Yes → RAG retrieves local docs/*.md → LLM judges if snippets are relevant?
  │   │       ├─ Relevant → Answer using local docs
  │   │       └─ Not relevant → Web search
  │   └─ No → Web search
  │
  └─ Stream output + Typewriter cursor + Markdown rendering
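
The branching in the diagram can be sketched as a small router. This is only a minimal sketch, not the code from server.js: the names `isLocalQuestion` and `route` are hypothetical, and `retrieve` and `judge` are injected stand-ins for the RAG lookup and the LLM relevance check so the sketch runs without a model.

```javascript
// Keywords taken from the flow diagram; everything else here is an assumption.
const LOCAL_KEYWORDS = ['pdk', 'core', 'equipment', 'workflow'];

// True when the question mentions any project keyword (case-insensitive).
function isLocalQuestion(question) {
  const q = question.toLowerCase();
  return LOCAL_KEYWORDS.some((kw) => q.includes(kw));
}

// Decide which path answers the question.
// `retrieve(question)` returns doc snippets; `judge(question, hits)` returns a boolean.
async function route(question, { retrieve, judge }) {
  if (!isLocalQuestion(question)) return { source: 'web', hits: [] };
  const hits = await retrieve(question);
  const relevant = hits.length > 0 && (await judge(question, hits));
  return relevant ? { source: 'local', hits } : { source: 'web', hits: [] };
}
```

Note that a plain substring match is coarse: "core" also matches "score". That is one reason the LLM relevance check sits behind the keyword filter.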

All five layers run in a single server.js (~270 lines):

Layer	Tech	Role
Model	DeepSeek (via LangChain)	Understands questions, generates answers
Vectorization	transformers.js + BGE-small-zh	Converts docs to vectors locally, no API calls, no cost
RAG	MemoryVectorStore	In-memory vector store, retrieves the most relevant doc snippets
Web	Bocha Web Search API	Auto web search when local docs miss
Delivery	Node native http + stream chunked	Typewriter effect + Markdown rendering

No Chroma/Pinecone, no Docker, no Python virtual environment.

4. Core Code Breakdown (Just the Essentials)

4.1 Local Embedding: $0, 0 API, Pure JS

The most hardcore and satisfying part of the whole system—I didn't use the OpenAI Embedding API, but an npm package called @huggingface/transformers, running the BGE-small-zh model directly in a local Node process:

import { pipeline } from '@huggingface/transformers';

const pipe = await pipeline('feature-extraction', 'Xenova/bge-small-zh-v1.5', { dtype: 'fp32' });
const vec = await pipe('The difference between E1 and E2 APIs', { pooling: 'mean', normalize: true });
// vec.data = Float32Array[384] ← This text became a 384-dimensional vector

Then wrapped as a LangChain Embeddings interface:

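class LocalEmbeddings {
  constructor(p = pipe) { this.pipe = p; } // defaults to the `pipe` created above
  async embedQuery(text) {
    const out = await this.pipe(text, { pooling: 'mean', normalize: true }); // pool tokens into one 384-d vector
    return Array.from(out.data);
  }
  async embedDocuments(texts) { return Promise.all(texts.map(t => this.embedQuery(t))); }
}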

The first run downloads the model (~100MB); later runs load it from the local cache. A hosted embedding API such as OpenAI's bills per token on every call, while this approach costs nothing per call.

⚠️ Pitfall: huggingface.co is unreachable from mainland China. Point transformers.js at the hf-mirror.com mirror before calling pipeline():
import { env } from '@huggingface/transformers';
env.remoteHost = 'https://hf-mirror.com';
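
To make the 384-dimensional vectors less abstract, here is a minimal sketch of how two of them are compared. Because the pipeline is called with `normalize: true`, the vectors have unit length, so cosine similarity reduces to a plain dot product. As far as I know, cosine similarity is also the default metric in MemoryVectorStore, but treat that as an assumption. The function names are mine.

```javascript
// Dot product of two equal-length numeric vectors.
function dot(a, b) {
  if (a.length !== b.length) throw new Error('vector lengths differ');
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity in [-1, 1]; for unit-length vectors this equals dot(a, b).
function cosineSimilarity(a, b) {
  const norm = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return norm === 0 ? 0 : dot(a, b) / norm;
}
```

In practice `a` would be the embedded question and `b` an embedded doc chunk; the retrieval step simply keeps the chunks with the highest scores.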

4.2 Complete RAG Flow (Read Docs → Chunk → Vectorize → Retrieve → Feed LLM)

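import { readdirSync, readFileSync } from 'node:fs';
import { extname, join } from 'node:path';
// LangChain import paths vary by version; these match the 0.3-era packages
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { Document } from '@langchain/core/documents';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';

// 1. Read all .md files under the docs/ directory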
const files = readdirSync('./docs').filter(f => extname(f) === '.md');
const raw = files.map(f => readFileSync(join('./docs', f), 'utf-8')).join('\n\n');

// 2. Chunk (500 chars per chunk, 80 char overlap)
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 80 });
const docs = await splitter.splitDocuments([new Document({ pageContent: raw })]);

// 3. Vectorize + store in memory
const store = await MemoryVectorStore.fromDocuments(docs, new LocalEmbeddings());

// 4. Retrieve the 4 most relevant chunks when asking a question
const hits = await store.similaritySearchWithScore(question, 4);

// 5. Feed to DeepSeek
const context = hits.map(([doc]) => doc.pageContent).join('\n\n');
const answer = await model.invoke([
  { role: 'system', content: `Answer only using the following documents:\n\n${context}` },
  { role: 'user', content: question }
]);

Once these five steps run through, your AI assistant can answer questions using your private documentation. The knowledge base currently holds the AGENTS.md files from our team's four projects: pdk, core, equipment, and workflow, totaling about 30KB of documents.
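
If "500 chars per chunk, 80 char overlap" feels abstract, the idea can be sketched with a plain sliding window. This is deliberately simplified and is not how RecursiveCharacterTextSplitter works internally (it prefers to break on separators such as paragraphs and lines); the function name `chunkText` is hypothetical.

```javascript
// Simplified fixed-window chunker, for illustration only.
// Each chunk repeats the last `chunkOverlap` characters of the previous one.
function chunkText(text, chunkSize = 500, chunkOverlap = 80) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap; // with the defaults, windows start 420 chars apart
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, which helps retrieval find it.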

4.3 Dual-Path Retrieval: Local First, LLM Judgment as Fallback, Web Search Only When Unreliable

Initially, I took a shortcut and used a vector similarity threshold: topScore > 0.5 counted as a hit. But for a question like "today's international crude oil price," which has nothing to do with code, BGE-small still returned a similarity of 0.4, too close to the cutoff to separate related from unrelated questions reliably.

Switched to letting the LLM judge for itself:

const checkChunks = hits.map(([doc]) => doc.pageContent.slice(0, 250)).join('\n---\n');
const check = await model.invoke([
  { role: 'system', content: 'Strictly judge: Can these snippets practically answer the user\'s question? If unsure, answer "cannot". Only reply can or cannot.' },
  { role: 'user', content: `Question: ${q}\n\nSnippets:\n${checkChunks}` }
]);

// Match "can" only: a plain includes('can') would also match "cannot"
if (/^\s*can\W*$/i.test(check.content)) {
  /* Use local docs */
} else {
  /* Use Bocha web search */
}

This proved far more reliable than a vector threshold. The LLM itself knows that "international crude oil" has nothing to do with the code conventions in AGENTS.md.
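
One detail deserves care when reading a can/cannot reply: the string "cannot" contains "can", so a plain substring check matches both answers. A minimal sketch of a stricter parser follows; the helper name `parseVerdict` is hypothetical, and treating every unexpected reply as "cannot" is my own choice, made to match the prompt's "if unsure" rule.

```javascript
// Maps the judge model's reply to a boolean.
// Only a bare "can" (any casing, optional punctuation or whitespace) counts as yes;
// "cannot", "can't", empty replies, and anything longer all count as no.
function parseVerdict(reply) {
  return /^\s*can\W*$/i.test(String(reply));
}
```

Falling back to web search on an ambiguous reply costs one extra search call, while wrongly answering from unrelated snippets produces a confidently wrong answer, so the conservative default seems the safer one.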

4.4 Streaming Output + Typewriter Cursor + Markdown Rendering

Three details that elevate the experience from "usable" to "pleasant":

// Backend: DeepSeek stream
const stream = await model.stream(messages);
res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8', 'Transfer-Encoding': 'chunked' });
res.write(`SOURCE:${source}|${count}\n`);
for await (const chunk of stream) {
  if (chunk.content) res.write(chunk.content);
}
res.end();

// Frontend: `res` is the fetch() Response; append each decoded chunk as it arrives
const decoder = new TextDecoder();
let raw = '';
const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  raw += decoder.decode(value, { stream: true });
  body.textContent = raw; // Characters pop out one by one
}
body.innerHTML = marked.parse(raw); // Render Markdown after completion

Paired with a CSS blinking cursor:

@keyframes blink { 0%,100% { opacity: 1 } 50% { opacity: 0 } }
.streaming::after { content: "|"; animation: blink .8s infinite; color: #7ee787; }

Characters pop out one by one, a green cursor blinks at the end—this is the "human touch" frontend adds to AI.
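
Since the backend writes a `SOURCE:${source}|${count}` line before the answer, the frontend has to peel that line off before rendering. Here is a minimal sketch of one way to do it; the helper name `splitSourceHeader` and the example source values `local` and `web` are assumptions, not taken from server.js.

```javascript
// Separates the "SOURCE:<source>|<count>\n" header line from the answer text.
// Call it on the accumulated `raw` string after every chunk.
function splitSourceHeader(raw) {
  const newline = raw.indexOf('\n');
  // No complete header line (yet): treat everything as body.
  if (!raw.startsWith('SOURCE:') || newline === -1) {
    return { source: null, count: null, body: raw };
  }
  const [source, count] = raw.slice('SOURCE:'.length, newline).split('|');
  return { source, count: Number(count), body: raw.slice(newline + 1) };
}
```

In the read loop you would render `splitSourceHeader(raw).body` instead of `raw`, and use `source` and `count` to fill in the "📚 Local Docs · 4 snippets" badge.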

5. Pitfalls Encountered (This Saves Your Time)

Pitfall	Symptom	Solution
huggingface blocked	transformers.js model download times out	Set `env.remoteHost = 'https://hf-mirror.com'`
Node v25 undici ignores HTTP_PROXY	Proxy set but fetch still connects directly	Use the mirror instead of a proxy, or call `setGlobalDispatcher(new ProxyAgent(proxyUrl))` from `undici`
Vector threshold unreliable	Unrelated questions get high scores	Switch to LLM judging relevance
DeepSeek key with quotes	`.env` with `KEY='sk-…'` → Auth fails	No quotes: `KEY=sk-…`
Nested template strings	JS nested `${}` inside HTML causes errors	Use concatenation instead of templates for the outer layer
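
For the quoted-key pitfall, removing the quotes from `.env` is the real fix. If you also want a defensive guard in code, a small normalizer along these lines would do; this is a sketch with a hypothetical helper name, not something server.js contains.

```javascript
// Removes one pair of matching surrounding quotes from an env value.
// Unquoted or mismatched values are returned trimmed but otherwise unchanged.
function stripQuotes(value) {
  const v = String(value ?? '').trim();
  const first = v[0];
  const last = v[v.length - 1];
  const quoted = v.length >= 2 && (first === '"' || first === "'") && last === first;
  return quoted ? v.slice(1, -1) : v;
}
```

Usage would look like `const apiKey = stripQuotes(process.env.KEY);` before handing the key to the model client.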

6. Why Frontend, Not Python?

Because the last mile of AI implementation is all frontend work:

Typewriter effect, Markdown rendering → Frontend
Streaming data transmission (SSE / chunked transfer) → Frontend
Connecting AI into GitLab CI, Feishu Bot, DevOps → Frontend

Python can call APIs, write RAG, run models—but making AI "smoothly usable" is frontend.

My advice: stop grinding on the next frontend wheel. You are already a JS expert, LangChain has a complete JS SDK, DeepSeek has an OpenAI-compatible interface—you just need one weekend.

7. Next Steps

This setup is currently running in our internal demos. Next steps planned:

Connect to GitLab API, auto-pull diff for each MR → RAG retrieves project conventions → AI generates review report
Expand to Feishu Bot, team members can ask by @ mentioning it
Feed in Swagger JSON too, "What fields does this interface return?" answered in seconds

Source Code

Generate it yourself using AI if needed.

Directory structure:

ai-share-demo/
├── server.js          # Main program (~270 lines, contains all features)
├── step1-hello.js     # Demo 1: 5 lines of JS to call DeepSeek
├── step2-structured.js # Demo 2: Zod structured JSON output
├── step3-rag.js       # Demo 3: RAG standalone CLI version
├── docs/              # Knowledge base docs (add .md to expand knowledge)
│   ├── PDK.md
│   ├── Core.md
│   ├── Equipment.md
│   └── Workflow.md
├── package.json
└── .env.example

One-click start: npm run web, open browser at localhost:3456.

If this article makes you feel "I could probably do this too," that's the whole point of me writing it. Frontend has no ceiling, and JavaScript can do more than you think.

——Written on an afternoon when an AI assistant was successfully run using real company documents.