Building a Tiny JavaScript VM in C++: Lexer, Parser, AST Interpreter, and the Road to Bytecode
Most developers treat JavaScript engines as magic. Building a tiny one from scratch exposes the concrete data structures and algorithms behind every `let`, `while`, and function call, and the resulting mental model carries over to debugging and performance work in production engines like V8.
A new C++ project, MiniJS, builds a working JavaScript VM from scratch, not to compete with V8, but to isolate and understand each layer of a language engine. The current implementation runs a controlled JS subset through a full pipeline: a lexer tokenizes source code, a recursive-descent parser builds an AST, and a tree-walking interpreter executes it against a runtime with scoped environments, functions, arrays, and error handling. The architecture deliberately avoids the full ECMAScript spec to prevent drowning in edge cases before the core loop is solid.
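To make the first pipeline stage concrete, here is a minimal tokenizer sketch. It is not MiniJS's actual lexer: the `TokenKind` set, the `Token` struct, the `tokenize` function, and the keyword list are all assumptions chosen for illustration, and it handles only integers, identifiers, a few keywords, and single-character punctuators.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for a small JS subset; the real project may differ.
enum class TokenKind { Number, Identifier, Keyword, Punctuator, EndOfFile };

struct Token {
    TokenKind kind;
    std::string text;
};

// Turns source text into a flat token list. It performs no grammar checks:
// "let let = ;" tokenizes without complaint and is left for the parser to reject.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    auto at = [&src](std::size_t k) { return static_cast<unsigned char>(src[k]); };

    while (i < src.size()) {
        if (std::isspace(at(i))) { ++i; continue; }

        if (std::isdigit(at(i))) {
            // Integer literal: consume a run of digits.
            std::size_t start = i;
            while (i < src.size() && std::isdigit(at(i))) ++i;
            out.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else if (std::isalpha(at(i)) || src[i] == '_' || src[i] == '$') {
            // Identifier or keyword: letters, digits, '_' and '$'.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(at(i)) || src[i] == '_' || src[i] == '$')) ++i;
            std::string word = src.substr(start, i - start);
            bool isKeyword = word == "let" || word == "while" || word == "if" ||
                             word == "else" || word == "function" || word == "return";
            out.push_back({isKeyword ? TokenKind::Keyword : TokenKind::Identifier, word});
        } else {
            // Everything else is treated as a one-character punctuator.
            out.push_back({TokenKind::Punctuator, std::string(1, src[i])});
            ++i;
        }
    }
    out.push_back({TokenKind::EndOfFile, ""});
    return out;
}
```

Calling `tokenize("let x = 42;")` yields five tokens plus an end-of-file marker. A recursive-descent parser would then consume this list one token at a time, which keeps the two stages independently testable.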
The project is designed for phased evolution. With the AST interpreter serving as a semantic baseline, the next steps compile the AST to a stack-based bytecode VM, introduce a proper object system with prototype chains, and eventually implement a mark-sweep garbage collector. Each phase produces testable, verifiable output, and the same test suite will validate both the interpreter and the future VM to ensure behavioral parity.
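The planned AST-to-bytecode step, and the idea of checking it against the tree-walker, can be sketched on a toy expression language. Everything here is an assumption made for illustration (the `Node` layout, the three-opcode `Op` set, and the function names `evalTree`, `compile`, and `runBytecode`); the project's real instruction set will need far more, such as jumps, variables, and calls.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical AST: numeric literals plus binary '+' and '*'.
struct Node {
    char op;                        // 'n' = number literal, otherwise '+' or '*'
    double value;                   // meaningful only when op == 'n'
    std::unique_ptr<Node> lhs, rhs;
};

std::unique_ptr<Node> num(double v) {
    return std::unique_ptr<Node>(new Node{'n', v, nullptr, nullptr});
}
std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    return std::unique_ptr<Node>(new Node{op, 0.0, std::move(l), std::move(r)});
}

// Tree-walking evaluation: the semantic baseline.
double evalTree(const Node& n) {
    if (n.op == 'n') return n.value;
    double a = evalTree(*n.lhs);
    double b = evalTree(*n.rhs);
    return n.op == '+' ? a + b : a * b;
}

// Hypothetical stack-machine instruction set.
enum class Op { PushConst, Add, Mul };
struct Instr { Op op; double operand; };

// Post-order walk: emit code for both operands, then the operator.
void compile(const Node& n, std::vector<Instr>& out) {
    if (n.op == 'n') { out.push_back({Op::PushConst, n.value}); return; }
    compile(*n.lhs, out);
    compile(*n.rhs, out);
    out.push_back({n.op == '+' ? Op::Add : Op::Mul, 0.0});
}

// Stack VM: operands are pushed, operators pop two values and push one.
double runBytecode(const std::vector<Instr>& code) {
    std::vector<double> stack;
    for (const Instr& in : code) {
        if (in.op == Op::PushConst) { stack.push_back(in.operand); continue; }
        if (stack.size() < 2) throw std::runtime_error("stack underflow");
        double b = stack.back(); stack.pop_back();
        double a = stack.back(); stack.pop_back();
        stack.push_back(in.op == Op::Add ? a + b : a * b);
    }
    if (stack.size() != 1) throw std::runtime_error("malformed program");
    return stack.back();
}
```

For the tree built by `bin('+', num(1), bin('*', num(2), num(3)))`, `compile` emits five instructions (three pushes, `Mul`, `Add`). A parity test then only has to assert that `runBytecode` and `evalTree` return the same value for every program in the suite.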
This approach treats a language engine as an engineering problem to be decomposed, not a monolith to be copied. The result is a learning tool that demonstrates how tokens become trees, how scope chains resolve variables, and how a runtime can grow from a simple evaluator into a managed, stack-based virtual machine.
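Scope-chain resolution is easiest to see in code. The sketch below assumes a hypothetical `Environment` class in which each scope holds its own variables plus a pointer to its enclosing scope; values are plain `double`s to keep it short, and the method names are illustrative rather than taken from MiniJS.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical scoped environment: one instance per block or function call.
class Environment {
public:
    explicit Environment(std::shared_ptr<Environment> parent = nullptr)
        : parent_(std::move(parent)) {}

    // Declaration always lands in the innermost scope and may shadow an
    // outer name. (Simplification: redeclaring in the same scope overwrites
    // here instead of reporting an error.)
    void define(const std::string& name, double value) { vars_[name] = value; }

    // Lookup walks outward through enclosing scopes until the name is found.
    double get(const std::string& name) const {
        for (const Environment* e = this; e != nullptr; e = e->parent_.get()) {
            auto it = e->vars_.find(name);
            if (it != e->vars_.end()) return it->second;
        }
        throw std::runtime_error("ReferenceError: " + name + " is not defined");
    }

    // Assignment updates the nearest scope that already holds the name.
    void assign(const std::string& name, double value) {
        for (Environment* e = this; e != nullptr; e = e->parent_.get()) {
            auto it = e->vars_.find(name);
            if (it != e->vars_.end()) { it->second = value; return; }
        }
        throw std::runtime_error("ReferenceError: " + name + " is not defined");
    }

private:
    std::unordered_map<std::string, double> vars_;
    std::shared_ptr<Environment> parent_;
};
```

An inner scope created with `Environment inner(global)` sees the global variables through `get`, can shadow them with `define`, and mutates the outer binding with `assign`. Holding the parent through `std::shared_ptr` is one way to keep an enclosing scope alive for closures; it is a design choice of this sketch, not something the source confirms.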
Treating the AST interpreter as a semantic oracle for the future bytecode VM is a practical form of differential testing: any program on which the two disagree exposes a bug in one of them. Production engine teams apply the same idea across their execution tiers.
Separating the lexer from the parser this strictly, with no grammar validation in the tokenizer, is a design discipline that prevents a common source of tangled, hard-to-test frontends.
Building a JS subset first is not a compromise; it recognizes that language features interact, so complexity grows multiplicatively, and stabilizing the core pipeline first keeps debugging costs from compounding.
The project implicitly argues that understanding an engine requires building the wrong thing first (a slow tree interpreter) to establish what correct behavior even looks like.
Reference semantics for arrays sneak in heap-like complexity early, forcing the runtime to confront aliasing and mutation before objects formally exist.
Many developers conflate 'learning a language' with 'learning its implementation,' but this project shows that the implementation is a separate, layered system where each layer can be understood in isolation.
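The array aliasing mentioned above can be shown in a few lines. This sketch assumes a runtime that represents an array value as a shared handle to heap storage; `ArrayRef`, `makeArray`, and `push` are hypothetical names, and elements are restricted to `double` for brevity.

```cpp
#include <initializer_list>
#include <memory>
#include <vector>

// Hypothetical array value: a handle to shared, heap-allocated storage.
// Copying the handle (JS: `let b = a;`) copies the reference, not the elements.
using ArrayRef = std::shared_ptr<std::vector<double>>;

ArrayRef makeArray(std::initializer_list<double> items) {
    return std::make_shared<std::vector<double>>(items);
}

// JS: `arr.push(v)`. The mutation is visible through every alias of arr.
void push(const ArrayRef& arr, double v) {
    arr->push_back(v);
}

// JS-style identity comparison (`a === b`): same storage, not same contents.
bool sameArray(const ArrayRef& a, const ArrayRef& b) {
    return a.get() == b.get();
}
```

After `ArrayRef b = a; push(b, 3);` the element is visible through `a` as well, and two separately created arrays with equal contents are still distinct under `sameArray`. Reference counting via `std::shared_ptr` is only a stand-in here: it cannot reclaim cycles, which is one reason a tracing collector such as mark-sweep becomes relevant once values can refer to each other.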