A 200-Line Python Flask Service That Posts AI Code Reviews Straight to GitLab MRs
Teams waste hours writing repetitive review comments about issues an LLM can catch in seconds. This pattern turns a spec document into an enforceable, always-on reviewer without changing the language or framework your CI pipeline already uses, and the 200-line footprint means a single developer can own it.
A lightweight Python service automates the first pass of code review by connecting GitLab CI to any OpenAI-compatible large language model. When a merge request opens, the service fetches the diff via GitLab's API, constructs a prompt that includes a synced team specification file, and calls the model for a structured review. The resulting JSON, which contains an overall score and a list of file-level issues with their severity, gets posted back to the MR as a comment.
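The flow above can be sketched in a few functions. This is a minimal illustration, not the service's actual code: the host `gitlab.example.com`, the function names, and the exact JSON field names (`score`, `issues`, `file`, `severity`, `message`) are assumptions, and the standard-library `urllib` stands in for `requests` so the sketch has no dependencies. The notes endpoint and the `PRIVATE-TOKEN` header are part of GitLab's REST API.

```python
import json
import urllib.request

# Assumed host; replace with your own GitLab instance.
GITLAB_API = "https://gitlab.example.com/api/v4"


def build_prompt(spec_text: str, diff_text: str) -> str:
    """Combine the team spec and the MR diff into a single review prompt."""
    return (
        "You are a code reviewer. Apply the team rules below to the diff.\n"
        'Reply with JSON only: {"score": int, "issues": '
        '[{"file": str, "severity": str, "message": str}]}\n\n'
        f"Team rules:\n{spec_text}\n\nDiff:\n{diff_text}"
    )


def format_comment(review: dict) -> str:
    """Render the parsed review JSON as the text of an MR comment."""
    lines = [f"AI review score: {review['score']}"]
    for issue in review.get("issues", []):
        lines.append(
            f"- [{issue['severity']}] {issue['file']}: {issue['message']}"
        )
    return "\n".join(lines)


def post_comment(project_id: int, mr_iid: int, token: str, body: str) -> None:
    """Post the comment as a note on the merge request (network call)."""
    url = f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/notes"
    req = urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

Keeping `build_prompt` and `format_comment` free of network calls makes them easy to unit-test without a GitLab instance or a model endpoint.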
The entire stack runs on roughly 200 lines of Flask code. A separate Git repository holds the review rules, which the service clones on startup so that changing the spec never requires touching the application code. Environment variables control the model endpoint, GitLab token, and spec repo URL, making it straightforward to swap between OpenAI, a local Ollama model, or a corporate LLM proxy.
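One way to read that configuration is sketched below. The variable names (`LLM_BASE_URL`, `LLM_MODEL`, `GITLAB_TOKEN`, `SPEC_REPO_URL`) and the `Settings` class are hypothetical and may differ from the names the guide uses. The default base URL is OpenAI's public API; Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_base_url: str   # any OpenAI-compatible endpoint
    llm_model: str
    gitlab_token: str
    spec_repo_url: str  # Git repository holding the review rules


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ).

    Variable names here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    required = ("LLM_MODEL", "GITLAB_TOKEN", "SPEC_REPO_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        llm_base_url=env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        llm_model=env["LLM_MODEL"],
        gitlab_token=env["GITLAB_TOKEN"],
        spec_repo_url=env["SPEC_REPO_URL"],
    )
```

Switching to a local model then means changing only `LLM_BASE_URL` and `LLM_MODEL`. Failing fast on missing variables at startup surfaces a misconfigured `.env` before the first merge request arrives.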
Frontend developers who know Express and axios can follow the logic directly: Flask maps to Express, the `requests` library maps to axios, and `python-dotenv` mirrors the Node `dotenv` package. The guide walks through installation, `.env` configuration, a line-by-line breakdown of the code, GitLab CI integration, and the most common failure modes—401 token errors, Docker networking gotchas, and JSON parsing failures from the model.
Treating the review specification as a separate, version-controlled artifact decouples policy from implementation and lets non-engineers contribute rules.
The hardest part of an AI review pipeline is not the model call but the plumbing: GitLab API authentication, diff extraction, and reliable JSON parsing from LLM output.
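Models often wrap JSON in a Markdown code fence or surround it with chatty text, which is why a bare `json.loads` call fails. The following is one possible tolerant parser, not the method the guide prescribes; the function name `parse_review` is an assumption. It tries the raw text first, then a fenced block, then the outermost pair of braces.

```python
import json
import re

# Build the triple-backtick marker programmatically so this snippet can
# itself live inside a fenced code block.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def parse_review(raw: str) -> dict:
    """Best-effort extraction of a JSON object from model output."""
    candidates = [raw.strip()]

    # Candidate 2: contents of the first fenced block, if any.
    match = _FENCE.search(raw)
    if match:
        candidates.append(match.group(1).strip())

    # Candidate 3: everything between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            return data
    raise ValueError("model output did not contain a JSON object")
```

Raising `ValueError` when nothing parses lets the caller decide whether to retry the model call or post a fallback comment instead of crashing the request.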
Flask's synchronous-by-default model is often a better fit for small integration services than async Python frameworks: each handler runs top to bottom, which avoids async-specific bugs such as a blocking call stalling the event loop or a forgotten `await`.
Hardcoding a test token in a local-only service is a pragmatic trade-off that avoids secret-management overhead during development, provided it is replaced before any network exposure.
Frontend developers can transfer their mental model of Express + axios directly to Flask + requests; the syntax differences are minor compared to the architectural patterns.
LLM-based review works best as a first-pass filter for mechanical issues (null checks, naming conventions, missing error handling) rather than as a replacement for architectural judgement.