跪拜 Guibai
← All articles
Backend · Artificial Intelligence

Your RTX 5060 Ti Is Now a Local AI Colleague: Running Qwen3.6-35B-A3B with LM Studio and Open WebUI

By 雪隐_上班了 ·
Read original on juejin.cn ↗ Google Translate ↗ Alt translation

This workflow proves that a $300–400 consumer GPU can run a capable local LLM without cloud dependencies, API costs, or data leaks. For Western developers facing tightening privacy regulations or simply wanting offline AI, this is a replicable blueprint — not a theoretical benchmark.

Summary

A practical guide demonstrates how to transform a consumer-grade RTX 5060 Ti 16GB into a local AI workstation. The key is pairing LM Studio — a zero-configuration local model runner — with Qwen3.6-35B-A3B, a Mixture-of-Experts model from Alibaba's Tongyi Lab that packs 35 billion total parameters but activates only 3 billion per inference. This sparse activation keeps VRAM usage at roughly 16GB (Q4_K_M quantization), fitting perfectly on a single mid-range GPU.

The setup goes beyond basic chat. LM Studio exposes a fully OpenAI-compatible API, meaning any existing codebase that calls GPT can be redirected to the local model with a single URL change. For a richer interface, Open WebUI runs in Docker and connects to LM Studio's API, adding features like multi-model switching, chat history, file uploads, and plugin support — all while keeping data entirely offline.

The author also shares hard-won optimization tips: keep context length at 8K-16K (not the model's max 262K), close browser tabs to free VRAM, and drop to Q3_K_M quantization if memory runs tight. The result is a private, free, and unlimited AI assistant for coding, document Q&A, and sensitive data processing.

Takeaways
Qwen3.6-35B-A3B uses a Mixture-of-Experts (MoE) architecture: 35B total parameters, only 3B activated per inference.
The Q4_K_M quantized version fits in ~16GB VRAM, matching the RTX 5060 Ti 16GB's capacity exactly.
LM Studio provides an OpenAI-compatible API endpoint at localhost:1234/v1, enabling drop-in replacement for cloud APIs.
Open WebUI deployed via Docker adds a ChatGPT-like interface with file upload, history, and plugin support.
Recommended settings for 16GB VRAM: context_length 8192, gpu_layers 35, threads equal to CPU core count.
ModelScope download is recommended over HuggingFace for users in China due to faster speeds.
Data never leaves the local machine, making the setup suitable for medical, legal, or financial use cases.
The entire software stack — LM Studio, Open WebUI, Docker Desktop — is free and open-source.
Conclusions

The MoE architecture is the real enabler here: it lets a mid-range GPU run a model that would otherwise require a multi-GPU server, making local LLMs practical for individual developers.

The OpenAI API compatibility is a strategic design choice — it lowers switching costs to nearly zero, which is why LM Studio is gaining traction over more complex alternatives like Ollama or vLLM.

Running at 15.8/16.0 GB VRAM is a deliberate edge-case optimization. It works, but leaves no headroom for other GPU tasks, which limits multitasking.

The author's framing of the GPU as a 'colleague' reflects a broader cultural shift in Chinese developer communities: AI is seen as a productivity multiplier, not a threat.

Open WebUI's file upload feature solves a real pain point: LM Studio's native interface lacks vision support, so users with multimodal models must rely on third-party UIs.

Concepts & terms
Mixture of Experts (MoE)
A neural network architecture where different 'expert' subnetworks handle different types of inputs. Only a subset of experts is activated per inference, reducing computation while keeping high total parameter count.
Quantization (Q4_K_M)
A compression technique that reduces model precision from 16-bit floats to 4-bit integers, shrinking memory footprint with minimal quality loss. Q4_K_M is a specific quantization scheme balancing size and accuracy.
GGUF
A file format for storing quantized LLM weights, designed for efficient loading and inference on consumer hardware. It is the standard format used by LM Studio and llama.cpp.
LM Studio
A desktop application that downloads, manages, and runs local LLMs with GPU acceleration. It provides a GUI and an OpenAI-compatible API server, requiring no command-line setup.
Open WebUI
A self-hosted web interface for LLMs, compatible with OpenAI API endpoints. It offers features like chat history, file uploads, multi-model switching, and plugin support, running in Docker.
Source: juejin.cn ↗ Google Translate ↗ Backup ↗