Backend · Artificial Intelligence

Your RTX 5060 Ti Is Now a Local AI Colleague: Running Qwen3.6-35B-A3B with LM Studio and Open WebUI

By 雪隐_上班了 · Jun 25, 2026

Read original on juejin.cn ↗ Google Translate ↗ Alt translation

This workflow proves that a $300–400 consumer GPU can run a capable local LLM without cloud dependencies, API costs, or data leaks. For Western developers facing tightening privacy regulations or simply wanting offline AI, this is a replicable blueprint — not a theoretical benchmark.

Summary

A practical guide demonstrates how to transform a consumer-grade RTX 5060 Ti 16GB into a local AI workstation. The key is pairing LM Studio — a zero-configuration local model runner — with Qwen3.6-35B-A3B, a Mixture-of-Experts model from Alibaba's Tongyi Lab that packs 35 billion total parameters but activates only 3 billion per inference. This sparse activation keeps VRAM usage at roughly 16GB (Q4_K_M quantization), fitting perfectly on a single mid-range GPU.

The setup goes beyond basic chat. LM Studio exposes a fully OpenAI-compatible API, meaning any existing codebase that calls GPT can be redirected to the local model with a single URL change. For a richer interface, Open WebUI runs in Docker and connects to LM Studio's API, adding features like multi-model switching, chat history, file uploads, and plugin support — all while keeping data entirely offline.

The author also shares hard-won optimization tips: keep context length at 8K-16K (not the model's max 262K), close browser tabs to free VRAM, and drop to Q3_K_M quantization if memory runs tight. The result is a private, free, and unlimited AI assistant for coding, document Q&A, and sensitive data processing.

Takeaways

— Qwen3.6-35B-A3B uses a Mixture-of-Experts (MoE) architecture: 35B total parameters, only 3B activated per inference.

— The Q4_K_M quantized version fits in ~16GB VRAM, matching the RTX 5060 Ti 16GB's capacity exactly.

— LM Studio provides an OpenAI-compatible API endpoint at localhost:1234/v1, enabling drop-in replacement for cloud APIs.

— Open WebUI deployed via Docker adds a ChatGPT-like interface with file upload, history, and plugin support.

— Recommended settings for 16GB VRAM: context_length 8192, gpu_layers 35, threads equal to CPU core count.

— ModelScope download is recommended over HuggingFace for users in China due to faster speeds.

— Data never leaves the local machine, making the setup suitable for medical, legal, or financial use cases.

— The entire software stack — LM Studio, Open WebUI, Docker Desktop — is free and open-source.

Conclusions

The MoE architecture is the real enabler here: it lets a mid-range GPU run a model that would otherwise require a multi-GPU server, making local LLMs practical for individual developers.

The OpenAI API compatibility is a strategic design choice — it lowers switching costs to nearly zero, which is why LM Studio is gaining traction over more complex alternatives like Ollama or vLLM.

Running at 15.8/16.0 GB VRAM is a deliberate edge-case optimization. It works, but leaves no headroom for other GPU tasks, which limits multitasking.

The author's framing of the GPU as a 'colleague' reflects a broader cultural shift in Chinese developer communities: AI is seen as a productivity multiplier, not a threat.

Open WebUI's file upload feature solves a real pain point: LM Studio's native interface lacks vision support, so users with multimodal models must rely on third-party UIs.

Concepts & terms

Mixture of Experts (MoE)

A neural network architecture where different 'expert' subnetworks handle different types of inputs. Only a subset of experts is activated per inference, reducing computation while keeping high total parameter count.

Quantization (Q4_K_M)

A compression technique that reduces model precision from 16-bit floats to 4-bit integers, shrinking memory footprint with minimal quality loss. Q4_K_M is a specific quantization scheme balancing size and accuracy.

GGUF

A file format for storing quantized LLM weights, designed for efficient loading and inference on consumer hardware. It is the standard format used by LM Studio and llama.cpp.

LM Studio

A desktop application that downloads, manages, and runs local LLMs with GPU acceleration. It provides a GUI and an OpenAI-compatible API server, requiring no command-line setup.

Open WebUI

A self-hosted web interface for LLMs, compatible with OpenAI API endpoints. It offers features like chat history, file uploads, multi-model switching, and plugin support, running in Docker.

Source: juejin.cn ↗ Google Translate ↗ Backup ↗