Your RTX 5060 Ti Is Now a Local AI Colleague: Running Qwen3.6-35B-A3B with LM Studio and Open WebUI
by 雪隐_上班了 from juejin.cn/user/143341…
> Welcome to share and aggregate, but please don't repost the full article. Respect copyright. The community is small. If you need it urgently, contact me for authorization.
Preface: My 5060Ti Is Becoming More and More Versatile — It's Now My Colleague
What did we talk about last time? First time: 5060Ti learned to draw (FLUX.2 Klein), so now the product manager no longer chases me to change images. Second time: 5060Ti learned to mimic human speech (Qwen3-TTS). After I cloned my boss's voice, whatever I say goes.
This time it's even more impressive — I taught it to chat with itself.
That's right, local large language model (LLM) deployment turns my 5060Ti 16GB into an AI colleague that can converse anytime. No internet required, no API costs, no worry about sensitive data being sent to the cloud. Chat as long as you want, and it never gets annoyed.
This time I'm using LM Studio + Qwen3.6-35B-A3B, a big MoE (Mixture of Experts) model — 35 billion total parameters, but only 3 billion activated. It's like a strongman using only a fraction of his strength, saving power while being efficient.
And my 5060Ti 16GB is just enough to feed it (the quantized Q4_K_M version is about 16GB, right at the edge of full VRAM — exciting, right?).
Project Overview (Check Out My "Cyber Colleague's" Configuration)
Hardware Environment (My Precious Computer)
| Configuration | My Actual Situation | Comments |
|---|---|---|
| GPU | RTX 5060 Ti 16GB | It hurt to buy it, but the more I use it, the more I feel it's worth it |
| RAM | 32GB (DDR5, I bit the bullet) | AI eats RAM like a starving person at a buffet; 32GB barely cuts it |
| Storage | 50GB free | I deleted a bunch of study materials to make room for the model (don't ask what kind) |
| System | Windows 11 | As stable as my habit of slacking off on time every day |
Software Environment (All Free Tools, Kudos)
| Software | Description | My Review |
|---|---|---|
| LM Studio | The ultimate local model deployment tool, ready out of the box | A thousand times easier to set up than a Python environment; just double-click and run |
| GPU Driver | A current NVIDIA driver (the RTX 50 series needs a much newer release than the old 535 branch) | Just install it, no need to fiddle with CUDA; LM Studio handles everything |
| CUDA | Built into LM Studio | It installs itself quietly; you don't even need to know where it is |
About the Model: Qwen3.6-35B-A3B, the "Saver"
This is an open-source MoE (Mixture of Experts) model from Alibaba's Tongyi Lab. In plain English:
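It doesn't make all 35 billion employees work on every token. A small router picks the few experts best suited to each token, so only about 3 billion parameters do any work at a time while the rest sit idle. Hence the term "sparse activation": less computation per token, so it's fast and saves power. The catch is that every employee still needs a desk: all 35 billion parameters have to be loaded, so MoE saves compute, not VRAM.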
| Parameter | Value | Translation |
|---|---|---|
| Total Parameters | 35 billion | Total company employees |
| Activated Parameters | 3 billion | Only these people are woken up each time |
| Architecture | Sparse Mixture of Experts (MoE) | Each expert has their own specialty |
| Context Length | 262K tokens | Equivalent to reading the entire "Three-Body Problem" trilogy in one go while remembering the details |
How great is the MoE architecture? Roughly the capability of a 35-billion-parameter model at the per-token compute cost of a 3-billion one. My 5060Ti 16GB runs it without the fans even speeding up.
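The sparse-activation idea fits in a few lines. This is a toy illustration, not Qwen's actual router: the expert count, the scores, and the choice of k=2 are all made-up numbers.

```python
import math

def top_k_route(router_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    router_scores: one score per expert for a single token (toy values).
    Returns a list of (expert_index, weight); unselected experts are never run.
    """
    # Indices of the k largest scores, highest first
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    # Softmax over the chosen experts only, so their weights sum to 1
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical token routed across 8 experts, of which only 2 do any work
scores = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routes = top_k_route(scores, k=2)
print(routes)
```

Only the selected experts' weights take part in the computation for that token, which is where the "3 billion out of 35 billion" (about 8.6%) saving comes from.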
How to Choose the Quantization Version? Look at Your VRAM, Don't Be Greedy
Model files come in different "quantization levels," like compressing an image: more compression means slightly lower quality but much smaller size.
| Quantization Level | VRAM Requirement | Quality | Suitable For |
|---|---|---|---|
| Q4_K_M | ~16GB | High (basically imperceptible loss) | 5060Ti users like me, right at the edge of full VRAM, very exciting to use |
| Q5_K_S | ~18GB | Very high | Rich folks with 20GB+ VRAM |
| Q8_0 | ~22GB | Near lossless | Go for it, big spenders |
I chose Q4_K_M — 16GB VRAM is almost fully utilized, but inference speed is still smooth.
Every time I see VRAM: 15.8GB / 16.0GB at the bottom of LM Studio, I feel like I'm walking a tightrope, but I never fall — it's awesome.
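If you'd rather let code do the choosing, here is a minimal sketch. The sizes are copied from the table above (plus the ~12GB Q3_K_M figure from the VRAM-saving tips further down), so they are approximations. `pick_quant` and `headroom_gb` are names I made up for illustration.

```python
# Approximate VRAM needs in GB, copied from this article (not measured here)
QUANT_VRAM_GB = {
    "Q3_K_M": 12.0,
    "Q4_K_M": 16.0,
    "Q5_K_S": 18.0,
    "Q8_0": 22.0,
}

def pick_quant(vram_gb, headroom_gb=0.0):
    """Return the largest listed quantization that fits, or None if none does.

    headroom_gb reserves VRAM for the context cache and other programs.
    """
    budget = vram_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_VRAM_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None

print(pick_quant(16))                   # Q4_K_M
print(pick_quant(16, headroom_gb=1.5))  # Q3_K_M
print(pick_quant(8))                    # None
```

The second call shows why running right at the edge is a tightrope: reserve even a little headroom and the table says to drop one level.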
What is LM Studio? In One Sentence: The "Point-and-Shoot Camera" for Your GPU
Why Use It?
| Feature | Meaning for a Lazy Person Like Me |
|---|---|
| 🎯 Ready Out of the Box | Download → Double-click to install → Open → Select model → Start chatting. No command line needed at all |
| 🔧 Ridiculously Simple | The interface has just a few buttons, simpler than Word |
| 🔥 GPU Acceleration | Automatically detects my 5060Ti, enables it with one click, no need to set CUDA_VISIBLE_DEVICES |
| 🌐 Local API | Provides an API identical to OpenAI's, so my code can switch seamlessly |
| 📁 Built-in Model Downloader | Search for models within the software, click to download, no need to browse the web |
Installation Steps (Really Just Three Steps, I Swear)
Step 1: Download LM Studio
Visit the official website, https://lmstudio.ai/, and click to download the Windows version. The installer is about 200MB, faster than downloading a game patch.
The installation process is just "Next → Next → I Agree → Finish" all the way, no pitfalls.
Step 2: Download the Model (Two Methods, Recommend the Second)
Method 1: LM Studio Built-in Downloader (Suitable for Good Internet)
- Open LM Studio, click the "Search" icon on the left
- Search for `Qwen3.6-35B`
- In the results, find the `Q4_K_M` version (check the file size, about 16GB)
- Click Download, then go get a coffee and wait for it to finish
Method 2: Download via ModelScope (Suitable for China, Super Fast)
I highly recommend this because HuggingFace is as slow as a snail in China:
pip install modelscope
modelscope download --model LLM-Research/Qwen3.6-35B-A3B-GGUF --local_dir ./models
The download is about 16GB, which took me about as long as one takeout delivery (about 20 minutes).
Then drag the downloaded .gguf file into LM Studio's model folder (or just drag it into the software window, it recognizes it).
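Before dragging anything around, it helps to see what actually landed on disk. A small sketch, assuming the files went into `./models` as in the command above (`find_gguf_files` is my own helper name):

```python
from pathlib import Path

def find_gguf_files(model_dir):
    """Return (path, size_in_GB) for every .gguf file under model_dir, largest first."""
    root = Path(model_dir)
    if not root.is_dir():
        return []  # nothing downloaded yet, or a wrong path
    found = [(p, p.stat().st_size / 1e9) for p in root.rglob("*.gguf") if p.is_file()]
    return sorted(found, key=lambda item: item[1], reverse=True)

# Prints nothing if ./models does not exist or holds no .gguf files
for path, size_gb in find_gguf_files("./models"):
    print(f"{size_gb:6.2f} GB  {path}")
```

If the listing shows more than one quantization, pick the file whose size matches the version you meant to download.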
Step 3: Load the Model and Start Chatting
- In the middle area of LM Studio, click to load, or just drag the `.gguf` file in
- Wait about 1-2 minutes (it's loading the model into VRAM)
- When you see `Qwen3.6-35B-A3B-Q4_K_M - 15.8GB/16.0GB` at the bottom, it's successful
- Type in the input box below, press Enter, and it starts answering
Shortcuts (Just remember these two):
- `Ctrl + Enter`: Send message
- `Ctrl + Shift + Delete`: Clear conversation (pretend nothing happened)
Local API Service: Let My Code Tease It Too
LM Studio comes with an API server compatible with OpenAI, meaning I can use any code that calls OpenAI and seamlessly switch to the local model.
Start the API Server (One-Click Operation)
- Click the "Server" icon on the left (looks like a plug)
- Click the "Start Server" button
- See the address: `http://localhost:1234/v1`
That's it? That's it. No need to configure environment variables, no uvicorn, one click and it's done.
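To confirm the server really is up, you can ask it which models it sees. A minimal sketch using only the standard library, assuming the server exposes the OpenAI-style `/v1/models` endpoint at the address shown above. The function names are mine.

```python
import json
import urllib.error
import urllib.request

def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def list_models(base_url="http://localhost:1234/v1", timeout=3.0):
    """Return the model ids the server reports, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

Calling `print(list_models())` with the server running should print a list of ids. `None` means nothing answered on that address.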
Python Call (Exactly Like Calling GPT)
from openai import OpenAI

# Change base_url to LM Studio's address, everything else stays the same
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # Fill in anything, it doesn't verify
)

response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",  # Model name can be anything; it will automatically use the currently loaded one
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Explain MoE architecture in one sentence"},
    ],
)
print(response.choices[0].message.content)
# Example output: MoE architecture is like a team of experts, each time only letting the group best suited for the task work, saving time and effort.
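Waiting for the whole answer feels slow, and streaming fixes that. With the `openai` package you pass `stream=True` and iterate over the chunks. The sketch below shows what happens underneath, assuming the server streams in the same `data: {...}` server-sent-event format as OpenAI. The sample lines are canned, not real model output.

```python
import json

def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines.

    lines: an iterable of decoded strings such as 'data: {...}'.
    Stops at the 'data: [DONE]' sentinel.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text

# Canned sample in the shape of a streamed chat completion
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "Mixture"}}]}',
    'data: {"choices": [{"delta": {"content": " of Experts"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_text(sample)))  # Mixture of Experts
```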
curl Call (For Those Who Don't Like Python)
The command below is written for a bash-style shell (Git Bash or WSL). The trailing backslashes and single-quoted JSON don't work as-is in cmd or PowerShell.
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.6-35B-A3B",
"messages": [{"role": "user", "content": "Hello, who are you?"}]
}'
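If shell quoting gets in the way, the same request can be built with Python's standard library alone. A minimal sketch: `build_chat_request` and `send` are illustrative names, and the model name and address are the ones used in this article.

```python
import json
import urllib.request

def build_chat_request(prompt, model="Qwen3.6-35B-A3B",
                       base_url="http://localhost:1234/v1"):
    """Build (but do not send) the same POST request as the curl call."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

req = build_chat_request("Hello, who are you?")
print(req.full_url)  # http://localhost:1234/v1/chat/completions
```

With the server running, `print(send(req))` prints the reply.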
And just like that, my local AI colleague is officially on the job.
Docker Deployment of Open WebUI (Dressing Up the Model in a Nice Outfit)
LM Studio's built-in chat interface is fairly basic, like an unfurnished apartment. If you want a more polished web interface, you can install Open WebUI, which looks like ChatGPT and has more features.
Step 1: Install Docker Desktop
Go to https://www.docker.com/products/docker-desktop/ to download Docker Desktop, install it, and restart your computer.
Step 2: Start LM Studio's API Server
(Same as above, click "Start Server", keep it running)
Step 3: Start Open WebUI with Docker
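Run this in a bash-style shell such as Git Bash or WSL. In PowerShell, replace each trailing \ with a backtick, or put the whole command on one line: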
docker run -d \
--name open-webui \
-p 3000:8080 \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URL=http://host.docker.internal:1234/v1 \
-e OPENAI_API_KEY=lm-studio \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
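A few key points (to save you from pitfalls):
- `host.docker.internal` is the hostname Docker Desktop provides for reaching the host machine from inside a container; don't change it to `localhost`, because inside the container `localhost` means the container itself
- Port `1234` is LM Studio's API port
- Port `3000` is Open WebUI's access port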
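The `localhost` trap is easy to fall into when copying URLs between the host and a container. A tiny sketch of the rewrite, with `url_for_container` as a made-up helper name:

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}

def url_for_container(host_url, docker_host="host.docker.internal"):
    """Rewrite a loopback URL so a container can reach a service on the host."""
    parts = urlsplit(host_url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return host_url  # already points somewhere other than loopback
    netloc = docker_host if parts.port is None else f"{docker_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(url_for_container("http://localhost:1234/v1"))
# http://host.docker.internal:1234/v1
```

The address that works in your browser on the host becomes the value for `OPENAI_API_BASE_URL` inside the container.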
Step 4: Access and Use
- Open your browser and visit: http://localhost:3000
- On first open, you need to register an account (stored locally, fill in anything, don't make the password too complex, just remember it yourself)
- After logging in, confirm in the settings that it's connected to LM Studio
What Makes Open WebUI Better Than LM Studio's Built-in Interface?
| Feature | Description | My Use Case |
|---|---|---|
| 🤖 Multi-Model Switching | Switch between different models in one interface | Load Qwen and Llama simultaneously, compare answers |
| 📝 Chat History | Automatically saved, searchable | Can still find questions I asked last week |
| 📁 File Upload | Upload PDF/Word for AI analysis | Throw company documents in for summarization (data never leaves local) |
| 🧩 Plugin System | Install various extensions | Supports web search, code execution, etc. |
| 🌐 Theme Switching | Dark/Light mode | Switch to dark mode at night to protect eyes |
For example, with the vision-capable gemma-4-26b-a4b-qat I used earlier, I couldn't upload images in LM Studio's chat at all, but Open WebUI handled them.
Common Docker Commands (Keep Here for Reference)
# Check if the container is running
docker ps
# Check logs if there's a problem
docker logs -f open-webui
# Stop it
docker stop open-webui
# Start it again after a stop
docker start open-webui
# Delete the stopped container to start over (data survives in the open-webui volume),
# then re-run the docker run command above to recreate it
docker rm open-webui
Performance Optimization: How to Run My 5060Ti 16GB at Full Power
Recommended Configuration (Adjust in LM Studio's "Model Settings" on the Right)
{
"context_length": 8192, // 8K is enough for daily use; don't enable 262K, it will blow VRAM
"gpu_layers": 35, // Put all layers on GPU for maximum speed
"threads": 8, // My CPU has 8 cores, use them all
"batch_size": 512 // Default value, no need to change
}
Three Tips to Save VRAM (Learned the Hard Way)
- Don't Max Out the Context: Qwen3.6 supports 262K tokens, but that's for multi-GPU rich folks. With 16GB VRAM, stick to 8K-16K, rock solid.
- Close Other VRAM-Hungry Programs: browsers (Chrome especially) and IDEs are all VRAM killers. Before running the model, I close extra Chrome tabs to free up 1-2GB.
- If VRAM Still Explodes, Choose a Lower Quantization: Q4_K_M needs about 16GB. If it still blows up, try Q3_K_M (about 12GB). Quality drops slightly, but there's room to breathe and it runs more smoothly.
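Why does context length eat VRAM? On top of the weights, a standard transformer keeps a key/value cache that grows linearly with the number of tokens. The sketch below shows that scaling. The layer count, KV head count, and head size are HYPOTHETICAL placeholders, not Qwen3.6's real configuration, and models with other attention designs or a quantized cache will differ.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard transformer KV cache: keys and values for every layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return context_tokens * per_token

# HYPOTHETICAL architecture numbers, chosen only to show the scaling
shape = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for ctx in (8_192, 16_384, 262_144):
    gib = kv_cache_bytes(ctx, **shape) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.3f} GiB")
# 8192 tokens -> 0.625 GiB, 16384 -> 1.250 GiB, 262144 -> 20.000 GiB
```

Under these made-up numbers, going from 8K to 262K multiplies the cache by 32. That's why a card already full of weights has no chance at the maximum context.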
What Can I Do with It? (More Than Just Chatting)
1. Programming Assistant (My Most Common Use)
# Ask it to write code; it never complains about requirement changes
# (reuses the client created in the Python call above)
response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Write a quicksort in Python with comments"}
    ],
)
print(response.choices[0].message.content)
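Models usually wrap code in markdown fences with chatter around it. If you only want the code, a small sketch like this pulls it out. The `reply` string is a hypothetical stand-in for `response.choices[0].message.content`.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

# Opening fence with an optional language tag, then everything up to the closing fence
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(reply):
    """Return the contents of every fenced code block in a markdown reply."""
    return [block.strip("\n") for block in CODE_BLOCK.findall(reply)]

# Hypothetical reply text
reply = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nLet me know if you want tests."
)
print(extract_code_blocks(reply))
```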
2. Private Document Q&A (Company Code Stays In-House)
Feed it internal company documents, ask anything, data never leaves the local server.
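The simplest version of document Q&A is to paste the document into the prompt. A minimal sketch: `build_doc_messages` is an illustrative helper, and `max_chars` is a crude stand-in for a real token budget, so tune it to the context length you configured.

```python
def build_doc_messages(document, question, max_chars=20_000):
    """Build chat messages that ask a question about one document.

    Text beyond max_chars is cut so the prompt stays within the context window.
    """
    excerpt = document[:max_chars]
    truncated = len(document) > max_chars
    system = "Answer using only the document provided. Say so if the answer is not in it."
    user = f"Document:\n{excerpt}\n\nQuestion: {question}"
    if truncated:
        user += "\n\n(Note: the document was truncated to fit the context window.)"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_doc_messages("Vacation policy: 15 days per year.", "How many vacation days?")
print(messages[1]["content"])
```

Pass the result as `messages=` to the same `chat.completions.create` call used earlier. For documents too long to fit, you'd split them into chunks, which is what Open WebUI's file upload handles for you.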
3. Integrate with LangChain (Get Creative)
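# pip install langchain-openai (ChatOpenAI now lives in this package)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any non-empty string; otherwise it looks for OPENAI_API_KEY
    model="Qwen3.6-35B-A3B",
)
print(llm.invoke("Explain MoE architecture in one sentence").content)
# Then it's exactly like calling GPT, but free, private, and unlimited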
4. Sensitive Data Processing (Medical/Legal/Financial)
Can't upload client data to the cloud? Local model solves it perfectly. A friend of mine in medical IT uses this exact setup for processing medical record summaries — compliant and secure.
Frequently Asked Questions (All Pitfalls I've Encountered)
Q1: VRAM Exploded, What to Do?
- Switch to Q3_K_M version (about 12GB)
- Reduce `context_length` from 8192 to 4096
- Close all Chromium browsers (they eat VRAM like water)
Q2: Response Speed Is as Slow as a Snail?
- Check whether the bottom bar of LM Studio shows `GPU Layers: 0/35`; if so, it's not using the GPU
- Go to "Settings" and confirm GPU is checked
- Increase `threads` (equal to your CPU core count)
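"Slow as a snail" is easier to fix with a number attached. A minimal sketch: time one request with `time.perf_counter()`, read `response.usage.completion_tokens` from the reply (assuming the server fills in the OpenAI-style `usage` field), and divide. The measurement below is hypothetical.

```python
import os

def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generation speed; returns 0.0 when no time was measured."""
    if elapsed_seconds <= 0:
        return 0.0
    return completion_tokens / elapsed_seconds

# Hypothetical measurement: 256 generated tokens in 8 seconds
print(f"{tokens_per_second(256, 8.0):.1f} tok/s")  # 32.0 tok/s

# Logical core count, a starting point for the threads setting
print("logical CPU cores:", os.cpu_count())
```

The elapsed time includes prompt processing, so compare runs with similar prompts. A sudden drop after changing a setting usually means layers fell back to the CPU.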
Q3: Open WebUI Can't Connect to LM Studio?
- Confirm LM Studio's "Server" is green (Running)
- Confirm port 1234 is not occupied by another program
- Use `host.docker.internal` inside the Docker container, not `localhost`
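A quick way to tell whether anything is listening on the port at all. This is a minimal sketch with a made-up helper name, and it only proves that something accepts connections, not that it's LM Studio.

```python
import socket

def is_port_open(host="127.0.0.1", port=1234, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout
        return False
```

On the host, `print(is_port_open())` should be `True` while the server is running. From inside the container, check `is_port_open("host.docker.internal", 1234)` instead.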
Summary: My 5060Ti Has Become an All-Round Worker
Today, using LM Studio + Qwen3.6-35B-A3B, we unlocked another skill for the 5060Ti:
- ✅ Zero-Barrier Local Deployment: LM Studio is ready out of the box; even a lazy person like me can run it
- ✅ MoE Architecture Saves Compute: 35 billion parameters, only 3 billion activated per token, and the Q4_K_M build squeezes into 16GB
- ✅ Compatible with OpenAI API: switching from cloud to local only means changing base_url and api_key
- ✅ Open WebUI Enhancement: Beautiful interface, powerful features, file upload support
- ✅ Data Privacy: Sensitive information never leaves the local machine, compliant and worry-free
Now my 5060Ti can draw, do voiceovers, and chat. Next time, should I teach it to write code? Oh wait, it's already helping me write code right now... Does that mean I'm going to be unemployed? 🤔
Code for this chapter — take it, no thanks needed.
Preview of Next Issue: Maybe I'll run video generation on the 5060Ti, or set up a multimodal model so it can "look at pictures and talk"? Anyway, I've already bought the graphics card and spent the money; I have to squeeze every drop of computing power out of it! 💪
Reference Links (Really Useful):
- LM Studio Official Website: https://lmstudio.ai/
- Qwen3.6-35B-A3B Model: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- ModelScope Model: https://www.modelscope.cn/models/LLM-Research/Qwen3.6-35B-A3B
One Last Heartfelt Piece of Advice: No matter how powerful a local model is, it can't write a sentence like "I love you" — it will only give you a love poem and then ask, "Would you like to optimize it further?" So, don't fear unemployment; you still have a heart that can be moved, unlike AI. ❤️