跪拜 Guibai

Your RTX 5060 Ti Is Now a Local AI Colleague: Running Qwen3.6-35B-A3B with LM Studio and Open WebUI

by 雪隐_上班了 from juejin.cn/user/143341…

> Welcome to share and aggregate, but please don't repost the full article. Respect copyright. The community is small. If you need it urgently, contact me for authorization.


Preface: My 5060Ti Is Becoming More and More Versatile — It's Now My Colleague

What did we talk about last time? First time: 5060Ti learned to draw (FLUX.2 Klein), so now the product manager no longer chases me to change images. Second time: 5060Ti learned to mimic human speech (Qwen3-TTS). After I cloned my boss's voice, whatever I say goes.

This time it's even more impressive — I taught it to chat with itself.

That's right, local large language model (LLM) deployment turns my 5060Ti 16GB into an AI colleague that can converse anytime. No internet required, no API costs, no worry about sensitive data being sent to the cloud. Chat as long as you want, and it never gets annoyed.

This time I'm using LM Studio + Qwen3.6-35B-A3B, a big MoE (Mixture of Experts) model: 35 billion total parameters, but only about 3 billion activated for each token. It's like a strongman using only a fraction of his strength, which keeps the computation light.

And my 5060Ti 16GB is just enough to feed it (the quantized Q4_K_M version is about 16GB, right at the edge of full VRAM — exciting, right?).


Project Overview (Check Out My "Cyber Colleague's" Configuration)

Hardware Environment (My Precious Computer)

| Configuration | My Actual Situation | Comments |
|---|---|---|
| GPU | RTX 5060 Ti 16GB | It hurt to buy it, but the more I use it, the more I feel it's worth it |
| RAM | 32GB (DDR5, I bit the bullet) | AI eats RAM like a starving person at a buffet; 32GB barely cuts it |
| Storage | 50GB free | I deleted a bunch of study materials to make room for the model (don't ask what kind) |
| System | Windows 11 | As stable as my habit of slacking off on time every day |

Software Environment (All Free Tools, Kudos)

| Software | Description | My Review |
|---|---|---|
| LM Studio | The ultimate local model deployment tool, ready out of the box | A thousand times easier to set up than a Python environment; just double-click and run |
| GPU Driver | A current NVIDIA driver that supports the RTX 50 series | Just install it, no need to fiddle with CUDA; LM Studio handles everything |
| CUDA | Built into LM Studio | It installs itself quietly; you don't even need to know where it is |

About the Model: Qwen3.6-35B-A3B, the "Saver"

This is an open-source MoE (Mixture of Experts) model from Alibaba's Tongyi Lab. In plain English:

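It doesn't make all 35 billion employees work together. Instead, for each token, a router wakes up only the few experts best suited to it (about 3 billion parameters' worth), while the rest stay asleep. Hence the term "sparse activation": less computation per token, so it's fast and power-friendly. The catch is that every employee still needs a desk: all 35 billion parameters have to be loaded into memory, so MoE saves compute, not VRAM.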

| Parameter | Value | Translation |
|---|---|---|
| Total Parameters | 35 billion | Total company employees |
| Activated Parameters | 3 billion | Only these people are woken up each time |
| Architecture | Sparse Mixture of Experts (MoE) | Each expert has their own specialty |
| Context Length | 262K tokens | Equivalent to reading the entire "Three-Body Problem" trilogy in one go while remembering the details |

How great is the MoE architecture? Roughly the knowledge capacity of 35 billion parameters, at roughly the per-token computation of 3 billion. My 5060Ti 16GB runs it without the fans even speeding up.
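
To make the "only wake up a few experts" idea concrete, here is a toy sketch of top-k routing in plain Python. The expert count (8), the top-2 choice, and the "experts" themselves (simple multipliers) are made up for illustration; they are not Qwen3.6's real configuration, and a real model does this with tensors inside every MoE layer.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_scores)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


def moe_layer(token, experts, router_scores, top_k):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    weights = route_token(router_scores, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())


if __name__ == "__main__":
    random.seed(0)
    # Toy "experts": each one just scales its input by a different factor.
    experts = [lambda x, k=k: x * k for k in range(1, 9)]
    scores = [random.uniform(-1, 1) for _ in experts]
    weights = route_token(scores, top_k=2)
    print("active experts:", sorted(weights), "of", len(experts))
    print("layer output:", moe_layer(1.0, experts, scores, top_k=2))
```

Note that all eight experts still sit in the `experts` list the whole time. Only two are called per token, which is the compute saving, but nothing is freed from memory.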


How to Choose the Quantization Version? Look at Your VRAM, Don't Be Greedy

Model files come in different "quantization levels," like compressing an image: more compression means slightly lower quality but much smaller size.

| Quantization Level | VRAM Requirement | Quality | Suitable For |
|---|---|---|---|
| Q4_K_M | ~16GB | High (basically imperceptible loss) | 5060Ti users like me, right at the edge of full VRAM, very exciting to use |
| Q5_K_S | ~18GB | Very high | Rich folks with 20GB+ VRAM |
| Q8_0 | ~22GB | Near lossless | Go for it, big spenders |

I chose Q4_K_M — 16GB VRAM is almost fully utilized, but inference speed is still smooth. Every time I see VRAM: 15.8GB / 16.0GB at the bottom of LM Studio, I feel like I'm walking a tightrope, but I never fall — it's awesome.
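
If you'd rather let code do the choosing, the table above boils down to "take the biggest quantization that still fits". A minimal sketch, assuming the VRAM figures from my table (they are my estimates, not measurements); the function name `pick_quant` is just something I made up for this example, and it ignores the extra VRAM that the context window needs.

```python
# VRAM figures copied from the table above (GB); they are the article's
# estimates, not measurements.
QUANT_VRAM_GB = {"Q4_K_M": 16, "Q5_K_S": 18, "Q8_0": 22}


def pick_quant(vram_gb, table=QUANT_VRAM_GB):
    """Return the largest quantization whose requirement fits, or None."""
    fitting = [(need, name) for name, need in table.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for vram in (12, 16, 20, 24):
        print(f"{vram}GB VRAM -> {pick_quant(vram)}")
```

With 16GB it returns Q4_K_M, and with 12GB it returns None, which is your cue to look at a smaller quantization or a smaller model.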


What is LM Studio? In One Sentence: The "Point-and-Shoot Camera" for Your GPU

Why Use It?

| Feature | Meaning for a Lazy Person Like Me |
|---|---|
| 🎯 Ready Out of the Box | Download → Double-click to install → Open → Select model → Start chatting. No command line needed at all |
| 🔧 Ridiculously Simple | The interface has just a few buttons, simpler than Word |
| 🔥 GPU Acceleration | Automatically detects my 5060Ti, enables it with one click, no need to set CUDA_VISIBLE_DEVICES |
| 🌐 Local API | Provides an OpenAI-compatible API, so my code can switch seamlessly |
| 📁 Built-in Model Downloader | Search for models within the software, click to download, no need to browse the web |

Installation Steps (Really Just Three Steps, I Swear)

Step 1: Download LM Studio

Visit the official website, https://lmstudio.ai/, and click to download the Windows version. The installer is about 200MB, faster than downloading a game patch.

The installation process is just "Next → Next → I Agree → Finish" all the way, no pitfalls.

Step 2: Download the Model (Two Methods, Recommend the Second)

Method 1: LM Studio Built-in Downloader (Suitable for Good Internet)

  1. Open LM Studio, click the "Search" icon on the left
  2. Search for Qwen3.6-35B
  3. In the results, find the Q4_K_M version (check the file size, about 16GB)
  4. Click Download, then go get a coffee and wait for it to finish

Method 2: Download via ModelScope (Suitable for China, Super Fast)

I highly recommend this because HuggingFace is as slow as a snail in China:

pip install modelscope
modelscope download --model LLM-Research/Qwen3.6-35B-A3B-GGUF --local_dir ./models

The download is about 16GB, which took me roughly as long as waiting for one takeout delivery (about 20 minutes).

Then drag the downloaded .gguf file into LM Studio's model folder (or just drag it into the software window, it recognizes it).
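
Before dragging anything around, it helps to confirm which file actually landed on disk and how big it is. A small sketch using only the standard library; it assumes the `./models` directory from the download command above and that the file name contains the quantization tag (the usual GGUF naming habit, but not guaranteed). `find_gguf` is a helper name invented for this example.

```python
from pathlib import Path


def find_gguf(models_dir, quant="Q4_K_M"):
    """Return (path, size_in_GiB) for every .gguf file whose name mentions quant."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.rglob("*.gguf")):
        if quant.lower() in path.name.lower():
            hits.append((path, path.stat().st_size / 1024**3))
    return hits


if __name__ == "__main__":
    # "./models" matches the --local_dir used in the download command.
    for path, size_gib in find_gguf("./models"):
        print(f"{path}  ({size_gib:.1f} GiB)")
```

If the size printed here is far from what you expected, the download was probably interrupted, and it's cheaper to find out now than after a failed load.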

Step 3: Load the Model and Start Chatting

  1. In the middle area of LM Studio, click to load, or just drag the .gguf file in
  2. Wait about 1-2 minutes (it's loading the model into VRAM)
  3. When you see Qwen3.6-35B-A3B-Q4_K_M - 15.8GB/16.0GB at the bottom, it's successful
  4. Type in the input box below, press Enter, and it starts answering

Shortcuts (Just remember these two):


Local API Service: Let My Code Tease It Too

LM Studio comes with an API server compatible with OpenAI, meaning I can use any code that calls OpenAI and seamlessly switch to the local model.

Start the API Server (One-Click Operation)

  1. Click the "Server" icon on the left (looks like a plug)
  2. Click the "Start Server" button
  3. See the address: http://localhost:1234/v1

That's it? That's it. No need to configure environment variables, no uvicorn, one click and it's done.
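
A quick way to confirm the server is really up is to ask it which models it has. The sketch below uses only the standard library and assumes the default address shown above; `/v1/models` is part of the OpenAI-style API that LM Studio mimics. The helper names are mine.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # address shown by LM Studio's server tab


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /models response body."""
    return [item["id"] for item in payload.get("data", [])]


def list_models(base_url=BASE_URL, timeout=3):
    """Return the ids the server reports, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    ids = list_models()
    print("server not reachable" if ids is None else f"models: {ids}")
```

If it prints "server not reachable", go back and check that the Start Server button was actually clicked.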

Python Call (Exactly Like Calling GPT)

from openai import OpenAI

# Change base_url to LM Studio's address, everything else stays the same
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # Fill in anything, it doesn't verify
)

response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",  # Model name can be anything; it will automatically use the currently loaded one
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Explain MoE architecture in one sentence"}
    ]
)

print(response.choices[0].message.content)
# Output: MoE architecture is like a team of experts, each time only letting the group best suited for the task work, saving time and effort.
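
The call above is a single question and answer. The API itself is stateless, so for a real back-and-forth you have to resend the history every time. Here is a minimal sketch of that pattern without the openai package; `ChatSession` and `post_chat` are names I made up, and the address and model name are the same assumptions as in the example above.

```python
import json
import urllib.request


def post_chat(messages, base_url="http://localhost:1234/v1", model="Qwen3.6-35B-A3B"):
    """POST the conversation to /chat/completions and return the reply text."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


class ChatSession:
    """Keeps the running message list so each request carries the full history."""

    def __init__(self, system_prompt, send=post_chat):
        self.messages = [{"role": "system", "content": system_prompt}]
        self.send = send

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = self.send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Usage (needs the LM Studio server running):
#   chat = ChatSession("You are a helpful assistant")
#   print(chat.ask("Explain MoE architecture in one sentence"))
#   print(chat.ask("Now explain it to a five-year-old"))
```

The `send` parameter is there so the session logic can be tried with a fake function before the model is even loaded. Keep in mind the history grows with every turn, and so does the context the model has to hold.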

curl Call (For Those Who Don't Like Python)

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'

And just like that, my local AI colleague is officially on the job.


Docker Deployment of Open WebUI (Dressing Up the Model in a Nice Outfit)

LM Studio's built-in chat interface is fairly basic, like an unfurnished apartment. If you want a more polished web interface, you can install Open WebUI, which looks like ChatGPT and has more features.

Step 1: Install Docker Desktop

Go to https://www.docker.com/products/docker-desktop/ to download Docker Desktop, install it, and restart your computer.

Step 2: Start LM Studio's API Server

(Same as above, click "Start Server", keep it running)

Step 3: Start Open WebUI with Docker

Execute in the command line:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:1234/v1 \
  -e OPENAI_API_KEY=lm-studio \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

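A few key points (to save you from pitfalls):

  1. host.docker.internal is how the container reaches the host machine. Inside the container, localhost means the container itself, so OPENAI_API_BASE_URL must not point at localhost:1234.
  2. -p 3000:8080 maps the container's port 8080 to port 3000 on the host, which is why the browser address in Step 4 uses 3000.
  3. -v open-webui:/app/backend/data stores accounts and chat history in a named volume, so they survive deleting and recreating the container.
  4. The trailing backslashes are bash-style line continuations (Git Bash, WSL). In PowerShell, replace each one with a backtick or put the whole command on one line.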

Step 4: Access and Use

  1. Open your browser and visit: http://localhost:3000
  2. On first open, you need to register an account (stored locally, fill in anything, don't make the password too complex, just remember it yourself)
  3. After logging in, confirm in the settings that it's connected to LM Studio
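
If the page doesn't open or no model shows up, the first thing I'd check is whether both services are listening at all. A rough sketch with the standard library; the two ports come from the steps above, and `port_open` is a helper name invented here. It only tests from the host side, so it can't tell you whether the container can reach LM Studio, but it rules out the most common cause (one of the two simply isn't running).

```python
import socket


def port_open(host, port, timeout=2.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the steps above: LM Studio's API and the Open WebUI mapping.
SERVICES = {"LM Studio API": 1234, "Open WebUI": 3000}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("localhost", port) else "DOWN"
        print(f"{name:14} localhost:{port}  {state}")
```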

What Makes Open WebUI Better Than LM Studio's Built-in Interface?

Feature Description My Use Case
🤖 Multi-Model Switching Switch between different models in one interface Load Qwen and Llama simultaneously, compare answers
📝 Chat History Automatically saved, searchable Can still find questions I asked last week
📁 File Upload Upload PDF/Word for AI analysis Throw company documents in for summarization (data never leaves local)
🧩 Plugin System Install various extensions Supports web search, code execution, etc.
🌐 Theme Switching Dark/Light mode Switch to dark mode at night to protect eyes

Interface screenshot, looks very advanced

For example, with the vision-capable gemma-4-26b-a4b-qat I used previously, I couldn't upload images in LM Studio at all, but Open WebUI handled them.


Common Docker Commands (Keep Here for Reference)

# Check if the container is running
docker ps

# Check logs if there's a problem
docker logs -f open-webui

# Stop it
docker stop open-webui

# Restart a stopped container
docker start open-webui

# Delete the stopped container to start over (data survives in the open-webui volume);
# afterwards, re-run the docker run command above to recreate it
docker rm open-webui

Performance Optimization: How to Run My 5060Ti 16GB at Full Power

Recommended Configuration (Adjust in LM Studio's "Model Settings" on the Right)

{
  "context_length": 8192,    // 8K is enough for daily use; don't enable 262K, it will blow VRAM
  "gpu_layers": 35,          // Put all layers on GPU for maximum speed
  "threads": 8,              // My CPU has 8 cores, use them all
  "batch_size": 512          // Default value, no need to change
}

Three Tips to Save VRAM (Learned the Hard Way)

  1. Don't Max Out the Context. Qwen3.6 supports 262K tokens, but that's for multi-GPU rich folks. With 16GB VRAM, stick to 8K-16K, rock solid.

  2. Close Other VRAM-Hungry Programs. Browsers (Chrome especially) and IDEs are all VRAM killers. Before running the model, I close extra Chrome tabs to free up 1-2GB.

  3. If VRAM Still Explodes, Choose a Lower Quantization. Q4_K_M is about 16GB. If it still blows, try Q3_K_M (about 12GB). Quality drops slightly, but everything runs more smoothly.
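
Why does context length eat VRAM? On top of the weights, the model keeps a key/value cache that grows linearly with the number of tokens. The sketch below uses the textbook formula for a standard full-attention transformer. The layer count, KV head count, and head size are hypothetical numbers picked for round results, not Qwen3.6's real configuration, and models that mix in other attention types will need less.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys + values stored for every layer and every token in the context."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical shape for illustration only; NOT Qwen3.6's real configuration.
SHAPE = dict(n_layers=40, n_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    for tokens in (8_192, 16_384, 262_144):
        gib = kv_cache_bytes(tokens, **SHAPE) / 1024**3
        print(f"{tokens:>7} tokens -> {gib:.3f} GiB of KV cache")
```

With these made-up numbers, 8K tokens cost 0.625 GiB while 262K tokens cost 20 GiB, more than the whole card. The exact figures will differ for the real model, but the linear growth is the point.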


What Can I Do with It? (More Than Just Chatting)

1. Programming Assistant (My Most Common Use)

# Ask it to write code; it never complains about requirement changes
# (reuses the client object created in the Python call above)
response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Write a quicksort in Python with comments"}
    ]
)
print(response.choices[0].message.content)
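
The reply usually comes back as Markdown, with the code wrapped in triple-backtick fences and some chatter around it. If you want just the code (to save it to a file, say), a small regex does the job. This is a sketch that assumes the model fences its code, which is common but not guaranteed; `extract_code` is a name I made up.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
CODE_BLOCK = re.compile(FENCE + r"([^\n]*)\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply):
    """Return a list of (language, code) pairs found in a markdown reply."""
    return [
        (lang.strip() or "text", code.strip())
        for lang, code in CODE_BLOCK.findall(reply)
    ]


if __name__ == "__main__":
    sample = "Sure!\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nHope that helps."
    for lang, code in extract_code(sample):
        print(f"[{lang}]")
        print(code)
```

In practice you'd pass `response.choices[0].message.content` in place of `sample`.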

2. Private Document Q&A (Company Code Stays In-House)

Feed it internal company documents, ask anything, data never leaves the local server.
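
If you want to do this from code rather than through a UI, the basic recipe is: cut the document into chunks, pick the chunks most relevant to the question, and paste them into the prompt. The sketch below uses plain keyword overlap as a stand-in for real retrieval (proper setups typically use embeddings), so treat it as an illustration of the flow rather than a production approach. All function names are hypothetical.

```python
import re


def chunk_text(text, max_words=120):
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _word_set(text):
    """Lowercased set of the words in text."""
    return set(re.findall(r"\w+", text.lower()))


def score(chunk, question):
    """Count how many distinct question words also appear in the chunk."""
    return len(_word_set(chunk) & _word_set(question))


def build_messages(document, question, top_n=2, max_words=120):
    """Pick the top_n most relevant chunks and wrap them in a chat prompt."""
    chunks = chunk_text(document, max_words)
    best = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:top_n]
    context = "\n---\n".join(best)
    return [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The returned list can go straight into the `messages` argument of the chat call shown earlier. Because only a few chunks are sent, this also keeps the context small enough for 16GB.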

3. Integrate with LangChain (Get Creative)

from langchain_openai import ChatOpenAI  # pip install langchain-openai

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # Any non-empty string; LM Studio doesn't verify it
    model="Qwen3.6-35B-A3B"
)

# Then it's exactly like calling GPT, but free, private, and unlimited

4. Sensitive Data Processing (Medical/Legal/Financial)

Can't upload client data to the cloud? Local model solves it perfectly. A friend of mine in medical IT uses this exact setup for processing medical record summaries — compliant and secure.


Frequently Asked Questions (All Pitfalls I've Encountered)

Q1: VRAM Exploded, What to Do?

Q2: Response Speed Is as Slow as a Snail?

Q3: Open WebUI Can't Connect to LM Studio?


Summary: My 5060Ti Has Become an All-Round Worker

Today, using LM Studio + Qwen3.6-35B-A3B, we unlocked another skill for the 5060Ti:

✅ Zero-Barrier Local Deployment: LM Studio is ready out of the box; even a lazy person like me can run it
✅ MoE Architecture Saves Compute: 35 billion parameters, only about 3 billion activated per token, and the Q4_K_M build squeezes onto 16GB
✅ Compatible with OpenAI API: Change the base_url and the rest of the code stays the same when switching from cloud to local
✅ Open WebUI Enhancement: Beautiful interface, powerful features, file upload support
✅ Data Privacy: Sensitive information never leaves the machine, compliant and worry-free

Now my 5060Ti can draw, do voiceovers, and chat. Next time, should I teach it to write code? Oh wait, it's already helping me write code right now... Does that mean I'm going to be unemployed? 🤔

Code for this chapter — take it, no thanks needed.


Preview of Next Issue: Maybe I'll run video generation on the 5060Ti, or set up a multimodal model so it can "look at pictures and talk"? Anyway, I've already bought the graphics card and spent the money; I have to squeeze every drop of computing power out of it! 💪




One Last Heartfelt Piece of Advice: No matter how powerful a local model is, it can't write a sentence like "I love you" — it will only give you a love poem and then ask, "Would you like to optimize it further?" So, don't fear unemployment; you still have a heart that can be moved, unlike AI. ❤️