The Three-Step Training Pipeline That Turns a Raw LLM Into a Deployable Specialist
How Are Large Models Trained?
From general pre-training to LoRA customized fine-tuning, combined with practical experience using Llama Factory and Ollama, this article breaks down the underlying principles of large models in plain language: vector embeddings, decoder architecture, multi-head attention, batch size and learning rate parameters. It explains the complete training path of a large model from a "general-purpose base" to "usable in production."
We previously discussed how the Chain-of-Thought (CoT) technique teaches large models to think step by step, and how the ReAct framework lets them reason while acting and call tools to get work done. After reading, many people asked: where do these capabilities actually come from? Does a large model know everything from the start?
Recently, I used Llama Factory to run LoRA fine-tuning and then quantized and deployed the result with Ollama, completing the whole path from a base model to a customized, deployable solution. Today I'll draw on that hands-on experience to explain the "training journey" of a large model in plain language and, along the way, unpack underlying concepts that are often mentioned but hard to grasp, such as vectors, multi-head attention, and decoders.
1. Step One: Pre-training — Laying the "General Knowledge Foundation" for the Large Model
The base models we often talk about, such as Llama and Qwen (Tongyi Qianwen), must first undergo large-scale pre-training. The goal at this stage is not to make the model proficient in a specific domain, but to teach it "human language patterns and common-sense cognition."
The Essence of Pre-training: Learning "Word Chain" (Next-Word Prediction) from Massive Text
Consistent with what we've discussed before, the core capability of a large model is "next-word prediction." During the pre-training phase, researchers feed trillions of tokens of text data to the model: books, web pages, papers, code, conversation logs—everything is included. The model doesn't need to "understand" the meaning of the text; it only needs to do one thing: based on the preceding text, calculate the probability of the next word/token appearing. The more text it learns from, the more accurate its grasp of language patterns becomes, and gradually it can generate fluent sentences that conform to human expression habits.
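To make "calculate the probability of the next token" concrete, here is a minimal Python sketch. The four candidate tokens and their scores are made up for illustration (a real model scores its entire vocabulary with a neural network); only the softmax and cross-entropy steps mirror how a prediction is typically turned into probabilities and scored during pre-training.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical scores for the context "The cat sat on the".
logits = {"mat": 4.0, "sofa": 3.0, "roof": 2.0, "banana": -1.0}
probs = softmax(logits)

# Cross-entropy loss: low when the true next token gets high probability.
target = "mat"
loss = -math.log(probs[target])

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>7}: {p:.3f}")
print(f"loss for true token '{target}': {loss:.3f}")
```

Pre-training repeats this over a huge corpus, nudging the parameters each time so that the loss on the true next token shrinks.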
Three Unavoidable Core Concepts
Many concepts people have heard of—vectors, decoders, multi-head attention—are all shaped during this stage and form the foundation for a large model's ability to "comprehend semantics."
1. Vectors (Embeddings): The "Semantic Coordinates" of Text
Machines cannot read text; they can only process numbers. Embedding is the process of converting every word or passage into a vector: a list of numbers that acts as a coordinate carrying semantic features. For example, the word "apple" is mapped to a point whose coordinates reflect features like "fruit, red, healthy, tech company"; the closer two words are in meaning, the smaller the distance between their vectors. The "semantic retrieval" we mentioned earlier when discussing RAG knowledge bases essentially relies on vector distance to match similar content. Pre-training continuously refines this coordinate system, making its semantic judgments increasingly accurate.
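A toy illustration of "closer semantics, closer vectors." The three-dimensional vectors below are hand-written assumptions, not the output of any real embedding model (real embeddings have hundreds or thousands of dimensions), but the cosine-similarity comparison is the kind of calculation a vector search performs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings (illustrative values only).
embeddings = {
    "apple": [0.9, 0.8, 0.1],
    "banana": [0.8, 0.9, 0.0],
    "laptop": [0.1, 0.2, 0.9],
}


def most_similar(query, vocab):
    """Return the other word whose vector is closest to the query's vector."""
    others = (word for word in vocab if word != query)
    return max(others, key=lambda w: cosine_similarity(vocab[query], vocab[w]))


print(most_similar("apple", embeddings))
```

With these toy values, "apple" lands next to "banana" rather than "laptop", which is the behavior semantic retrieval depends on.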
2. Decoder: The Core Engine for Word-by-Word Generation
The GPT, Llama, and Qwen models we commonly use all belong to the "decoder-only" architecture, which is the core reason they can fluently generate long texts. The decoder's logic is simple: starting from the prompt, each time it generates a new token it takes everything produced so far as context and predicts the most likely next token, repeating this cycle until it emits an end marker or reaches a length limit. It's like writing an article: we look back at what we have already written to keep the logic coherent. The decoder does exactly this, only at machine speed.
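The generate-one-token-then-feed-it-back loop can be sketched in a few lines of Python. The `next_token_probs` function here is a hypothetical stand-in for a trained model: it is a tiny hand-written lookup table that only looks at the last token, whereas a real decoder attends to the whole context. The loop around it (predict, pick, append, repeat) is the part that carries over.

```python
def next_token_probs(context):
    """Hypothetical stand-in for a trained model (toy lookup on the last token)."""
    table = {
        "<bos>": {"large": 0.7, "small": 0.3},
        "large": {"models": 0.8, "tokens": 0.2},
        "models": {"predict": 0.9, "<eos>": 0.1},
        "predict": {"tokens": 0.6, "models": 0.4},
        "tokens": {"<eos>": 0.9, "predict": 0.1},
    }
    return table[context[-1]]


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: always append the most probable token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # condition on everything so far
        best = max(probs, key=probs.get)   # greedy choice
        if best == "<eos>":                # stop at the end-of-sequence marker
            break
        tokens.append(best)                # the new token becomes context
    return tokens


print(" ".join(generate(["<bos>"])[1:]))
```

Real deployments usually sample from the distribution (temperature, top-p) instead of always taking the top token, but the loop structure stays the same.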
3. Multi-Head Attention: Grasping Multiple Key Points Simultaneously
This is the key design that allows large models to understand context precisely. In simple terms, "attention" means that when the model generates the next word, it focuses on specific parts of the preceding content. "Multi-head" means running multiple sets of attention simultaneously, with each set focusing on a different dimension. For example, if a user says, "My food delivery was missing items, I want to request a refund":
- The first attention head focuses on "food delivery," anchoring the scenario;
- The second focuses on "missing items," identifying the problem type;
- The third focuses on "refund," capturing the user's intent.
Working in parallel, these heads let the model grasp several key pieces of information at once, leading to a more comprehensive and accurate understanding than a judgment based on a single word. This is an idealized picture: real heads learn their own specializations during training and are rarely this cleanly interpretable.
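For readers who want to see the mechanics, below is a minimal pure-Python sketch of multi-head attention with a causal mask. It makes several simplifying assumptions: the weights are random rather than trained, the helper names are my own, and the final output projection that real Transformer layers apply after concatenating the heads is omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head, with a causal mask."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(weights, v), weights


def multi_head_attention(x, heads):
    """Run every head on the same input, then concatenate outputs per token."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
x = random_matrix(seq_len, d_model, rng)  # one row per token
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]         # (w_q, w_k, w_v) per head
out = multi_head_attention(x, heads)
print(len(out), "tokens x", len(out[0]), "features")
```

Each head has its own query, key, and value matrices, which is what lets different heads weight the same sentence differently.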
After pre-training is complete, we have a "general-purpose base model." It's like a well-read graduate—broad in knowledge and logically coherent, but without specialized skills. It often lacks precision in vertical domains and is prone to hallucination.
2. Step Two: LoRA Fine-tuning — Training the General Model into a "Professional Specialist"
A base model can chat and write copy, but when placed in specific scenarios like intelligent customer service, industry Q&A, or code assistance, its performance often falls short. This is where "fine-tuning" comes in: using data from a specific domain to give the model "targeted training."
Why Choose LoRA? A Low-Cost Personal Fine-tuning Solution
Full-parameter fine-tuning updates all of a model's parameters and can require many professional GPUs, which is unaffordable for individuals and small teams. LoRA (Low-Rank Adaptation) is one of the most widely used lightweight fine-tuning methods: it freezes the base model's original weights and trains only a small set of low-rank "bypass" matrices added alongside them, enabling the model to learn new domain knowledge and speaking styles. Its advantages are clear:
- Low VRAM usage: A consumer-grade graphics card can fine-tune 7B- or 14B-class models;
- Fast training speed: Results can be achieved in a few hours with just a few hundred data entries;
- Does not damage the base: LoRA files are stored separately and can be loaded or swapped at any time, allowing one base model to serve multiple scenarios.
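The "bypass" idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Llama Factory's implementation: a frozen weight matrix `w` is left untouched, and a trainable pair of small matrices `a` and `b` adds a scaled low-rank correction. The toy dimensions, the `alpha / rank` scaling, and the zero initialization of `b` follow the usual LoRA formulation; all variable names are my own.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, w, a, b, alpha):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    rank = len(a)
    base = matvec(w, x)              # frozen path through the base weights
    delta = matvec(b, matvec(a, x))  # low-rank bypass path
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


rng = random.Random(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]                              # trainable

x = [1.0] * d_in
print(lora_forward(x, w, a, b, alpha))
print(matvec(w, x))
```

Because `b` starts at zero, the two printed vectors match: before any training, the adapted model behaves exactly like the base. Removing or swapping the `a`/`b` pair restores or changes the behavior without ever touching `w`.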
This time, I used the Llama Factory tool to perform LoRA fine-tuning on an open-source Chinese model for a customer service scenario. The entire process required no writing of training code from scratch, allowing me to focus solely on tuning parameters and preparing the dataset.
How to Tune Core Fine-tuning Parameters? Explained Through Practice
Many people new to fine-tuning are overwhelmed by a bunch of parameters. In reality, there are only three core ones, and adjusting them based on the scenario can yield good results.
1. Batch Size
Simply put, this is how many data entries are fed to the model at one time during training.
- Large batch: Training is more stable, gradients are more accurate, but it consumes a lot of VRAM;
- Small batch: Saves VRAM and makes each step cheaper, but gradients are noisier, so training can fluctuate and drift off track.
In practice, I tried values from 2 all the way up to 8 and ultimately chose 4 on my graphics card, balancing training stability and VRAM usage. If VRAM is tight, you can also enable gradient accumulation, trading time for space.
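Why gradient accumulation "trades time for space" can be checked with a small Python sketch. It assumes a toy one-parameter linear model with made-up data points; the point is only that averaging the gradients of two micro-batches of 4 reproduces the gradient of one batch of 8, while holding only 4 samples in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs, roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

# One large batch of 8: needs all 8 samples in memory at once.
full_grad = grad_mse(w, data)

# Two micro-batches of 4, accumulated before a single parameter update.
micro_batch_size = 4
accum_steps = len(data) // micro_batch_size
accumulated = 0.0
for step in range(accum_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / accum_steps

print(full_grad, accumulated)
```

The two numbers agree up to floating-point rounding, so the effective batch size is micro-batch size times accumulation steps; the cost is that each update now takes several forward and backward passes.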
2. Learning Rate
This can be understood as the "step size" of fine-tuning, determining the magnitude of each parameter update by the model.
- Learning rate too high: The model easily learns incorrectly, distorting its original general knowledge;
- Learning rate too low: Training is slow, and the effect remains weak even after long training.
For LoRA fine-tuning, the learning rate is generally kept small, usually in the range of 5e-5 to 1e-4. When in doubt, start low and raise it gradually, so that the adapter does not distort the general knowledge inherited from the base model.
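The "step size" intuition can be seen on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic whose minimum sits at 3.0; the function and the three learning rates are illustrative assumptions chosen to make the effect visible and are not comparable to real LoRA values such as 1e-4.

```python
def gradient_descent(lr, steps=20, start=0.0, target=3.0):
    """Minimize f(w) = (w - target)^2 with a fixed learning rate."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - target)  # derivative of (w - target)^2
        w -= lr * grad           # the learning rate scales every update
    return w


for lr in (0.001, 0.1, 1.1):
    w_final = gradient_descent(lr)
    print(f"lr={lr}: w={w_final:.3f}, distance from optimum={abs(w_final - 3.0):.3f}")
```

With the smallest rate the parameter has barely moved after 20 steps, with the middle rate it lands close to the optimum, and with the largest rate every step overshoots and the error grows instead of shrinking.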
3. LoRA Rank
The size of the rank determines the dimensionality of the bypass parameters and directly affects fitting capability.
- Higher rank: Stronger fitting capability, easier to learn details, but also prone to overfitting, where it just memorizes the dataset without being able to generalize;
- Lower rank: Fewer parameters and more lightweight, but may not learn thoroughly in complex scenarios.
For everyday vertical-domain fine-tuning, such as customer service, copywriting, or writing in a specific style, a rank of 8 to 16 is usually sufficient and offers a good cost-performance ratio.
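A quick back-of-the-envelope calculation shows how rank drives the number of trainable parameters. The hidden size of 4096 is an assumption (a common size for 7B-class models), and the sketch looks at a single square weight matrix only; the total for a real run depends on how many layers and which projections the adapter is attached to.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return rank * (d_in + d_out)


hidden = 4096                # assumed hidden size
full = hidden * hidden       # parameters in one full weight matrix

for rank in (4, 8, 16, 64):
    n = lora_params(hidden, hidden, rank)
    print(f"rank {rank:>2}: {n:>7} trainable params ({n / full:.2%} of the full matrix)")
```

Doubling the rank doubles the adapter size, yet even at rank 16 the adapter stays under 1% of the matrix it modifies, which is why modest ranks are cheap to train and store.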
After a few hours of training on a few hundred customer service dialogues, the model reliably picked up the customer service tone, process norms, and reply boundaries, performing significantly better than the general-purpose base used directly.
3. Step Three: Ollama Quantization Optimization — Making the Model Runnable in a Production Environment
A fine-tuned model is still impractical to deploy as-is: a 7B model at 16-bit precision needs roughly 14 GB for the weights alone, more than most personal computers and entry-level servers can spare. This is where quantization comes in. I used quantization in the GGUF format, the model format that Ollama runs (it originates from the llama.cpp project), which is a practical choice for individuals and small teams.
What is Quantization? Trading Minimal Precision for Significant Performance Gains
Quantization is the process of compressing a model's high-precision parameters into lower precision, for example from 16-bit floating point to 8-bit or even 4-bit integers. The practical effects are:
- Model size is drastically reduced: A 7B model at 16-bit is about 14 GB, but after 4-bit quantization it needs only around 4 GB;
- Inference speed is noticeably faster: Generation speed can often more than double, and conversation latency drops significantly;
- Precision loss is usually small: In daily Q&A, customer service, and knowledge base scenarios, 4-bit output is often hard to tell apart from 16-bit output.
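Both the size arithmetic and the precision trade-off can be reproduced with a short Python sketch. Treat it as a simplified illustration: it uses a single scale for the whole list of weights, whereas GGUF quantization types work on small blocks with their own scales, and the eight weight values are made up.

```python
def model_size_gb(n_params, bits_per_param):
    """Storage for the weights alone, ignoring metadata and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9


def quantize_int4(values):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


print(model_size_gb(7e9, 16), "GB at 16-bit")
print(model_size_gb(7e9, 4), "GB at 4-bit")

weights = [0.12, -0.53, 0.07, 0.91, -0.88, 0.33, -0.02, 0.45]  # made-up values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.3f}")
```

Each weight is stored as one of 16 integer levels plus a shared scale, so the rounding error per weight is bounded by half a step. Real 4-bit files come out somewhat larger than the raw arithmetic suggests, because the scales are stored too and some tensors are kept at higher precision.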
After I exported the fine-tuned model to GGUF format and applied 4-bit quantization with Ollama, it ran comfortably on an ordinary office computer. Combined with the semantic-chunking knowledge base built earlier, the full retrieval + generation flow worked end to end and was able to support the concurrency needs of a small-scale intelligent customer service system.
Closing the Loop: The Complete Deployment Chain
At this point, we can string all the previous content into a complete chain: General Pre-trained Base → LoRA Vertical Domain Fine-tuning → Ollama Quantization Deployment → Paired with RAG Knowledge Base + ReAct Framework → Production-Grade Intelligent Customer Service. From the model's own capabilities to the upper-level application architecture, every link corresponds to a real deployment requirement. Missing any step makes it very difficult to create a truly effective AI product.
Conclusion
From a base model that only knows how to predict the next word to an intelligent Agent that can do real work in production, there are countless details in between: pre-training, fine-tuning, quantization, knowledge bases, framework design. AI has never been magic; it is the result of layers of technology stacked together. Just like a person, it needs to build a foundation, train a specialty, and polish its deployment to go from "sounding impressive" to "being genuinely useful."
Have you ever tried fine-tuning a model? Or encountered performance bottlenecks when deploying AI applications? Feel free to share your experiences and pitfalls in the comments.
If you found this content useful, please like, share, and forward it to friends around you working in AI and technology.