The AI Engineer's Map: Where Every Model Format, Framework, and Deployment Tool Actually Belongs
Author: 吴佳浩 Alben
Writing time: 2026.6.28
Introduction
In the last couple of years, the AI space has been churning out new terms faster than anyone can keep up—yesterday it was PyTorch and TensorFlow, today it's GGUF, Safetensors, and ONNX, and tomorrow it might be Ollama, vLLM, or SGLang. Many people (myself included when I first started) instinctively try to memorize them as isolated facts, but end up with a head full of jargon and no coherent picture, still unsure where to slot in the next new tool.
In reality, these terms are never on the same level; they are distributed across different stages of the same AI development pipeline: a model is first trained, then fine-tuned, then saved in a certain format, loaded and invoked by a library, optimized for inference, and finally deployed as a service for real applications. Once you straighten out this pipeline, nearly every term immediately finds its place.
1. Why Does the AI World Have So Many New Terms?
- Why do we have PyTorch and also Transformers?
- GGUF, Safetensors, and ONNX are all model file formats—why do we need three?
- Is LoRA a model?
- What is the relationship between Ollama, Llamafile, and Xinference?
These terms are not on the same level; they belong to different stages of the AI development workflow.
2. A Single Diagram to Understand the AI Engineer Tech Stack (Panoramic View)
3. Model Training: How Is a Model Forged?
3.1 PyTorch
What it is: An open-source deep learning framework originally developed at Meta (formerly Facebook) and, since 2022, governed by the PyTorch Foundation under the Linux Foundation. It is centered on tensor computation and automatic differentiation (Autograd). Developers define network architectures and loss functions in Python; PyTorch records the operations, computes gradients, and performs backpropagation, so there is no need to hand-write derivative formulas.
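To make "automatic differentiation" concrete, here is a minimal sketch of reverse-mode autodiff on plain Python scalars. The `Value` class is hypothetical and written only for this illustration; it is not how PyTorch is implemented internally, but it follows the same idea of recording the graph during the forward pass and applying the chain rule in reverse.

```python
# Toy reverse-mode automatic differentiation on scalars (illustration only;
# `Value` is a hypothetical class, not part of PyTorch).

class Value:
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # Values this node was computed from
        self._grad_fns = grad_fns    # maps the node's gradient to each parent's share

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, fn in zip(node._parents, node._grad_fns):
                parent.grad += fn(node.grad)


w, x, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b      # the forward pass builds the graph as the code runs
y.backward()       # the reverse pass fills in .grad on every input
print(y.data, w.grad, x.grad, b.grad)
```

Because the graph is built while ordinary Python executes, intermediate values can be inspected at any point, which is the property the "dynamic graph" argument relies on.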
Why it became the de facto standard for large model training:
- Dynamic Graphs (Eager Execution): The model structure is determined at runtime, allowing developers to debug by setting breakpoints and printing intermediate tensors just like normal Python code. This is crucial for researchers iterating quickly. Early TensorFlow used static graphs, requiring compilation of the computation graph before execution, which made debugging much harder.
- Ecosystem Aggregation: Nearly all mainstream large model training and inference frameworks (Hugging Face Transformers, DeepSpeed, Megatron-LM, vLLM) are built on PyTorch. Papers typically release a PyTorch implementation first, so for many practitioners PyTorch alone is enough.
- Dual Penetration in Industry and Academia: Starting around 2019, PyTorch implementations surpassed TensorFlow in top-tier conference papers, establishing the convention that "new models ship with PyTorch first."
Pros and Cons:
- Pros: Quick to pick up, debugging-friendly, richest community resources, seamless integration with the Hugging Face ecosystem.
- Cons: Mobile/embedded deployment is less mature than TensorFlow Lite, often requiring extra conversion (e.g., to ONNX) to run on certain hardware.
# Minimal PyTorch training example: define model, compute gradients, backpropagate
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x = torch.randn(32, 10)
y = torch.randn(32, 1)
for epoch in range(100):
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()  # Automatic differentiation: computes gradients for all parameters
    optimizer.step()  # Update parameters
3.2 TensorFlow
From Google: Open-sourced by the Google Brain team in 2015, originally designed around static computation graphs, requiring the entire network to be defined in a tf.Graph before executing via Session.run(). This offered large optimization potential but was cumbersome to write.
Why it was once the hottest: Between 2016 and 2019, TensorFlow was synonymous with industrial deployment—TensorFlow Serving for production inference, TensorFlow Lite for mobile and embedded devices, and TensorBoard for visualizing training. This complete suite of supporting tools was far more mature than PyTorch's at the time, leading to large-scale adoption by big companies (especially those relying on Google Cloud).
Why it's now mostly maintained for legacy projects: Although TensorFlow 2.x introduced Eager Execution to catch up with PyTorch's ease of use, the ecosystem migration cost was too high. Coupled with the open-source community's collective shift to PyTorch in the large model era, new projects rarely choose TensorFlow anymore. It mostly appears in long-running, hard-to-rewrite legacy systems (like certain recommendation systems or mobile CV models).
Final comparison:
| Framework | Positioning | Suitable Scenarios |
|---|---|---|
| PyTorch | Mainstream training framework | LLMs, CV, research |
| TensorFlow | Deep learning framework | Enterprise, mobile, legacy projects |
4. Model Fine-tuning: Why Not Retrain Trillions of Parameters?
4.1 LoRA
Why it emerged: Fully fine-tuning a model with tens or hundreds of billions of parameters means storing gradients and optimizer states for every parameter (the Adam optimizer keeps first and second moment estimates for each one). The resulting memory footprint is typically several times the size of the weights themselves, far more than a consumer-grade GPU can hold. LoRA (Low-Rank Adaptation), proposed by Microsoft in 2021, takes a different approach: freeze all original model parameters and attach two small low-rank matrices, A and B, alongside selected weight matrices (such as the Q and V projections in Attention). Only these two small matrices are updated during training.
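A rough calculation shows why full fine-tuning outgrows consumer hardware. The per-parameter byte counts below are assumptions for one common mixed-precision Adam setup, not figures from this article, and activations are left out entirely.

```python
# Back-of-the-envelope memory estimate for full fine-tuning with Adam.
# The byte counts are assumptions for a common mixed-precision setup.

BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_gib(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer state in GiB (activations excluded)."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 2**30


for billions in (7, 70):
    print(f"{billions}B parameters: ~{full_finetune_gib(billions * 1e9):.0f} GiB")
```

Under these assumptions every parameter costs 16 bytes, so even a 7B model needs on the order of 100 GiB before activations are counted.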
Why memory requirements are low: Suppose an original weight matrix W is d×d. LoRA leaves W frozen and learns its update as the product of two small matrices, B (d×r) and A (r×d), where the rank r is much smaller than d (typically 4, 8, or 16). The number of parameters that need gradients and optimizer state drops from d² to 2dr, often only 0.1%–1% of the original model's parameters. The frozen weights still have to be held in memory, but they no longer need gradients or optimizer state. As a rough illustration, a model whose full fine-tuning needs 80GB or more of VRAM may fit in a little over 10GB with LoRA, which is what lets individuals fine-tune large models on consumer GPUs like the RTX 4090.
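The d² versus 2dr comparison is easy to check numerically. The sketch below assumes a hypothetical square layer with d = 4096; real models mix layer shapes, so per-model percentages will differ.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (parameters in a full d x d matrix, parameters in the LoRA pair)."""
    full = d * d          # every entry of the original matrix
    lora = 2 * d * r      # one d x r matrix plus one r x d matrix
    return full, lora


d = 4096  # hypothetical hidden size, chosen only for illustration
for r in (4, 8, 16):
    full, lora = lora_param_counts(d, r)
    print(f"r={r}: {lora:,} trainable vs {full:,} full ({lora / full:.2%})")
```

For this layer, r = 8 trains about 0.39% of the entries a full update would touch, which sits inside the 0.1%–1% range quoted above.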
Why Hugging Face is full of LoRAs: After training, LoRA produces a very small "delta weight file" (usually tens to hundreds of MB), eliminating the need to redistribute the entire multi-GB base model. During use, the LoRA weights are simply merged back into the original model. This "lightweight, pluggable, and stackable" nature has led to a proliferation of LoRA weights on Hugging Face for different tasks and styles (especially style LoRAs for Stable Diffusion and domain-specific fine-tuning LoRAs for LLMs), causing the ecosystem to flourish rapidly.
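"Merging back" can be sketched with tiny matrices in pure Python. This assumes the usual LoRA scaling factor alpha / r; the sizes and random values are arbitrary, and real libraries perform this step on tensors rather than nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


random.seed(0)
d, r, alpha = 6, 2, 4   # toy sizes; alpha / r is the assumed LoRA scaling
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen base weight, d x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # trained, d x r
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # trained, r x d
x = [[random.gauss(0, 1)] for _ in range(d)]                      # input column vector

# Adapter kept separate: h = W x + (alpha / r) * B (A x)
h_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged into the base weight: W' = W + (alpha / r) * B A
W_merged = add(W, scale(matmul(B, A), alpha / r))
h_merged = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(h_adapter, h_merged))
print(f"largest difference between the two paths: {max_diff:.2e}")
```

Both paths give the same output up to floating-point rounding, which is why only B and A need to be distributed: here that is 2·d·r = 24 numbers instead of d² = 36, and the gap widens quickly as d grows.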
# Using the peft library to add LoRA fine-tuning to a large model (Hugging Face model example)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_config = LoraConfig(
    r=8,  # Rank r of the low-rank matrices
    lora_alpha=16,  # Scaling factor applied to the low-rank update
    target_modules=["q_proj", "v_proj"],  # Insert LoRA only on the Q and V projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable and total parameter counts; with this config the trainable share is well under 0.1%
5. Model Saving: Why Are There So Many Model Formats?
5.1 Safetensors
Why it's safe: Before Safetensors, model weights were commonly saved with Python's pickle format (as in PyTorch's .bin/.pt files). Unpickling can execute arbitrary code embedded in the file, so loading a model downloaded from an untrusted source could run malicious code on your machine (an arbitrary code execution risk). Safetensors, designed by Hugging Face, stores only pure data in its file format: a header describing tensor names, shapes, and dtypes, followed by the raw bytes, with no executable code. Loading it simply reads bytes, which removes this supply chain poisoning vector at the format level.
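The pickle risk is easy to reproduce with the standard library alone. In this sketch the `Payload` class is hypothetical, and `eval("6 * 7")` is a harmless stand-in for whatever an attacker would actually run.

```python
import pickle


class Payload:
    # __reduce__ tells pickle how to rebuild the object on load:
    # "call this callable with these arguments".
    def __reduce__(self):
        return (eval, ("6 * 7",))   # harmless stand-in for attacker-chosen code


blob = pickle.dumps(Payload())   # what a malicious "model file" could contain
result = pickle.loads(blob)      # loading runs eval("6 * 7") instead of restoring data
print(result)
```

The loader never gets a `Payload` object back; it gets whatever the embedded call returned, and that call has already run by the time `pickle.loads` returns.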
Why Hugging Face recommends it by default: Beyond security, Safetensors supports zero-copy and memory-mapped (mmap) loading, which is faster than pickle deserialization, especially noticeable when loading models tens of GB in size. Consequently, Hugging Face Hub has set Safetensors as the default recommended format for model uploads and loading since 2023, and new models almost always provide a .safetensors version.
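The layout that makes this possible can be sketched with the standard library. The code below mirrors the published Safetensors structure as I understand it (an 8-byte little-endian header length, a JSON header, then raw tensor bytes), but it is a simplified illustration restricted to 1-D float32 data, not the `safetensors` library, and it has not been checked against files that library produces.

```python
import json
import mmap
import os
import struct
import tempfile


def write_file(path, tensors):
    """tensors: name -> list of floats, stored as 1-D little-endian float32."""
    header, payload, offset = {}, b"", 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {"dtype": "F32", "shape": [len(values)],
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON metadata only
        f.write(payload)                               # raw tensor bytes


def read_tensor(path, name):
    """Memory-map the file and decode only the bytes of the requested tensor."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack("<Q", mm[:8])
        header = json.loads(mm[8:8 + header_len])
        begin, end = header[name]["data_offsets"]
        data_start = 8 + header_len
        raw = mm[data_start + begin:data_start + end]
        return list(struct.unpack(f"<{(end - begin) // 4}f", raw))


path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_file(path, {"a": [1.0, 2.0], "b": [3.0, 4.0, 5.0]})
print(read_tensor(path, "b"))
```

Nothing in the read path interprets the file as code: the header is parsed as JSON and the rest is sliced by offset, so a single tensor can be pulled out without touching the others.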
5.2 GGUF
Why Ollama uses it: GGUF (GGML Universal Format) is a model file format designed by the llama.cpp project, built with quantized, CPU-friendly inference in mind. It can store weights quantized from FP16/FP32 down to lower precisions such as 4-bit, 5-bit, or 8-bit; at 4-bit the file shrinks to roughly a quarter to a third of its FP16 size. It also packages the tokenizer, model hyperparameters, and other metadata into a single file, achieving "one file to run." Ollama relies on llama.cpp for inference under the hood, so it naturally uses GGUF as its standard model format.
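The size reduction comes from block-wise quantization. The sketch below is a simplified symmetric 4-bit scheme with one shared scale per block of 32 weights; it is meant to show the idea only, and the quantization types llama.cpp actually ships (Q4_K_M and others) differ in their details.

```python
import random

BLOCK = 32  # weights that share one scale


def quantize_block(block):
    """Map floats to 4-bit integers in [-8, 7] using one shared scale."""
    amax = max(abs(v) for v in block)
    scale = amax / 7 if amax else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(BLOCK * 4)]
restored = []
for i in range(0, len(weights), BLOCK):
    scale, codes = quantize_block(weights[i:i + BLOCK])
    restored.extend(dequantize_block(scale, codes))

max_err = max(abs(a - b) for a, b in zip(weights, restored))
bits_per_weight = (BLOCK * 4 + 16) / BLOCK   # 4-bit codes plus one fp16 scale per block
print(f"max abs error: {max_err:.5f}")
print(f"bits per weight: {bits_per_weight} vs 16 for FP16 ({bits_per_weight / 16:.0%})")
```

The per-block scale is why the effective cost is 4.5 bits per weight rather than exactly 4, which lands at a bit over a quarter of the FP16 size.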
Why llama.cpp supports it: GGUF was designed and is maintained by the llama.cpp author Georgi Gerganov's team (its predecessor was GGML). They share the same technical lineage, and all tools derived from the llama.cpp ecosystem (Ollama, LM Studio, text-generation-webui, etc.) natively treat GGUF as a first-class citizen.
Why it's suitable for CPU and local deployment: GGUF, combined with llama.cpp's quantized inference kernels, can run models with over 10B parameters without a GPU, relying solely on CPU (or even a phone chip). This is crucial for individual developers deploying large models locally on a laptop without depending on cloud APIs.
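Whether a model fits on a laptop is mostly a question of weight size. The estimate below counts weights only (no KV cache or runtime overhead), and the bits-per-weight figures are assumptions for typical 8-bit and 4-bit block formats.

```python
def weights_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed effective bits per weight, including per-block scales.
formats = (("FP16", 16), ("8-bit", 8.5), ("4-bit", 4.5))
for label, bits in formats:
    print(f"13B model, {label}: ~{weights_gib(13e9, bits):.1f} GiB")
```

On these assumptions a 13B model drops from about 24 GiB at FP16 to under 7 GiB at 4-bit, which is within reach of ordinary laptop RAM.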
5.3 ONNX
Why enterprises favor it for deployment: ONNX (Open Neural Network Exchange), jointly initiated by Microsoft and Meta, is positioned as an "intermediate language for models." Enterprises often train with PyTorch, but production environments might require running models on Java backends, mobile devices, or dedicated inference chips. Using the PyTorch runtime directly is too heavy and has too many dependencies. Exporting the model to ONNX allows it to be loaded and executed with the lighter, higher-performance ONNX Runtime, decoupling it from the original training framework's dependencies.
Why it's cross-framework: ONNX defines a set of standard operators independent of any specific framework. PyTorch, TensorFlow, scikit-learn, and others can all export to ONNX format, theoretically enabling "train once, deploy anywhere" and preventing vendor lock-in to a single training framework.
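The "standard operators" idea can be illustrated with a toy graph and a toy runtime. `Mul`, `Add`, and `Relu` are real ONNX operator names, but the dictionary-based graph format and the `run` function are hypothetical simplifications, not the ONNX file format or ONNX Runtime.

```python
# A framework-neutral graph: each node names a standard op, its inputs, and its output.
graph = [
    {"op": "Mul", "inputs": ["x", "w"], "output": "xw"},
    {"op": "Add", "inputs": ["xw", "b"], "output": "z"},
    {"op": "Relu", "inputs": ["z"], "output": "y"},
]

# One possible "runtime": a table mapping op names to implementations.
OPS = {
    "Mul": lambda a, b: [p * q for p, q in zip(a, b)],
    "Add": lambda a, b: [p + q for p, q in zip(a, b)],
    "Relu": lambda a: [max(0.0, p) for p in a],
}


def run(graph, feeds):
    """Evaluate nodes in order, looking up each op by name."""
    values = dict(feeds)
    for node in graph:
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = OPS[node["op"]](*args)
    return values


out = run(graph, {"x": [1.0, -2.0], "w": [3.0, 3.0], "b": [0.5, 0.5]})
print(out["y"])
```

The graph carries no reference to the framework that produced it; any runtime that implements the same operator names can execute it.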
Why it supports various hardware: ONNX Runtime provides Execution Providers (EPs) for different hardware backends—like CUDA EP for NVIDIA GPUs, OpenVINO EP for Intel chips, and CoreML EP for Apple devices. The same ONNX model file can be executed on different hardware using the optimal underlying acceleration library for each, which is why it has long held a place in enterprise cross-platform deployment.
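Provider selection amounts to walking an ordered preference list. The sketch below is pure Python with hypothetical machine inventories; in ONNX Runtime itself the available list comes from `onnxruntime.get_available_providers()` and the preference list is passed as `providers=` when creating a session.

```python
def pick_provider(preferred, available):
    """Return the first preferred execution provider that this machine offers."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError("none of the requested providers is available")


preferred = ["CUDAExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"]

# Hypothetical inventories for two machines.
gpu_server = ["CUDAExecutionProvider", "CPUExecutionProvider"]
laptop = ["CPUExecutionProvider"]

print(pick_provider(preferred, gpu_server))
print(pick_provider(preferred, laptop))
```

Ending the preference list with the CPU provider gives the same model file a working fallback on machines without an accelerator.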
Final comparison table:
| Format | Purpose | Best For |
|---|---|---|
| Safetensors | Saving training weights | Hugging Face |
| GGUF | Local inference | Ollama, llama.cpp |
| ONNX | Cross-platform deployment | Enterprise inference |
# Safetensors: Safe loading, zero-copy
import torch
from safetensors.torch import save_file, load_file
tensors = {"weight": torch.randn(768, 768)}
save_file(tensors, "model.safetensors")
loaded = load_file("model.safetensors") # Only reads bytes, executes no code
# GGUF: Use llama.cpp's scripts to convert a Hugging Face model to GGUF, then quantize it (shell commands)
# python convert_hf_to_gguf.py ./my-model --outfile model.gguf
# ./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M # 4-bit quantization
# ONNX: Export a PyTorch model to ONNX, use ONNX Runtime for cross-platform inference
import onnxruntime as ort
model = torch.nn.Linear(10, 1).eval() # Small stand-in model so this snippet runs on its own
x = torch.randn(1, 10) # Example input used to trace the model during export
torch.onnx.export(model, (x,), "model.onnx", input_names=["input"], output_names=["output"])
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
result = session.run(None, {"input": x.numpy()})
6. Model Loading: Why Can a Model Run with a Single Line of Code?
6.1 Transformers
Handles NLP and large language models. The Hugging Face Transformers library wraps models of different architectures (BERT, GPT, LLaMA, Qwen, etc.) into a consistent interface—from_pretrained() loads weights, tokenizer handles text tokenization, and generate() performs text generation. Developers don't need to understand the internal structural differences of each model; switching models is basically changing one line for the model name. This "unified interface, abstracting away underlying differences" design is the core reason it has become the de facto standard library for NLP/LLMs.
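One way to picture the "unified interface" is a registry that maps a model type string to a concrete class. The sketch below is a hypothetical toy (`ToyAutoModel`, `ToyEncoder`, and `ToyDecoder` are invented names), loosely inspired by how a `model_type` field in a model's config can select the implementation; it does not reproduce the Transformers source.

```python
class ToyAutoModel:
    """Looks up the concrete class from a config's model_type."""
    _registry = {}

    @classmethod
    def register(cls, model_type):
        def wrap(model_cls):
            cls._registry[model_type] = model_cls
            return model_cls
        return wrap

    @classmethod
    def from_config(cls, config):
        return cls._registry[config["model_type"]](config)


@ToyAutoModel.register("toy-encoder")
class ToyEncoder:
    def __init__(self, config):
        self.config = config

    def describe(self):
        return f"encoder with {self.config['layers']} layers"


@ToyAutoModel.register("toy-decoder")
class ToyDecoder:
    def __init__(self, config):
        self.config = config

    def describe(self):
        return f"decoder with {self.config['layers']} layers"


# The calling code is identical for both; only the config changes.
for config in ({"model_type": "toy-encoder", "layers": 12},
               {"model_type": "toy-decoder", "layers": 28}):
    print(ToyAutoModel.from_config(config).describe())
```

The caller never names a concrete class, which is the property that makes switching models a one-line change.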
6.2 Diffusers
Handles Stable Diffusion, FLUX, and other image generation models. Diffusers is a library built by Hugging Face specifically for diffusion models, encapsulating the "noising-denoising" image (and audio/video) generation process into Pipelines. It provides a unified interface to call different text-to-image models (Stable Diffusion series, FLUX, etc.) and includes standard implementations of diffusion-specific components like VAE, UNet, and Scheduler, saving developers the trouble of assembling this mathematical process from scratch.
6.3 sentence-transformers
Covers embeddings, RAG, vector databases, and semantic search. sentence-transformers focuses on converting a piece of text into a fixed-length vector (Embedding) that represents its semantic meaning. It is a key component of RAG (Retrieval-Augmented Generation) systems: knowledge base documents are first converted into vectors and stored in a vector database. User queries are also converted into vectors, and a similarity search retrieves the most relevant document chunks to feed to the LLM for answering, thus enabling semantic search (matching by "similar meaning," not just keyword matching).
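The retrieval step can be sketched without any model. The three-dimensional vectors below are made-up stand-ins for real embeddings (which have hundreds of dimensions), and the dictionary plays the role of the vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical document "embeddings"; a real system would compute these with a model.
documents = {
    "LoRA freezes the base model and trains small matrices": [0.9, 0.1, 0.0],
    "GGUF packs quantized weights into one file": [0.1, 0.9, 0.1],
    "vLLM pages the KV cache": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]   # stands in for the embedding of "What is low-rank adaptation?"


def retrieve(query, docs, top_k=1):
    """Rank documents by similarity to the query and keep the best top_k."""
    ranked = sorted(docs, key=lambda text: cosine(query, docs[text]), reverse=True)
    return ranked[:top_k]


print(retrieve(query_vector, documents))
```

The query shares no keywords with the LoRA document, yet it is ranked first because the vectors point in nearly the same direction; that is the "similar meaning" matching described above.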
Final table:
| Library | What it loads |
|---|---|
| Transformers | LLMs |
| Diffusers | Text-to-image |
| sentence-transformers | Embeddings |
# Transformers: Load and invoke a large language model
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
inputs = tokenizer("Explain LoRA in one sentence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Diffusers: Load and invoke Stable Diffusion for text-to-image
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
image = pipe("A cat playing guitar on the moon").images[0]
image.save("output.png")
# sentence-transformers: Convert text to Embedding vectors
from sentence_transformers import SentenceTransformer, util
embed_model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
vectors = embed_model.encode(["What is LoRA", "What is low-rank adaptation"])
# Compare the two vectors with cosine similarity; a higher score means the texts are semantically closer, which is the basis of semantic search / RAG retrieval
print(util.cos_sim(vectors[0], vectors[1]))
7. Inference Optimization: Why Is the Same Model Ten Times Faster for Some?
7.1 MLX
Exclusive to Apple Silicon. MLX is an official machine learning framework from Apple, specifically optimized for the unified memory architecture of M-series chips (M1/M2/M3/M4), where CPU and GPU share the same memory pool, eliminating the need to copy data back and forth between CPU RAM and GPU VRAM as with discrete graphics cards. This allows for more efficient use of Mac's compute power for large model inference and even lightweight training.
Why Mac users are talking about it. Macs have no NVIDIA GPU, so CUDA-only inference and training frameworks cannot run on them. Alternatives such as PyTorch's MPS backend and llama.cpp's Metal backend do exist, but MLX is Apple's own framework designed around Apple Silicon from the start, giving Mac users (especially on high-memory M-series models) a way to run large models locally with close-to-native efficiency. Its API is deliberately close to PyTorch/NumPy, lowering the learning curve, which is why it generates significant discussion among Mac developers and local LLM enthusiasts.
7.2 OpenVINO
An official inference optimization toolkit from Intel (OpenVINO = Open Visual Inference & Neural Network Optimization).
- CPU: Optimizes computation graph operators for Intel CPU instruction sets (like AVX-512), enabling usable inference speeds on servers or PCs without discrete GPUs.
- GPU: Supports acceleration on Intel integrated graphics / discrete GPUs (Arc series).
- NPU: Supports the Neural Processing Units built into new Intel Core Ultra processors, designed specifically for AI inference with lower power consumption.
- Unified Optimization: Developers only need to convert a model to OpenVINO's Intermediate Representation (IR) format once. At runtime, it can automatically select whether to execute on CPU, GPU, or NPU based on the current device, eliminating the need to write separate inference code for each hardware type. This "convert once, adapt to multiple hardware" capability is its core value within the Intel ecosystem.
# MLX: Run large model inference on Apple Silicon
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Explain MLX in one sentence", max_tokens=50)
print(response)
# OpenVINO: Export the model to IR format on load and run it on a chosen device
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "model_id" # Placeholder: replace with a Hugging Face model ID or a local path
model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="CPU") # device="CPU" pins execution to the CPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
8. Model Deployment: How Does the Model Finally Serve Requests?
8.1 Ollama
Why it's the hottest right now: Ollama simplifies the entire process of "downloading a model, configuring the environment, and starting an inference service" (which is typically cumbersome for beginners) into a single command: ollama run llama3. Behind the scenes, it downloads a pre-quantized model, loads it, and exposes a local HTTP API. This "out-of-the-box" experience is the core reason for its rapid rise in popularity among individual developers and local deployment enthusiasts.
Why GGUF is supported by default: Ollama's inference kernel is based on llama.cpp, so it natively supports the GGUF format. Models in the Ollama Library are also pre-converted and packaged GGUF files; users don't need to worry about format conversion details.
When it's suitable: Running large models locally on a personal computer for experiments, demos, or privacy-sensitive scenarios where you don't want to call cloud APIs, or just quickly testing an open-source model's performance. Ollama is currently the lowest-barrier option. It is not, however, designed for high-concurrency, multi-user production workloads, where its throughput becomes the limiting factor.
# Ollama: Pull and run a model with one command (opens an interactive chat in the terminal)
ollama run llama3
# The local Ollama server also exposes an HTTP API, by default on port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Ollama in one sentence"
}'
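The same endpoint can be called from Python with only the standard library. This is a minimal sketch that assumes an Ollama server on its default port with the `llama3` model already pulled; `build_request` and `generate` are helper names invented here, and `"stream": False` asks for one JSON reply instead of a token stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local address


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST request for the generate endpoint without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def generate(model: str, prompt: str) -> str:
    """Send the request; requires a running Ollama server with the model available."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


request = build_request("llama3", "Explain Ollama in one sentence")
print(request.full_url, request.get_method())
# Calling generate("llama3", "Explain Ollama in one sentence") returns the model's text
# once the server is running; it is not invoked here so the sketch runs offline.
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a live server.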
8.2 Xinference
Unified deployment and unified management for LLM, Embedding, Reranker, and ASR models. Xinference (Xorbits Inference) is positioned differently from Ollama and targets enterprise/team-level scenarios: a single platform for managing and deploying multiple types of AI models, covering not just large language models (LLMs) but also Embedding models, Rerankers (re-ranking, often used for fine-ranking in RAG), ASR (Automatic Speech Recognition), and more. It provides a unified API gateway and model lifecycle management (start, stop, scale), essentially functioning as a "private Model-as-a-Service (MaaS) platform."
Why enterprises are increasingly favoring it: Internally, companies often need several types of models to work together (e.g., a RAG system needs an Embedding model for retrieval, a Reranker for fine-ranking, and an LLM for final generation). If each model type is deployed and managed with different tools, operational costs are high. Xinference unifies these model types under a single deployment and invocation system, and its support for distributed deployment and horizontal scaling to handle high concurrency makes it more suitable for enterprise production environments.
# Xinference: Start the local service (shell), then register and deploy LLM / Embedding / Reranker models on it
xinference-local --host 0.0.0.0 --port 9997
# Launch an LLM instance from the command line (shell); recent releases may also require a --model-engine argument
xinference launch --model-name qwen2.5-instruct --model-format pytorch --size-in-billions 7
# Call via the Python SDK
from xinference.client import Client
client = Client("http://localhost:9997")
model = client.get_model("qwen2.5-instruct") # Pass the model UID reported by `xinference launch`
response = model.chat(messages=[{"role": "user", "content": "Hello"}])
print(response)
8.3 Llamafile
Why a single exe can run a model: Llamafile, built by a Mozilla-backed team based on llama.cpp, uses a technology called Cosmopolitan Libc to package model weights and the llama.cpp inference engine into a single executable file. This executable uses a "multi-format compatible" binary construction method, allowing the same file to run directly by double-clicking or via command line on Windows, macOS, and Linux, without requiring users to install a Python environment or a bunch of pip dependencies.
Underlying principle: It relies on the "Actually Portable Executable" format implemented by the Cosmopolitan Libc project, which lays out a single file so that the executable loaders of different operating systems each recognize it as valid, thus achieving cross-platform, zero-dependency execution.
Suitable scenarios: Distributing a program capable of running a large model to users with no technical background (e.g., for internal tools, demoing to non-technical colleagues). The recipient's computer needs no environment setup; downloading a single file and double-clicking it is enough. This "ultimate distribution convenience" is Llamafile's core use case, though flexibility and customizability are correspondingly lower than Ollama or using llama.cpp directly.
# Llamafile: Download a file, grant execute permission, run directly
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile
# Automatically starts a service with a Web UI locally; open http://localhost:8080 in a browser to chat
8.4 vLLM
Why it's the top choice for production-grade high-concurrency inference: vLLM, proposed by a team at UC Berkeley, features PagedAttention as its core innovation. It manages the KV Cache (Key-Value Cache) that must be stored during Attention computation using an approach similar to virtual memory paging in operating systems. It does not require each request's KV Cache to occupy contiguous VRAM space, significantly reducing VRAM fragmentation and waste. It also supports Continuous Batching, allowing new requests to be inserted into an ongoing batch at any time without waiting for the entire batch to finish, significantly boosting GPU utilization and throughput.
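The paging idea can be sketched as a small block allocator. This toy tracks only which physical blocks belong to which request; the class name, block size, and request IDs are invented for the illustration, and vLLM's real implementation manages GPU tensors and is far more involved.

```python
BLOCK_SIZE = 4   # tokens per KV block (toy value)


class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block ids not in use
        self.block_tables = {}                       # request id -> list of physical blocks
        self.lengths = {}                            # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % BLOCK_SIZE == 0:                 # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[request_id] = length + 1

    def locate(self, request_id, token_index):
        """Translate a logical token position into (physical block, offset)."""
        table = self.block_tables[request_id]
        return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.lengths[request_id]


cache = PagedKVCache(num_blocks=6)
for _ in range(5):
    cache.append_token("req-A")      # 5 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("req-B")      # 3 tokens occupy 1 block
for _ in range(4):
    cache.append_token("req-A")      # req-A grows to 9 tokens and takes a third block
print(cache.block_tables)
print(cache.locate("req-A", 8))
```

Request A ends up in blocks 0, 1, and 3, with B's block in between: the sequence is contiguous logically but not physically, so no large region has to be reserved up front.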
Suitable scenarios: Online service scenarios facing simultaneous requests from multiple users (e.g., providing a public Chat API, an internal LLM gateway shared by many employees). Compared to tools like Ollama and Llamafile, which are designed for single-machine local use, vLLM is a true inference engine built for high-concurrency production environments, typically deployed on GPU-equipped servers.
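For concurrent traffic, the scheduling policy matters as much as the memory layout. The simulation below compares static batching with continuous batching in a toy model where every running request produces one token per step; the `simulate` function and the request mix are made up for illustration, and real schedulers account for much more (prefill cost, memory limits, preemption).

```python
from collections import deque


def simulate(requests, max_batch, continuous):
    """requests: list of (arrival_step, tokens_to_generate). Returns steps until all finish."""
    pending = deque(sorted(requests))
    running = []          # remaining token counts for requests currently in the batch
    step = 0
    while pending or running:
        # Static batching admits new work only when the previous batch has fully drained.
        can_admit = continuous or not running
        while can_admit and pending and pending[0][0] <= step and len(running) < max_batch:
            running.append(pending.popleft()[1])
        # One decode step: each running request emits a token; finished ones leave.
        running = [n - 1 for n in running if n > 1]
        step += 1
    return step


# One long request and several short ones arriving over time (hypothetical workload).
workload = [(0, 8), (0, 2), (1, 2), (2, 2)]
static_steps = simulate(workload, max_batch=2, continuous=False)
continuous_steps = simulate(workload, max_batch=2, continuous=True)
print(f"static batching: {static_steps} steps, continuous batching: {continuous_steps} steps")
```

With static batching the short requests that arrive later wait for the long one to finish; with continuous batching they slot into the batch as soon as a seat frees up, so the same work completes in fewer steps.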
# Install and start a service compatible with the OpenAI API format
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
# Call the OpenAI-compatible endpoint with curl (the OpenAI SDK also works once its base URL points at this server)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Explain vLLM in one sentence"}]
  }'
8.5 SGLang
Why it claims to be faster than vLLM: SGLang, launched by LMSYS (the team behind Vicuna and Chatbot Arena), supports PagedAttention and Continuous Batching like vLLM, but its core highlight is RadixAttention. It uses a Radix Tree structure to automatically reuse the common prefix KV Cache across different requests (e.g., multiple users using the same System Prompt, or repeated historical context in multi-turn conversations). When a prefix hits, the cache can be reused directly, saving redundant computation. In scenarios with many repeated prefixes, such as multi-turn dialogues or Agent tool calls, its throughput and latency advantages over vLLM are more pronounced.
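Prefix reuse can be sketched with a token-level trie. This is a simplification: a true radix tree compresses runs of tokens into single edges, and SGLang's cache also handles eviction and stores actual KV tensors. The `PrefixCache` class and `serve` helper are invented names for this illustration.

```python
class PrefixCache:
    """Token-level trie recording which prefixes have already been computed."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (tokens reused from cache, tokens that still need computing)."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
first = serve(cache, system_prompt + ["What", "is", "LoRA", "?"])
second = serve(cache, system_prompt + ["What", "is", "GGUF", "?"])
print(first, second)
```

The first request computes all 10 tokens; the second reuses the 8 tokens it shares with the first and computes only the 2 that differ, which is where the savings on shared system prompts and multi-turn history come from.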
Suitable scenarios: Complex structured generation tasks, such as scenarios requiring multiple model calls, Prompts with many Few-shot examples, Agent multi-step reasoning, and batch structured output (JSON mode). SGLang also provides a dedicated front-end language (SGLang Language) to describe such multi-step generation logic, making complex call chains more concise to write.
# Install and start the service (also compatible with OpenAI API format)
pip install "sglang[all]"
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --port 30000
# Python call example
import sglang as sgl
# Point the frontend language at the server started above
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
@sgl.function
def multi_turn_chat(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=100))
state = multi_turn_chat.run(question="Explain SGLang in one sentence")
print(state["answer"])
Finally, comparing vLLM and SGLang with the previous deployment tools:
| Tool | Positioning | Core Technology | Best For |
|---|---|---|---|
| Ollama | Local single-machine deployment | Based on llama.cpp, GGUF quantization | Individual developers running models locally, demos |
| Llamafile | Single-file distribution | Cosmopolitan Libc Actually Portable Executable | Zero-dependency distribution to non-technical users |
| Xinference | Enterprise unified deployment platform | Unified management of LLM/Embedding/Reranker/ASR | Team-level multi-model collaborative production environments |
| vLLM | High-concurrency production inference | PagedAttention + Continuous Batching | Providing external APIs, multi-user concurrent services |
| SGLang | High-concurrency + complex call chains | RadixAttention prefix reuse | Multi-turn dialogue, Agents, structured batch generation |
9. What Exactly Is the Relationship Between Them?
10. A Mindmap Overview of All Knowledge Points
11. For Different Roles, Which Layers Should You Focus On?
On the same technical map, the layers that Algorithm Engineers, Infra Engineers, Application Engineers, and MLOps Engineers focus on are completely different. Overlaying roles onto the map is more instructive than simply listing technologies.
11.1 Algorithm Engineer
Main battlefield: Model Training layer (PyTorch / TensorFlow) + Model Fine-tuning layer (LoRA).
The core work of an algorithm engineer is designing network architectures, tuning hyperparameters, running experiments, and improving model performance. Therefore, PyTorch's automatic differentiation mechanism, training loops, and the low-rank decomposition principles of LoRA are areas requiring deep mastery.
Secondary understanding: Model Loading layer (Transformers / Diffusers)—you always need to verify a model's performance after training, so this layer needs to be usable, but not mastered.
Basically untouched: Model format conversion (GGUF quantization, ONNX export), Inference Optimization (MLX / OpenVINO), Model Deployment (Ollama / vLLM / SGLang / Xinference)—these are typically handed off to Infra Engineers.
11.2 Infra / Platform Engineer
Main battlefield: Model Format and Conversion layer (Safetensors / GGUF / ONNX) + Inference Optimization layer (MLX / OpenVINO) + Model Deployment layer (Ollama / Llamafile / Xinference / vLLM / SGLang).
Infra Engineers must ensure that the models trained by algorithm engineers can run stably, efficiently, and cost-effectively—choosing the right quantization precision, selecting the appropriate inference engine, deciding between vLLM and SGLang based on concurrency, and determining whether to integrate OpenVINO based on hardware. This entire chain is their core responsibility.
Secondary understanding: Basic concepts of the Model Training layer (knowing roughly how PyTorch trains a model is enough; no need to tune hyperparameters).
Basically untouched: LoRA fine-tuning details, business-specific loading libraries like Diffusers/sentence-transformers.
11.3 Application / Full-Stack AI Engineer (RAG, Agent, and other business directions)
Main battlefield: sentence-transformers in the Model Loading layer (for creating Embeddings, interfacing with vector databases) + calling already-deployed model service APIs (regardless of whether the underlying engine is Ollama or vLLM, they only care about the invocation interface).
These engineers typically do not train models or handle low-level deployment. Instead, they integrate the model services deployed by Infra Engineers into specific business logic via APIs (building RAG retrieval pipelines, designing Agent tool-calling chains, writing Prompts).
Secondary understanding: Basic invocation methods for Transformers / Diffusers, convenient for small-scale local experiments and performance verification.
Basically untouched: Model training, LoRA fine-tuning, low-level inference optimization, quantization format conversion.
11.4 MLOps / Deployment Operations Engineer
Main battlefield: The production-grade service part of the Model Deployment layer (vLLM / SGLang / Xinference)—monitoring GPU utilization, configuring auto-scaling, ensuring service SLAs, performing canary releases and version rollbacks.
This role sits between Infra Engineers and Application Engineers, leaning more towards operations and stability assurance rather than performance optimization itself (specific technical choices for performance optimization are generally decided by Infra Engineers; MLOps is responsible for landing those choices into a monitorable, operable production system).
Secondary understanding: Model formats (knowing the difference between GGUF and Safetensors to help troubleshoot deployment issues).
Basically untouched: Model training, fine-tuning, specific invocation details of model loading libraries.
11.5 Role × Technology Layer Quick Reference Table
| Technology Layer | Algorithm Engineer | Infra Engineer | Application/Full-Stack Engineer | MLOps Engineer |
|---|---|---|---|---|
| Model Training (PyTorch/TensorFlow) | 🔴 Main Battlefield | 🟡 Secondary Understanding | ⚪ Basically Untouched | ⚪ Basically Untouched |
| Model Fine-tuning (LoRA) | 🔴 Main Battlefield | ⚪ Basically Untouched | ⚪ Basically Untouched | ⚪ Basically Untouched |
| Model Formats (Safetensors/GGUF/ONNX) | ⚪ Basically Untouched | 🔴 Main Battlefield | ⚪ Basically Untouched | 🟡 Secondary Understanding |
| Model Loading (Transformers/Diffusers) | 🟡 Secondary Understanding | ⚪ Basically Untouched | 🟡 Secondary Understanding | ⚪ Basically Untouched |
| Model Loading (sentence-transformers) | ⚪ Basically Untouched | ⚪ Basically Untouched | 🔴 Main Battlefield | ⚪ Basically Untouched |
| Inference Optimization (MLX/OpenVINO) | ⚪ Basically Untouched | 🔴 Main Battlefield | ⚪ Basically Untouched | ⚪ Basically Untouched |
| Deployment (Ollama/Llamafile/Xinference) | ⚪ Basically Untouched | 🔴 Main Battlefield | 🟡 Just call the API | 🟡 Secondary Understanding |
| Deployment (vLLM/SGLang Production Services) | ⚪ Basically Untouched | 🔴 Main Battlefield | 🟡 Just call the API | 🔴 Main Battlefield (Ops Assurance) |
11.6 Technical Map from Role Perspectives
12. Essential AI Engineer Tech Stack Quick Reference Table
| Category | Technology | One-Line Positioning |
|---|---|---|
| Training Frameworks | PyTorch, TensorFlow | Train AI models |
| Fine-tuning Tech | LoRA | Low-cost model fine-tuning |
| Model Formats | Safetensors, GGUF, ONNX | Save, convert, and distribute models |
| Model Libraries | Transformers, Diffusers, sentence-transformers | Load and use different types of models |
| Inference Optimization | MLX, OpenVINO | Optimize model inference for hardware |
| Inference Engine | llama.cpp | Core runtime engine for GGUF models |
| Deployment Tools | Ollama, Llamafile | Quickly run models locally |
| Deployment Platform | Xinference | Unified deployment and management of multiple AI models |
| Inference Service Frameworks | vLLM, SGLang | High-concurrency production-grade model services |
Summary
The purpose of a technical map is not to master every part of it, but to locate yourself on it. An algorithm engineer doesn't need to understand the scheduling details of vLLM, and an Infra engineer doesn't need to delve into the mathematical derivation of LoRA. Knowing the full map and clearly understanding which part you should dig deep into is far more important than blindly trying to learn everything.
If this article helped you clarify a term you never quite understood, or helped you confirm which direction to dive into next, then it has served its purpose.