跪拜 Guibai
AIGC · AI Programming · Artificial Intelligence

The AI Engineer's Map: Where Every Model Format, Framework, and Deployment Tool Actually Belongs

By 吴佳浩Alben
Read original on juejin.cn

Without a pipeline model, every new tool looks like yet another thing to learn, and teams waste months evaluating tools that solve problems they don't have. Slotting a tool into its correct stage—and knowing which role owns that stage—turns an overwhelming landscape into a set of deliberate, scoped decisions.

Summary

The flood of AI tooling names (GGUF, Safetensors, ONNX, LoRA, vLLM, SGLang, Ollama, MLX) describes not competing alternatives but distinct stages in a single pipeline. Training happens in PyTorch; fine-tuning uses LoRA to shrink a full parameter update into a tiny, pluggable delta file. The resulting weights are saved in Safetensors for security, GGUF for quantized local CPU inference, or ONNX for cross-platform enterprise deployment. Loading libraries such as Transformers, Diffusers, and sentence-transformers provide unified interfaces, while inference optimizers such as MLX and OpenVINO target specific silicon. At the end of the line, deployment splits among local-first tools (Ollama, Llamafile), enterprise platforms (Xinference), and high-concurrency serving engines (vLLM with PagedAttention, SGLang with RadixAttention).

Each stage maps to a different engineering role. Algorithm engineers own training and fine-tuning. Infra engineers own format conversion, inference optimization, and deployment. Application engineers consume deployed APIs and work with embedding models. MLOps engineers run the production serving layer. The map is not a curriculum to complete but a compass for deciding what to ignore.

Takeaways
PyTorch dominates training because its dynamic computation graph allows line-by-line Python debugging, and the entire Hugging Face ecosystem is built on it.
LoRA freezes a model's original weights and inserts tiny low-rank matrices, cutting trainable parameters to 0.1–1% of the full model and making consumer-GPU fine-tuning practical.
Safetensors replaces pickle-based weight files with a pure-data format that eliminates remote code execution risk and loads faster via zero-copy and memory mapping.
GGUF bundles quantized weights, tokenizer, and metadata into a single file, enabling CPU-only inference on laptops and phones through the llama.cpp runtime.
ONNX acts as a framework-neutral intermediate representation; a model exported once can run on NVIDIA GPUs, Intel CPUs, or the Apple Neural Engine via ONNX Runtime execution providers.
Ollama wraps llama.cpp into a one-command local deployment with an HTTP API, targeting individual developers and demos, not production concurrency.
vLLM uses PagedAttention to manage KV cache in non-contiguous blocks and Continuous Batching to insert new requests mid-batch, maximizing GPU throughput for multi-user serving.
SGLang adds RadixAttention, a radix-tree-based prefix cache that reuses KV cache across requests sharing common prefixes, giving it an edge in multi-turn chat and agent workloads.
Xinference is a private model-as-a-service platform that unifies deployment of LLMs, embedding models, rerankers, and ASR under one API gateway with lifecycle management.
Llamafile compiles model weights and the llama.cpp engine into a single cross-platform executable using Cosmopolitan Libc, requiring zero dependencies on the target machine.
MLX exploits Apple Silicon's unified memory to avoid CPU-GPU data copies, while OpenVINO optimizes inference across Intel CPUs, GPUs, and NPUs from a single IR format.
Algorithm engineers own training and fine-tuning; infra engineers own formats, optimization, and deployment; application engineers consume APIs; MLOps engineers run production serving.
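
The PagedAttention takeaway above can be made concrete with a toy block allocator. This is a minimal sketch of the bookkeeping idea only, not vLLM's implementation: `BLOCK_SIZE`, `BlockAllocator`, and `Sequence` are hypothetical names, and no real KV tensors are stored.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class BlockAllocator:
    """Hands out fixed-size physical blocks from one shared pool."""
    num_blocks: int
    free_blocks: list = field(default_factory=list)

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


pool = BlockAllocator(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(16):
    a.append_token()  # fills a's first block
for _ in range(5):
    b.append_token()  # b claims the next free block
for _ in range(4):
    a.append_token()  # a's second block is not adjacent to its first

print(a.block_table, b.block_table, len(pool.free_blocks))
```

Because each sequence grows one block at a time, its blocks need not be adjacent, and memory is reserved only as tokens actually arrive rather than for a worst-case length up front.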
Conclusions

The proliferation of model formats is not fragmentation but specialization: Safetensors solves security, GGUF solves local CPU inference, and ONNX solves cross-platform enterprise deployment.
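
To illustrate why a pure-data weight file carries no code execution risk, here is a simplified sketch modeled on the published Safetensors layout: an 8-byte little-endian header length, a JSON header describing each tensor, then raw bytes. It uses only the standard library and is not the `safetensors` package; the function names and the file name are hypothetical, and real files should be read and written through the official library.

```python
import json
import os
import struct
import tempfile


def write_tensors(path, tensors):
    """tensors maps name -> (dtype string, shape list, raw bytes)."""
    header, offset, chunks = {}, 0, []
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            # Offsets are relative to the start of the data section.
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        chunks.append(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        f.write(b"".join(chunks))


def read_tensor(path, name):
    """Return (metadata, raw bytes) for one tensor; nothing is executed."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        begin, end = header[name]["data_offsets"]
        f.seek(8 + header_len + begin)
        return header[name], f.read(end - begin)


path = os.path.join(tempfile.mkdtemp(), "demo_weights.bin")
write_tensors(path, {"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
meta, raw = read_tensor(path, "w")
values = struct.unpack("<2f", raw)
print(meta, values)
```

Loading here is only JSON parsing plus a byte-range read, which is also why such a file lends itself to memory mapping: the reader knows each tensor's exact offsets before touching the data.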

LoRA's real impact is not just memory savings but an economic shift—fine-tuning moves from a server-farm activity to something a single developer does on a gaming GPU, which explains the explosion of community fine-tunes on Hugging Face.
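
The parameter savings follow from simple arithmetic. As a rough sketch, assume a single weight matrix of shape `d_out × d_in` is adapted with two low-rank factors of rank `r`; the dimensions below (4096 and rank 8) are illustrative assumptions, not figures from the article.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters trained when a d_out x d_in matrix is frozen
    and replaced by a trainable update B @ A with B: d_out x rank and
    A: rank x d_in."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return lora / full


fraction = lora_trainable_fraction(4096, 4096, 8)
print(f"{fraction:.4%}")
```

This is the fraction for one adapted matrix. The share across a whole model depends on which layers receive adapters and on the chosen rank.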

Ollama's success reveals that developer experience, not raw performance, is the bottleneck for local LLM adoption; a single `ollama run` command beat years of complex setup scripts.
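
Once a model is running locally, application code talks to it over plain HTTP. The sketch below assumes Ollama is listening on its default local port and that a model has already been pulled; the model name `llama3` and the helper names are placeholders for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request; requires a running Ollama server."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


req = build_request("llama3", "Why is the sky blue?")
print(req.full_url, req.get_method())
```

Calling `generate("llama3", "...")` performs the actual round trip; it is left out of the module-level code so the example runs without a server.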

The vLLM vs. SGLang split mirrors a broader pattern: vLLM optimizes for generic high-throughput serving, while SGLang optimizes for the specific access patterns of agentic and multi-turn workloads where prefix reuse dominates.
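
Why prefix reuse matters for multi-turn workloads can be shown with a toy cache. This sketch uses a plain per-token trie rather than a compressed radix tree, and it records only which token prefixes have been seen instead of holding KV tensors; `PrefixCache` and `serve` are hypothetical names, not SGLang's API.

```python
class PrefixCache:
    """Toy prefix index: nested dicts keyed by token id."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full token sequence as cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (tokens reused from cache, tokens that must be computed)."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # stands in for a shared system prompt
first = serve(cache, system_prompt + [10, 11])
second = serve(cache, system_prompt + [20])
third = serve(cache, system_prompt + [10, 11, 12])
print(first, second, third)
```

The second request reuses the shared system prompt, and the third, a follow-up turn of the first conversation, recomputes only its newest token.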

Xinference's multi-model platform approach acknowledges that production AI systems are rarely just an LLM—they are pipelines of embedding, reranking, generation, and speech models that need coordinated deployment.

Llamafile's cross-platform executable is a distribution hack, not a serving architecture; its value is eliminating the 'install Python, create a venv, pip install...' ritual for non-technical users.

The role-based map is the most underrated insight: an algorithm engineer who tries to learn vLLM scheduling or an infra engineer who studies LoRA math is optimizing the wrong variable. Knowing what not to learn is the real skill.

TensorFlow's decline is a cautionary tale about ecosystem gravity—technical merit alone could not overcome the community's collective decision to standardize on PyTorch for research and publishing.

Concepts & terms
LoRA (Low-Rank Adaptation)
A fine-tuning method that freezes a pre-trained model's weights and inserts small, trainable low-rank matrices into attention layers. It reduces trainable parameters by 99%+, enabling fine-tuning on consumer GPUs and producing tiny, swappable adapter files.
Safetensors
A model weight format that stores only tensor data (shape, dtype, raw bytes) with no executable code, eliminating the remote code execution risk of pickle-based formats. Supports zero-copy and memory-mapped loading for speed.
GGUF (GGML Universal Format)
A self-contained model file format from the llama.cpp project that packages quantized weights, tokenizer, and metadata into a single file. Designed for CPU-friendly, low-precision inference on consumer hardware without a GPU.
ONNX (Open Neural Network Exchange)
A framework-agnostic intermediate representation for neural networks. Models exported from PyTorch or TensorFlow to ONNX can run on diverse hardware via ONNX Runtime execution providers (CUDA, OpenVINO, CoreML, etc.).
PagedAttention
vLLM's core innovation that manages the KV cache in non-contiguous blocks, analogous to virtual memory paging. It nearly eliminates fragmentation waste (only a sequence's last block can be partly empty), allowing more requests to share limited GPU memory.
Continuous Batching
A serving technique where new inference requests can join an in-progress batch immediately, rather than waiting for the current batch to finish. It keeps the GPU fed and significantly increases throughput under concurrent load.
RadixAttention
SGLang's prefix-caching mechanism that uses a radix tree to automatically detect and reuse KV cache for common prefixes across requests (e.g., shared system prompts or conversation history), avoiding redundant computation.
MLX
Apple's machine learning framework optimized for M-series chips. It leverages unified memory (CPU and GPU share one pool) to eliminate data copies, with a NumPy/PyTorch-like API for familiar ergonomics.
OpenVINO
Intel's inference optimization toolkit that converts models to an intermediate representation and then executes them on Intel CPUs, GPUs, or NPUs using hardware-specific instruction sets and runtimes.
Cosmopolitan Libc
A C library that produces 'Actually Portable Executables': polyglot binaries laid out so that a single file can be launched natively on Windows, macOS, Linux, and the BSDs without a separate build per operating system.
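
Continuous batching, as defined in the terms above, can be compared with static batching in a small simulation. This is a simplified sketch under stated assumptions: each request is reduced to the number of tokens it still has to generate (always at least one), every decode step emits one token per running request, and the function names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Count decode steps when waiting requests join as soon as a slot frees."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: each running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Count decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


lengths = [2, 8, 2, 2]  # tokens to generate per request (illustrative)
continuous = continuous_batching_steps(lengths, max_batch=2)
static = static_batching_steps(lengths, max_batch=2)
print(continuous, static)
```

In the static case the short request in the first batch finishes early and its slot sits idle until the long request completes; the continuous scheduler hands that slot to the next waiting request instead.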