GGUF vs SafeTensors: A Comparison of Two Model Storage Formats

Subjects: GGUF (llama.cpp ecosystem) vs SafeTensors (HuggingFace ecosystem)
Prerequisite reading: llama.cpp quantization explained · GPTQ · AWQ

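In one sentence: SafeTensors is the safe interchange format on the training side, and GGUF is the self-contained distribution package on the deployment side. The former is about "storing it safely"; the latter is about "download it and run".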

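1. Why Two Formats Are Needed

A deep learning model is essentially a pile of tensors plus metadata describing how to assemble them. But when it comes to how that pile of tensors is stored, training and deployment have very different requirements: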

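Training side:                                Deployment side:
├─ Needs original precision (F32/BF16/F16)    ├─ Needs quantization (Q4_K_M / Q6_K / IQ2_S)
├─ Sharded files → parallel multi-GPU load    ├─ Single file → copy one file and run
├─ Weights/config separate → mix and match    ├─ Everything bundled → no external files needed
├─ Safe deserialization → replaces pickle     ├─ mmap zero-copy → fast startup on CPU/edge
└─ Ecosystem: transformers / vLLM / TGI       └─ Ecosystem: llama.cpp / ollama / LM Studio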

SafeTensors addresses the former; GGUF addresses the latter.

2. File Structure, Byte by Byte

2.1 GGUF: A Self-Contained Binary Format

┌─────────────────────────────────────────────────────────┐
│                    GGUF file layout                     │
├──────────────────────┬──────────────────────────────────┤
│ Magic Number         │ "GGUF" (0x46554747 as LE u32) 4B │
│ Version              │ uint32 (currently v3)         4B │
│ Tensor Count         │ uint64                        8B │
│ Metadata KV Count    │ uint64                        8B │
├──────────────────────┴──────────────────────────────────┤
│                                                         │
│  Metadata KV pairs (variable length)                    │
│  ┌─────────────────────────────────────────────────┐    │
│  │ general.architecture       = "llama"            │    │
│  │ general.name               = "Meta-Llama-3.1-8B"│    │
│  │ llama.context_length       = 131072             │    │
│  │ llama.embedding_length     = 4096               │    │
│  │ llama.block_count          = 32                 │    │
│  │ llama.attention.head_count = 32                 │    │
│  │ tokenizer.ggml.model       = "gpt2"             │    │
│  │ tokenizer.ggml.tokens      = ["!", "\"", ...]   │    │
│  │ tokenizer.ggml.token_type  = [1, 1, 1, ...]     │    │
│  │ tokenizer.ggml.merges      = ["Ġ t", "Ġ a", ...]│    │
│  │ ... (typically a few dozen KV pairs)            │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Tensor info array (per-tensor metadata)                │
│  ┌─────────────────────────────────────────────────┐    │
│  │ name: "token_embd.weight"                       │    │
│  │ n_dims: 2, dims: [4096, 128256]                 │    │
│  │ type: Q8_0                                      │    │
│  │ offset: 0x00000000                              │    │
│  ├─────────────────────────────────────────────────┤    │
│  │ name: "blk.0.attn_q.weight"                     │    │
│  │ n_dims: 2, dims: [4096, 4096]                   │    │
│  │ type: Q4_K                                      │    │
│  │ offset: 0x...                                   │    │
│  ├─────────────────────────────────────────────────┤    │
│  │ ... (hundreds of tensors)                       │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
├──────────────── Alignment Padding ──────────────────────┤
│                                                         │
│  Tensor data (quantized weights, stored contiguously)   │
│  ┌─────────────────────────────────────────────────┐    │
│  │ ▓▓▓▓▓▓▓▓▓▓ token_embd.weight (Q8_0)             │    │
│  │ ▓▓▓▓▓▓▓▓▓▓ blk.0.attn_q.weight (Q4_K)           │    │
│  │ ▓▓▓▓▓▓▓▓▓▓ blk.0.attn_k.weight (Q4_K)           │    │
│  │ ▓▓▓▓▓▓▓▓▓▓ blk.0.attn_v.weight (Q6_K)           │    │
│  │ ▓▓▓▓▓▓▓▓▓▓ ...                                  │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
└─────────────────────────────────────────────────────────┘

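Key design: a three-part layout of metadata → tensor directory → tensor data. Everything fits in one file. After mmap, each tensor's offset (relative to the start of the tensor data section) locates it directly, with no JSON to parse and no extra files to read.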
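To make the fixed part of the layout concrete, here is a minimal Python sketch that packs and parses the 24-byte GGUF preamble (magic, version, tensor count, KV count). The helper names and the counts 291 and 29 are hypothetical values chosen for illustration; the sketch does not read the KV pairs or tensor directory that follow.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the four ASCII bytes at the start of every GGUF file


def build_gguf_header(version, tensor_count, kv_count):
    """Pack the fixed preamble: magic + uint32 version + two uint64 counts (little-endian)."""
    return GGUF_MAGIC + struct.pack("<IQQ", version, tensor_count, kv_count)


def parse_gguf_header(buf):
    """Return (version, tensor_count, kv_count) from the first 24 bytes of buf."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return struct.unpack_from("<IQQ", buf, 4)


# Hypothetical counts, for illustration only.
header = build_gguf_header(version=3, tensor_count=291, kv_count=29)
print(len(header), parse_gguf_header(header))
# The magic read as a little-endian uint32:
print(hex(struct.unpack("<I", GGUF_MAGIC)[0]))
```

With a real file, the same `parse_gguf_header` could be applied to the first 24 bytes returned by `open(path, "rb").read(24)`.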

2.2 SafeTensors: JSON Header + Raw Data

┌─────────────────────────────────────────────────────────┐
│                 SafeTensors file layout                 │
├──────────────────────┬──────────────────────────────────┤
│ Header Size          │ uint64 LE                     8B │
├──────────────────────┴──────────────────────────────────┤
│                                                         │
│  JSON header (human-readable)                           │
│  ┌─────────────────────────────────────────────────┐    │
│  │ {                                               │    │
│  │   "__metadata__": {                             │    │
│  │     "format": "pt"                              │    │
│  │   },                                            │    │
│  │   "model.layers.0.self_attn.q_proj.weight": {   │    │
│  │     "dtype": "BF16",                            │    │
│  │     "shape": [4096, 4096],                      │    │
│  │     "data_offsets": [0, 33554432]               │    │
│  │   },                                            │    │
│  │   "model.layers.0.self_attn.k_proj.weight": {   │    │
│  │     "dtype": "BF16",                            │    │
│  │     "shape": [1024, 4096],                      │    │
│  │     "data_offsets": [33554432, 41943040]        │    │
│  │   },                                            │    │
│  │   ...                                           │    │
│  │ }                                               │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Tensor data (original precision, contiguous by offset) │
│  ┌─────────────────────────────────────────────────┐    │
│  │ ████████████ q_proj.weight (BF16, 32MB)         │    │
│  │ ████████████ k_proj.weight (BF16, 8MB)          │    │
│  │ ████████████ v_proj.weight (BF16, 8MB)          │    │
│  │ ████████████ ...                                │    │
│  └─────────────────────────────────────────────────┘    │
│                                                         │
└─────────────────────────────────────────────────────────┘

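Companion files (stored separately from the weights):
├── config.json              ← model architecture parameters
├── tokenizer.json           ← vocabulary
├── tokenizer_config.json    ← tokenizer configuration
├── special_tokens_map.json  ← special token mapping
├── generation_config.json   ← generation parameters
└── model.safetensors.index.json  ← shard index (required for sharded models)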

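Key design: the header is standard JSON, and data_offsets gives each tensor's byte range within the data section, which supports zero-copy mmap. But the file stores only the tensors themselves; architecture, vocabulary, and hyperparameters all live in external files.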
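The header-plus-offsets design is simple enough to sketch with the Python standard library alone. The two functions below are illustrative helpers, not the `safetensors` library API: one writes a tiny file image in memory, the other reads the 8-byte length, parses the JSON, and slices a tensor's bytes by its `data_offsets`. The real library also pads the header for alignment, which this sketch skips.

```python
import json
import struct


def build_safetensors(tensors):
    """tensors: dict of name -> (dtype, shape, raw_bytes). Returns the file bytes."""
    header, blobs, pos = {}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [pos, pos + len(raw)]}
        blobs.append(raw)
        pos += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Layout: uint64 LE header length, JSON header, then the raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(blobs)


def read_tensor_bytes(buf, name):
    """Return (header entry, raw bytes) for one tensor."""
    (header_len,) = struct.unpack_from("<Q", buf, 0)
    header = json.loads(buf[8:8 + header_len])
    begin, end = header[name]["data_offsets"]
    data_start = 8 + header_len  # offsets are relative to the data section
    return header[name], bytes(buf[data_start + begin:data_start + end])


blob = build_safetensors({
    "a": ("U8", [4], bytes([1, 2, 3, 4])),
    "b": ("F32", [1], struct.pack("<f", 1.5)),
})
info, raw = read_tensor_bytes(blob, "b")
print(info["dtype"], info["data_offsets"], struct.unpack("<f", raw)[0])
```

Because only the header has to be parsed to locate a tensor, a loader can mmap the file and touch just the byte ranges it needs.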

3. Core Differences

3.1 Self-Contained vs Modular

Deploying a model with GGUF:         Deploying a model with SafeTensors:
.                                    .
└── llama-3.1-8b-Q4_K_M.gguf         ├── config.json
    (4.9 GB, copy it and run)        ├── tokenizer.json
                                     ├── tokenizer_config.json
                                     ├── special_tokens_map.json
                                     ├── generation_config.json
                                     ├── model.safetensors.index.json
                                     ├── model-00001-of-00004.safetensors
                                     ├── model-00002-of-00004.safetensors
                                     ├── model-00003-of-00004.safetensors
                                     └── model-00004-of-00004.safetensors
                                         (~16 GB total; shards, config, and tokenizer must all be present)

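GGUF's philosophy is the "bootable USB stick": plug it in and it works. SafeTensors' philosophy is "componentization": each part is versioned independently and can be swapped freely.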
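On the SafeTensors side, the glue between the shards is model.safetensors.index.json, whose weight_map records which shard holds each tensor. The sketch below groups tensor names by shard; the index content is a hand-written, abbreviated example (the tensor-to-shard assignments and total_size are illustrative, not copied from a real repository), and `tensors_by_shard` is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Abbreviated, illustrative index; a real one lists every tensor in the model.
index_json = """{
  "metadata": {"total_size": 16060522496},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
    "lm_head.weight": "model-00004-of-00004.safetensors"
  }
}"""


def tensors_by_shard(index_text):
    """Map each shard filename to the list of tensor names stored in it."""
    weight_map = json.loads(index_text)["weight_map"]
    shards = defaultdict(list)
    for name, filename in weight_map.items():
        shards[filename].append(name)
    return dict(shards)


shards = tensors_by_shard(index_json)
for filename, names in shards.items():
    print(filename, len(names))
```

This is what makes per-shard parallel loading possible: each worker can open only the shard files that contain the tensors it owns.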

3.2 Quantization Support

Format	Supported data types	Typical 8B model size
SafeTensors	F64 / F32 / F16 / BF16 / I64 / I32 / I16 / I8 / U8 / BOOL	~16 GB (BF16)
GGUF	F32 / F16 / BF16 / Q8_0 / Q6_K / Q5_K / Q4_K / Q3_K / Q2_K / IQ4_XS / IQ3_S / IQ2_S / …	~4.9 GB (Q4_K_M)

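SafeTensors stores only standard dtypes; quantization is left to external tools (GPTQ/AWQ int4 weights are stored as packed I32 tensors plus separate scales). GGUF builds quantization into the format: each Q type defines its own block structure and scale/min encoding, and the loader decodes tensors directly according to that type. Names such as Q4_K_M are llama-quantize presets that mix tensor types (mostly Q4_K, with some tensors in Q6_K), not tensor types themselves.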
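Q8_0 is the simplest block type and shows what "block structure" means in practice: 32 weights share one fp16 scale, so a block is 2 + 32 = 34 bytes, or 8.5 bits per weight. The sketch below is a simplified standard-library illustration of that layout, not the ggml implementation; in particular, rounding the scale through fp16 before quantizing is a choice made here for clarity.

```python
import struct

QK8_0 = 32  # weights per Q8_0 block


def quantize_q8_0(values):
    """Pack 32 floats into one 34-byte block: fp16 scale followed by 32 int8 values."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    # Round the scale through fp16 so the encoder and decoder use the same value.
    d = struct.unpack("<e", struct.pack("<e", amax / 127.0))[0]
    qs = [max(-127, min(127, round(v / d))) if d else 0 for v in values]
    return struct.pack("<e32b", d, *qs)


def dequantize_q8_0(block):
    """Recover approximate floats: each weight is scale * int8 value."""
    d, *qs = struct.unpack("<e32b", block)
    return [d * q for q in qs]


weights = [i / 31 - 0.5 for i in range(QK8_0)]  # 32 values in [-0.5, 0.5]
block = quantize_q8_0(weights)
restored = dequantize_q8_0(block)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(block), len(block) * 8 / QK8_0, max_err)
```

The K-quant and IQ types follow the same idea with larger super-blocks and more compact scale encodings, which is why the loader needs to know each tensor's type before it can decode it.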

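3.3 Header Format

GGUF: binary KV encoding
┌──────────┬──────┬──────────────────────────────┐
│ key_len  │ type │ value                        │
│ (uint64) │(u32) │ (fixed/variable size by type)│
├──────────┼──────┼──────────────────────────────┤
│ 20       │ 8    │ "llama"                      │  ← general.architecture
│ 20       │ 4    │ 131072                       │  ← llama.context_length
│ ...      │      │                              │
└──────────┴──────┴──────────────────────────────┘
Each key is a length-prefixed string: key_len is followed by the key's UTF-8 bytes, then the type tag, then the value. String values use the same uint64 length + bytes encoding.
Supported types: uint8/16/32/64, int8/16/32/64, float32/64, bool, string, array

SafeTensors: JSON text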
{
  "model.layers.0.self_attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [4096, 4096],
    "data_offsets": [0, 33554432]
  }
}

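GGUF's binary KV encoding is fast to parse but not human-readable. SafeTensors' JSON header is readable and easy to debug, and it grows with the number of tensors: at roughly a hundred-plus bytes per entry, a model with thousands of tensors has a header in the hundreds of KB.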
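The binary KV encoding can be sketched in a few lines. The helpers below are hypothetical names and handle only the two value types shown in the table (type 4 = uint32, type 8 = string); a full reader would cover the remaining types, including nested arrays.

```python
import struct

GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8


def pack_string(s):
    """GGUF string: uint64 length followed by UTF-8 bytes (no terminator)."""
    raw = s.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


def pack_kv(key, value):
    """Encode one KV pair; only str and uint32 values are handled in this sketch."""
    if isinstance(value, str):
        return pack_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + pack_string(value)
    return pack_string(key) + struct.pack("<II", GGUF_TYPE_UINT32, value)


def unpack_string(buf, pos):
    (n,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    return buf[pos:pos + n].decode("utf-8"), pos + n


def unpack_kv(buf, pos):
    """Decode one KV pair starting at pos; return (key, value, next_pos)."""
    key, pos = unpack_string(buf, pos)
    (vtype,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if vtype == GGUF_TYPE_STRING:
        value, pos = unpack_string(buf, pos)
    elif vtype == GGUF_TYPE_UINT32:
        (value,) = struct.unpack_from("<I", buf, pos)
        pos += 4
    else:
        raise NotImplementedError(f"value type {vtype} is not handled in this sketch")
    return key, value, pos


blob = pack_kv("general.architecture", "llama") + pack_kv("llama.context_length", 131072)
key1, val1, pos = unpack_kv(blob, 0)
key2, val2, pos = unpack_kv(blob, pos)
print(key1, val1, key2, val2, pos)
```

There is no delimiter scanning or escaping: every field announces its own length or has a fixed size, which is what makes the format quick to walk.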

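3.4 Security

Both are safe formats, and that is the core reason they replaced their predecessors.

Dangerous predecessors:
├── PyTorch .bin/.pt  → pickle underneath; deserializing can execute arbitrary Python code
└── NumPy .npy        → object arrays are pickled; safe only while np.load keeps allow_pickle=False (the default)

Safe replacements:
├── SafeTensors → stores only a header (JSON) + data (raw bytes); no code execution path
└── GGUF        → stores only KV metadata + tensor data; no code execution path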

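The direct motivation for SafeTensors was the security risk of malicious .bin files on the HuggingFace Hub: people have planted reverse shells inside pickles. GGUF inherits the design of its predecessor GGML and never had this problem.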
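The difference is easy to demonstrate. In the sketch below, a hypothetical `Payload` class uses `__reduce__` to tell pickle which callable to invoke at load time; here it is a harmless `eval("6 * 7")`, but an attacker would substitute something like `os.system`. Parsing a JSON header, by contrast, can only ever produce dicts, lists, strings, and numbers.

```python
import json
import pickle


class Payload:
    """Hypothetical malicious object: __reduce__ names a callable for pickle to run on load."""

    def __reduce__(self):
        # A real attack would return something like (os.system, ("...",)).
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the callable runs here, during deserialization
print(result)

# A JSON header is pure data: parsing it never calls anything it contains.
header = json.loads('{"w": {"dtype": "BF16", "shape": [2, 2], "data_offsets": [0, 8]}}')
print(type(header["w"]).__name__, header["w"]["shape"])
```

Note that `pickle.loads` never even reconstructs a `Payload` instance: the file's author decides what gets called, and the loader has no say.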

4. Data Flow: From Training to Deployment

         ┌─────────────────┐
         │ Training output │
         │ (PyTorch ckpt)  │
         └────────┬────────┘
                  │
         ┌────────▼────────┐
         │    Saved as     │
         │  .safetensors   │
         │ (orig precision)│
         └───┬─────────┬───┘
             │         │
┌────────────▼───┐  ┌──▼─────────────────┐
│  GPU inference │  │  Convert + quantize│
│  transformers  │  │  llama.cpp convert │
│  vLLM / TGI    │  │  + quantize        │
│  (F16/BF16)    │  │                    │
└────────────────┘  └──────────┬─────────┘
                               │
                    ┌──────────▼─────────┐
                    │  .gguf             │
                    │  (Q4_K_M / Q6_K)   │
                    │                    │
                    │  llama.cpp / ollama│
                    │  LM Studio         │
                    │  (CPU / edge)      │
                    └────────────────────┘

Typical workflow:

# 1. Download the SafeTensors weights from HuggingFace into a local directory
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./Llama-3.1-8B

# 2. Convert to GGUF at F16 precision
python convert_hf_to_gguf.py ./Llama-3.1-8B --outtype f16 --outfile llama-3.1-8b-f16.gguf

# 3. Quantize
llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M

# 4. Deploy
llama-server -m llama-3.1-8b-Q4_K_M.gguf -c 8192
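A quick back-of-the-envelope check helps when planning disk space for steps 2 and 3: file size is roughly parameter count times bits per weight. In the sketch below, the 8.03e9 parameter count and the ~4.9 bits per weight for Q4_K_M are assumptions for an 8B-class model (the effective figure depends on the tensor mix); 16 and 8.5 bits follow directly from the BF16 and Q8_0 layouts.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes), ignoring metadata and padding."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8.03e9  # assumed parameter count for an 8B-class model

# Bits per weight: BF16 and Q8_0 follow from their layouts; Q4_K_M is an approximation.
for label, bpw in [("BF16", 16.0), ("Q8_0", 8.5), ("Q4_K_M (approx.)", 4.9)]:
    print(f"{label:>18}: {estimate_size_gb(N_PARAMS, bpw):.1f} GB")
```

The intermediate F16 file from step 2 is about as large as the original SafeTensors download, so the conversion needs room for both until the quantized file is written.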

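5. Cheat Sheet

Dimension	GGUF (llama.cpp)	SafeTensors (HuggingFace)
Design goal	Inference deployment (CPU/edge first)	Safe storage and interchange
File count	Single file, self-contained	Multiple files (weights + config + vocabulary)
Metadata	Architecture, hyperparameters, and vocabulary all inside	Tensor shape/dtype only (plus an optional string-only __metadata__ map)
Quantization	30+ native quantization formats	Standard dtypes only
Sharding	Usually a single file (split optional)	Large models sharded automatically
Security	No code execution risk	No code execution risk
mmap	Supported	Supported (zero-copy)
Header	Binary KV pairs	JSON (human-readable)
Typical size (8B)	~4.9 GB (Q4_K_M)	~16 GB (BF16)
Ecosystem	llama.cpp / ollama / LM Studio	transformers / vLLM / TGI
Predecessor	GGML / GGJT v3	PyTorch .bin (pickle)
Maintainer	ggml-org (Georgi Gerganov)	Hugging Face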

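6. How to Choose

When to use SafeTensors:

Inference or fine-tuning on GPUs with transformers / vLLM / TGI
Distributing a model on the HuggingFace Hub
Multi-GPU splitting with Tensor Parallelism / Pipeline Parallelism
Keeping the model at its original precision (F16/BF16)

When to use GGUF:

Local inference on a CPU or Apple Silicon
Aggressive compression (Q2 to Q4 quantization)
A "download one file and run" experience
Desktop tools such as ollama / LM Studio

When to use both (the most common case):

Store SafeTensors on the Hub (the standard interchange format)
Convert to GGUF for local deployment (the final deployment package)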

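They are not competitors but upstream and downstream stages of one pipeline: SafeTensors is the standard interchange format on the training side, and GGUF is the final artifact on the deployment side.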