
Advanced PyTorch in Practice: Transfer Learning, Mixed Precision, and Model Deployment

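Chapter goals

Fine-tune pretrained models with transfer learning
Speed up training with automatic mixed precision (AMP)
Export models: TorchScript → ONNX
Quantize models and optimize inference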

1. Transfer Learning (Fine-tuning)

1.1 Using torchvision pretrained models

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Load a pretrained ResNet50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all parameters (feature-extraction mode)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match your own number of classes
num_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, 10),  # assuming 10 classes
)
# Parameters of the new layers default to requires_grad=True

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"Frozen parameters: {sum(p.numel() for p in model.parameters() if not p.requires_grad):,}")
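
As a quick sanity check on the freezing logic above, the following minimal sketch runs one optimizer step on a tiny stand-in network and confirms that only the new head changes. The `backbone` and `head` modules, the layer sizes, and the learning rate are illustrative assumptions, not part of the ResNet50 setup, so no weights are downloaded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a pretrained backbone plus a new classification head (hypothetical sizes)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 3)
toy_model = nn.Sequential(backbone, head)

# Freeze the backbone, in the same way the ResNet50 parameters are frozen
for param in backbone.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable
trainable = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

# One training step on random data
inputs = torch.randn(4, 8)
targets = torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(toy_model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(backbone_before, backbone.parameters())
)
head_changed = any(
    not torch.equal(before, after)
    for before, after in zip(head_before, head.parameters())
)
print(f"backbone unchanged: {backbone_unchanged}, head changed: {head_changed}")
```

Passing only the trainable parameters to the optimizer is optional, since parameters without gradients are skipped anyway, but it makes the intent explicit.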

1.2 Progressive unfreezing

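import math
import torch.optim as optim

def progressive_unfreeze(model, epoch, total_epochs):
    """Unfreeze more layers on a schedule; return True if anything was unfrozen this epoch."""
    changed = False
    if epoch == math.ceil(total_epochs * 0.5):
        # Second half of training: unfreeze layer4
        for param in model.layer4.parameters():
            param.requires_grad = True
        changed = True
    if epoch == math.ceil(total_epochs * 0.75):
        # Last 25%: also unfreeze layer3
        for param in model.layer3.parameters():
            param.requires_grad = True
        changed = True
    return changed

# Usage in the training loop (total_epochs and LR are assumed to be defined earlier)
for epoch in range(total_epochs):
    if progressive_unfreeze(model, epoch, total_epochs):
        # Rebuild the optimizer at every unfreezing step, before training,
        # so the newly unfrozen parameters are actually updated
        optimizer = optim.Adam(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=LR * 0.1  # lower learning rate
        )
    # ... training code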
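
To see which parts of the network are trainable in each epoch under the 50% and 75% thresholds above, here is a small pure-Python sketch. The helper name `unfreeze_schedule` and the stage labels are hypothetical and only mirror the thresholds; the sketch touches no model.

```python
import math

def unfreeze_schedule(total_epochs):
    """Map each epoch index to the ResNet stages that are trainable in that epoch."""
    layer4_start = math.ceil(total_epochs * 0.5)   # first epoch with epoch >= 50%
    layer3_start = math.ceil(total_epochs * 0.75)  # first epoch with epoch >= 75%
    schedule = {}
    for epoch in range(total_epochs):
        stages = ["fc"]  # the new head is trainable from the start
        if epoch >= layer4_start:
            stages.append("layer4")
        if epoch >= layer3_start:
            stages.append("layer3")
        schedule[epoch] = stages
    return schedule

schedule = unfreeze_schedule(8)
for epoch, stages in schedule.items():
    print(epoch, stages)
```

Printing the schedule before a long run is a cheap way to catch off-by-one surprises, for example when `total_epochs` is odd and the thresholds fall between two epochs.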

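1.3 Different learning rates for different layers

import torch.optim as optim

# Higher learning rate for the new layers, lower for the pretrained ones.
# layer3 and layer4 must already be unfrozen (requires_grad=True) for their groups to be updated.
optimizer = optim.SGD([
    {"params": model.fc.parameters(), "lr": 1e-3},        # new head: high lr
    {"params": model.layer4.parameters(), "lr": 1e-4},    # top stage: medium lr
    {"params": model.layer3.parameters(), "lr": 1e-5},    # lower stage: low lr
], momentum=0.9, weight_decay=1e-4)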
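
The per-group settings can be inspected through `optimizer.param_groups`. The sketch below uses a hypothetical three-layer stand-in (`net`) instead of ResNet50 and shows that options given outside the group list, such as `momentum`, act as defaults for every group, while a group without its own `"lr"` falls back to the optimizer-level `lr`.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for the layer3 / layer4 / fc stages
net = nn.ModuleDict({
    "layer3": nn.Linear(4, 4),
    "layer4": nn.Linear(4, 4),
    "fc": nn.Linear(4, 2),
})

optimizer = optim.SGD(
    [
        {"params": net["fc"].parameters(), "lr": 1e-3},
        {"params": net["layer4"].parameters(), "lr": 1e-4},
        {"params": net["layer3"].parameters()},  # no "lr" here: uses the default below
    ],
    lr=1e-5,
    momentum=0.9,
    weight_decay=1e-4,
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
group_momenta = [group["momentum"] for group in optimizer.param_groups]
print(group_lrs)
print(group_momenta)
```

Learning-rate schedulers operate on these same groups, so each group keeps its own base rate when a scheduler is attached.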

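2. Mixed-Precision Training (AMP)

import torch
from torch.amp import GradScaler  # recent PyTorch; older releases use torch.cuda.amp.GradScaler()

scaler = GradScaler("cuda")  # gradient scaler

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        # Forward: autocast picks float16 or float32 per operation
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            outputs = model(images)
            loss = criterion(outputs, labels)

        # Backward: scale the loss so small gradients do not underflow in float16
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # unscales gradients; skips the step if they contain inf/NaN
        scaler.update()  # adjust the scale factor dynamically

# Typical gains are roughly 40% less GPU memory and a 1.5-2x speedup; both depend on model and hardware
print("AMP training finished")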

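How AMP works, in brief

Standard FP32 training:
  every value is float32 (4 bytes each) → high memory use, slower

AMP mixed precision:
  most operations run in float16 (2 bytes: faster, less memory)
  precision-sensitive operations (softmax, loss) stay in float32
  GradScaler keeps small gradients from flushing to 0 in float16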
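
The underflow problem and the loss-scaling fix can be reproduced without a GPU. The sketch below uses the standard library's half-precision format code (`"e"` in `struct`) to round values to float16. The gradient value and the helper `to_float16` are illustrative assumptions, and the scale of 2**16 is chosen to match what GradScaler uses as its initial scale by default.

```python
import struct

def to_float16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

true_grad = 1e-8   # a small gradient that float32 represents without trouble
scale = 65536.0    # 2**16, an example loss scale

# Without scaling: the value is below float16's smallest subnormal (about 6e-8)
unscaled_fp16 = to_float16(true_grad)

# With scaling: multiply before the float16 round trip, divide afterwards in full precision
scaled_fp16 = to_float16(true_grad * scale)
recovered = scaled_fp16 / scale

print(f"float16 without scaling: {unscaled_fp16}")
print(f"recovered after scaling: {recovered:.3e}")
```

This is the same idea as `scaler.scale(loss).backward()` followed by the unscaling inside `scaler.step(optimizer)`: the backward pass works on values large enough for float16, and the division happens before the weights are updated.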

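3. Model Export and Deployment

3.1 TorchScript (recommended)

# Option 1: trace (for models without data-dependent control flow)
example_input = torch.randn(1, 3, 224, 224).to(device)
traced_model = torch.jit.trace(model.eval(), example_input)
traced_model.save("resnet50_traced.pt")

# Option 2: script (preserves control flow). Script the module itself,
# because a @torch.jit.script function cannot capture a global nn.Module.
scripted_model = torch.jit.script(model.eval())

# Load and run inference
loaded_model = torch.jit.load("resnet50_traced.pt")
with torch.no_grad():
    output = loaded_model(example_input)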
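
The practical difference between the two methods shows up as soon as `forward` contains a data-dependent branch. The following minimal sketch uses a hypothetical `SignGate` module (not part of ResNet50) to show that tracing bakes in the branch taken by the example input, while scripting keeps both branches.

```python
import warnings

import torch
import torch.nn as nn

class SignGate(nn.Module):
    """Hypothetical module whose output depends on a data-dependent branch."""

    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return -x

gate = SignGate().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

# Tracing records only the operations executed for the example input.
# PyTorch emits a TracerWarning about the branch; it is silenced here for brevity.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(gate, positive)

# Scripting compiles the Python source, so both branches are preserved
scripted = torch.jit.script(gate)

print(traced(negative))    # follows the "x * 2" path recorded during tracing
print(scripted(negative))  # follows the "-x" path, like eager mode
```

For a plain ResNet50, which has no such branches, both methods should produce equivalent graphs; the distinction matters for models with loops or conditionals over tensor values.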

3.2 ONNX export

# Export to ONNX
torch.onnx.export(
    model.eval(),
    example_input,                           # sample input
    "resnet50.onnx",                         # output path
    export_params=True,
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},          # dynamic batch dimension
        "output": {0: "batch_size"},
    },
)

# Validate the ONNX model
import onnx
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print("✅ ONNX model check passed")

# Inference with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
# Move the tensor to the CPU first: .numpy() fails on CUDA tensors
outputs = session.run(None, {"input": example_input.cpu().numpy()})
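
Passing `onnx.checker` only confirms that the file is structurally valid; it does not confirm that the exported model computes the same values. A common follow-up is a numerical comparison against the PyTorch output. The helper below is a minimal NumPy sketch: the name `outputs_match` and the tolerances are assumptions, and the arrays are stand-ins for the real logits.

```python
import numpy as np

def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare two model outputs; return (within_tolerance, max_abs_difference)."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        return False, float("inf")
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    within = bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))
    return within, max_abs_diff

# Stand-in arrays; in practice pass the PyTorch logits (as a NumPy array)
# and the first element of the list returned by session.run
torch_logits = np.array([[2.0, -1.0, 0.5]], dtype=np.float32)
onnx_logits = torch_logits + 1e-6
ok, diff = outputs_match(torch_logits, onnx_logits)
print(ok, diff)
```

Small differences are expected because the two runtimes may fuse or order floating-point operations differently; large ones usually point to a preprocessing mismatch or a model exported in training mode.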

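3.3 Model quantization

# Post-training dynamic quantization (suits LSTM / Transformer / Linear-heavy models; CPU inference)
quantized_model = torch.quantization.quantize_dynamic(
    model,                              # original model (must be on the CPU)
    {nn.Linear, nn.LSTM},              # layer types to quantize
    dtype=torch.qint8                   # int8 quantization
)
# Quantized layers shrink by about 75% (int8 vs. float32) and often run 2-3x faster.
# In ResNet50 only the Linear head is affected, because the conv layers stay in float32.

# Save; to load, apply quantize_dynamic to a fresh model first, then call load_state_dict
torch.save(quantized_model.state_dict(), "model_quantized.pth")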
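
Where the roughly 75% figure comes from, and why some accuracy loss is possible, can be seen from the arithmetic alone. The sketch below applies symmetric per-tensor int8 quantization to a random weight matrix with NumPy. It illustrates the general idea only and is not necessarily the exact scheme PyTorch uses internally; the helper names and matrix size are assumptions.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 array -> (int8 array, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_ratio = q.nbytes / weights.nbytes                    # 1 byte vs. 4 bytes per weight
max_error = float(np.max(np.abs(weights - restored)))     # bounded by about scale / 2
print(f"size ratio: {size_ratio:.2f}, max error: {max_error:.5f}, scale: {scale:.5f}")
```

The rounding error per weight is at most about half a quantization step, which is why the accuracy impact is often small but should still be measured on a validation set.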

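4. Inference Optimization Tools Compared

Method	Use case	Typical speedup	Model size
TorchScript	PyTorch deployment	1.2-1.5x	unchanged
ONNX Runtime	Cross-framework deployment	2-3x	unchanged
Dynamic quantization	CPU inference	2-3x	-75%
TensorRT	NVIDIA GPU inference	3-5x	-50%
Core ML	Apple ecosystem	2-3x	-50%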
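
The speedups in the table are rough guides, so it is worth measuring them on your own model and hardware. The following is a minimal timing sketch using only the standard library; `measure_latency` and `speedup` are hypothetical helper names, and the lambdas are placeholder workloads standing in for real inference calls.

```python
import statistics
import time

def measure_latency(fn, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_seconds, optimized_seconds):
    """Speedup factor: baseline time divided by optimized time."""
    return baseline_seconds / optimized_seconds

# Placeholder workloads; substitute calls such as `lambda: model(example_input)`
baseline = measure_latency(lambda: sum(i * i for i in range(20000)))
optimized = measure_latency(lambda: sum(i * i for i in range(5000)))
print(f"speedup: {speedup(baseline, optimized):.2f}x")
```

When timing GPU inference, call `torch.cuda.synchronize()` inside the timed function, because CUDA kernels run asynchronously and the Python call can return before the work is finished.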

5. Common Training Debugging Techniques

5.1 Gradient check

def check_gradients(model):
    """Report the global gradient L2 norm and flag NaN gradients."""
    total_norm = 0.0
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.norm(2)
            total_norm += param_norm.item() ** 2
            if torch.isnan(param.grad).any():
                print(f"⚠️  NaN gradient: {name}")
    total_norm = total_norm ** 0.5
    print(f"Gradient norm: {total_norm:.4f}")
    return total_norm
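
The global norm computed above is the same quantity that `torch.nn.utils.clip_grad_norm_` returns, which gives a convenient cross-check and a remedy when the norm is too large. The sketch below uses a small hypothetical network and random data; the choice of half the measured norm as the clipping threshold is purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(6, 4), nn.Tanh(), nn.Linear(4, 1))
loss = net(torch.randn(5, 6)).pow(2).mean()
loss.backward()

def global_grad_norm(module):
    """Global L2 norm over all gradients, computed as in check_gradients."""
    squared = sum(
        p.grad.norm(2).item() ** 2 for p in module.parameters() if p.grad is not None
    )
    return squared ** 0.5

manual_norm = global_grad_norm(net)

# With max_norm=inf nothing is rescaled; the call only reports the norm
builtin_norm = torch.nn.utils.clip_grad_norm_(
    net.parameters(), max_norm=float("inf")
).item()

# A finite max_norm rescales the gradients in place when the norm exceeds it
max_norm = 0.5 * manual_norm
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=max_norm)
clipped_norm = global_grad_norm(net)

print(f"manual: {manual_norm:.6f}, builtin: {builtin_norm:.6f}, clipped: {clipped_norm:.6f}")
```

In a training loop the clipping call goes between `loss.backward()` and `optimizer.step()`; with AMP, the gradients should be unscaled first via `scaler.unscale_(optimizer)`.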

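5.2 Overfitting check

def check_overfit(train_acc, val_acc, threshold=0.05):
    """Warn when training accuracy exceeds validation accuracy by more than threshold."""
    gap = train_acc - val_acc
    if gap > threshold:
        print(f"⚠️ Possible overfitting! Train-validation gap: {gap:.2%}")
        print("Suggestions: more Dropout / data augmentation / weight decay")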
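
The same gap test can be applied to a whole training history to find when overfitting starts. A minimal sketch follows, with a hypothetical helper `first_overfit_epoch` and made-up accuracy curves used for illustration only.

```python
def first_overfit_epoch(train_accs, val_accs, threshold=0.05):
    """Return the first epoch index whose train/val gap exceeds threshold, or None."""
    for epoch, (train_acc, val_acc) in enumerate(zip(train_accs, val_accs)):
        if train_acc - val_acc > threshold:
            return epoch
    return None

# Made-up accuracy curves for illustration only
train_history = [0.62, 0.75, 0.84, 0.91, 0.96]
val_history = [0.60, 0.72, 0.80, 0.82, 0.81]

overfit_epoch = first_overfit_epoch(train_history, val_history)
print(overfit_epoch)
```

The epoch returned here is a reasonable candidate for where to restore a checkpoint or where stronger regularization should have been in place.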

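Discussion questions

In transfer learning, at which stages are "freezing" and "unfreezing" appropriate? What are the trade-offs between full fine-tuning and training only the classification head?
What does GradScaler do in AMP mixed-precision training? What goes wrong without it?
How do TorchScript trace and script differ? What are the limitations of each?
Does quantization always reduce accuracy? How would you measure the accuracy loss after quantization?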

Info

Path: /tech-stacks/pytorch/tutorial/02-进阶实战-迁移学习与部署.md
Updated: 2026/5/30