PyTorch

Tags: python, deep learning, automatic differentiation, GPU, neural networks, torch

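Overview

PyTorch Tech Stack Overview

PyTorch is a deep learning framework that Meta AI (formerly Facebook) open-sourced in 2016. Built on the Torch library, it is known for its dynamic computation graph and Python-native style, and it has become one of the most popular deep learning frameworks in both academia and industry.

Core features:

🔧 Dynamic computation graph: define-by-run. The graph is rebuilt on every forward pass, which makes debugging intuitive and allows arbitrary control flow
🐍 Pythonic: interoperates smoothly with NumPy, with a tensor API that closely mirrors it, so the learning curve is gentle
🚀 GPU acceleration: a single .cuda() call moves data to the GPU; multi-GPU distributed training is supported
🧠 Automatic differentiation: the autograd engine computes gradients automatically, with no manual chain rule
📦 Rich ecosystem: torchvision, torchaudio, torchtext, TorchServe, PyTorch Lightning
📱 End-to-end deployment: TorchScript → ONNX → mobile (iOS/Android), browsers (Tensor, etc.)
🎯 Research frontier: a large share of top-conference paper code is implemented in PyTorch

Use cases: computer vision, natural language processing, generative AI, reinforcement learning, scientific computing, model fine-tuning, and inference deployment.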

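Installation

1. Environment setup

Operating system: Linux (Ubuntu 20.04+ recommended) / macOS 11+ / Windows 10+ (WSL2 preferred)
Python version: Python 3.9 - 3.12 (3.10/3.11 recommended)
GPU (optional): NVIDIA GPU + CUDA 11.8 / 12.1+ + cuDNN
Dependencies: pip, conda (conda is recommended for managing CUDA dependencies)

Check the GPU (optional)

# Linux / WSL2
nvidia-smi
# The output should show the CUDA version and GPU information

# If the command is missing or prints nothing, install the NVIDIA driver and CUDA Toolkit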

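2. Installation commands

# Visit pytorch.org to get the install command that best matches your environment

# === CPU build ===
# On Linux the default PyPI wheel bundles CUDA libraries; the cpu index gives a CPU-only build
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# === CUDA 11.8 build ===
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# === CUDA 12.1 build ===
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# === Conda install (resolves CUDA dependencies automatically) ===
# Note: the official pytorch conda channel only covers releases up to PyTorch 2.5; use pip for newer ones
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# === Verify the installation ===
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"none\"}')"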

3. Common installation problems

Problem 1: `torch.cuda.is_available()` returns False

# Check the NVIDIA driver
nvidia-smi

# Check whether the installed PyTorch is a CUDA build
python -c "import torch; print(torch.version.cuda)"

# If the printed CUDA version is None, the CPU build is installed
# Uninstall it, then reinstall a CUDA build
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Problem 2: pip downloads are slow

# Use the Tsinghua mirror
pip install torch torchvision torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple

Problem 3: no GPU support on macOS

On macOS, GPU acceleration is available through the MPS backend (Apple Silicon):

import torch

# The MPS backend was introduced in PyTorch 1.12
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

Problem 4: version mismatch (torch vs torchvision)

# Always install all three packages from the same index-url
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Problem 5: using the GPU inside Docker

# Requires the NVIDIA Container Toolkit on the host
docker run --gpus all -it pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

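Examples

PyTorch Tensor Basics: From NumPy to GPU

Goals

Understand how to create, manipulate, and reshape tensors
Move tensors between CPU and GPU
Understand how Tensor and NumPy ndarray interoperate
Learn the basics of autograd automatic differentiation

Full code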

import torch
import numpy as np

# ============================================================
# 1. Creating tensors
# ============================================================

# From a Python list
a = torch.tensor([1, 2, 3, 4])
print(f"From a list: {a}")

# From NumPy (shares memory)
np_arr = np.array([5, 6, 7, 8])
b = torch.from_numpy(np_arr)
np_arr[0] = 99
print(f"NumPy → Tensor (shared memory): {b}")  # b[0] becomes 99 as well

# Special tensors
zeros = torch.zeros(3, 4)           # all zeros, 3×4
ones = torch.ones(2, 3)             # all ones, 2×3
rand = torch.rand(3, 3)             # uniform on [0, 1)
randn = torch.randn(3, 3)           # standard normal
arange = torch.arange(0, 10, 2)     # like Python's range
linspace = torch.linspace(0, 1, 5)  # 5 evenly spaced points

# Explicit dtypes
int_tensor = torch.tensor([1, 2, 3], dtype=torch.int32)
float_tensor = torch.tensor([1, 2, 3], dtype=torch.float32)
print(f"\nint32: {int_tensor.dtype}, float32: {float_tensor.dtype}")

# ============================================================
# 2. Tensor attributes
# ============================================================

x = torch.randn(4, 3, 28, 28)  # (batch, channel, height, width)
print(f"\nShape: {x.shape}")
print(f"Number of dimensions: {x.dim()}")
print(f"Number of elements: {x.numel()}")
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")

# ============================================================
# 3. Indexing and slicing
# ============================================================

t = torch.arange(1, 13).reshape(3, 4)
print(f"\nOriginal tensor:\n{t}")

print(f"Row 1: {t[0]}")
print(f"Column 2: {t[:, 1]}")
print(f"First two rows, last two columns:\n{t[:2, -2:]}")
print(f"Boolean mask (t > 5): {t[t > 5]}")

# Fancy indexing
indices = torch.tensor([0, 2])
print(f"Rows 0 and 2:\n{t[indices]}")

# ============================================================
# 4. Shape operations
# ============================================================

a = torch.arange(8)
print(f"\nOriginal: {a}")

# reshape / view (view requires contiguous memory)
print(f"reshape 2×4:\n{a.reshape(2, 4)}")
print(f"view 4×2:\n{a.view(4, 2)}")

# Transpose
m = torch.randn(3, 4)
print(f"Transpose:\n{m.T} shape: {m.T.shape}")

# Add a dimension
x = torch.tensor([1, 2, 3])
print(f"unsqueeze(0): {x.unsqueeze(0).shape}")  # (1, 3)
print(f"unsqueeze(1): {x.unsqueeze(1).shape}")  # (3, 1)

# Remove size-1 dimensions
y = torch.randn(1, 3, 1, 5)
print(f"squeeze: {y.squeeze().shape}")           # (3, 5)

# Concatenate
a = torch.randn(2, 3)
b = torch.randn(2, 3)
print(f"cat along dim=0: {torch.cat([a, b], dim=0).shape}")  # (4, 3)
print(f"cat along dim=1: {torch.cat([a, b], dim=1).shape}")  # (2, 6)
print(f"stack on a new dim: {torch.stack([a, b], dim=0).shape}")  # (2, 2, 3)

# ============================================================
# 5. Math operations
# ============================================================

a = torch.randn(3, 4)
b = torch.randn(3, 4)

# Element-wise operations
add = a + b
mul = a * b          # element-wise product (not matrix multiplication)
div = a / b

# Matrix multiplication
c = torch.randn(4, 5)
matmul1 = a @ c              # Python 3.5+
matmul2 = torch.matmul(a, c)
matmul3 = torch.mm(a, c)     # 2D only

# Reductions
print(f"\nSum: {a.sum()}, mean: {a.mean()}, max: {a.max()}")
print(f"Sum along dim 0: {a.sum(dim=0).shape}")   # (4,)
print(f"Sum along dim 0, keepdim: {a.sum(dim=0, keepdim=True).shape}")  # (1, 4)

# ============================================================
# 6. Moving to the GPU
# ============================================================

if torch.cuda.is_available():
    device = torch.device("cuda")
    x_gpu = a.to(device)  # a is the 3×4 float tensor from section 5
    print(f"\n✅ GPU Tensor: {x_gpu.device}")

    # Computation on the GPU
    y_gpu = torch.randn(3, 3, device="cuda")
    result = x_gpu[:3, :3] @ y_gpu

    # Copy the result back to the CPU
    result_cpu = result.cpu()
    print(f"✅ Back on CPU: {result_cpu.device}")

    # Create a tensor directly on the GPU
    gpu_tensor = torch.ones(4, 4, device="cuda")

# ============================================================
# 7. autograd: automatic differentiation
# ============================================================

# Tensors that need gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, 1.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass: y = w·x + b
y = (w * x).sum() + b
print(f"\ny = {y.item():.4f}")

# Backward pass
y.backward()
print(f"∂y/∂x = {x.grad}")  # should equal w = [0.5, 1.0]
print(f"∂y/∂w = {w.grad}")  # should equal x = [2.0, 3.0]
print(f"∂y/∂b = {b.grad}")  # should equal 1.0

# Zero the gradients (required in a training loop, because backward() accumulates into .grad)
x.grad.zero_()
w.grad.zero_()
b.grad.zero_()

Sample output

From a list: tensor([1, 2, 3, 4])
NumPy → Tensor (shared memory): tensor([99,  6,  7,  8])

Shape: torch.Size([4, 3, 28, 28])
Number of dimensions: 4
Number of elements: 9408

y = 4.1000
∂y/∂x = tensor([0.5000, 1.0000])
∂y/∂w = tensor([2., 3.])
∂y/∂b = tensor(1.)

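Key takeaways

Concept	Description
`torch.tensor()` vs `torch.Tensor()`	The former is a factory function (copies the data and infers the dtype); the latter is the class constructor (given only sizes, it returns uninitialized memory)
`from_numpy()`	Shares memory with the NumPy array, so in-place changes on either side affect the other
`.to(device)`	General-purpose device transfer (CPU / CUDA / MPS)
`requires_grad=True`	Marks a tensor so that autograd tracks operations on it
`.backward()`	Computes gradients for every leaf tensor with `requires_grad=True` that contributed to the result
`.grad`	Holds the accumulated gradient
`torch.no_grad()`	Context manager that disables gradient tracking (use it for inference)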
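
The copy-versus-share distinction and the effect of `torch.no_grad()` listed in the table can be checked directly. The following is a minimal sketch; the variable names (`source`, `copied`, `shared`, `tracked`, `untracked`) are illustrative and do not come from the example above.

```python
import numpy as np
import torch

source = np.array([1.0, 2.0, 3.0])

copied = torch.tensor(source)      # torch.tensor() copies the data
shared = torch.from_numpy(source)  # from_numpy() shares the NumPy buffer

source[0] = 99.0                   # mutate the NumPy array in place
print(copied[0].item())            # the copy keeps the old value
print(shared[0].item())            # the shared tensor sees the change

w = torch.tensor([1.0, 2.0], requires_grad=True)
tracked = (w * 2).sum()            # recorded by autograd
with torch.no_grad():
    untracked = (w * 2).sum()      # no graph is built inside the context

print(tracked.requires_grad, untracked.requires_grad)
```

If a tensor must not alias the NumPy array, `torch.tensor(arr)` or `torch.from_numpy(arr).clone()` are the usual ways to get an independent copy.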

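PyTorch Neural Networks: MNIST Handwritten Digit Recognition

Goals

Build a complete training / validation / test pipeline
Learn the three core components: nn.Module, DataLoader, and optimizer
Understand the training loop (forward → loss → backward → step)
Use the GPU to speed up training

Full code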

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

# ============================================================
# 0. Configuration
# ============================================================
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 64
EPOCHS = 5
LR = 0.001

print(f"Using device: {DEVICE}")

# ============================================================
# 1. Data loading and preprocessing
# ============================================================
transform = transforms.Compose([
    transforms.ToTensor(),                     # 0-255 → 0-1, HWC → CHW
    transforms.Normalize((0.1307,), (0.3081,)) # mean and std of MNIST
])

train_dataset = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
test_dataset = datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"Training set size: {len(train_dataset)}, test set size: {len(test_dataset)}")

# ============================================================
# 2. Define the model
# ============================================================
class CNN(nn.Module):
    """A simple convolutional neural network"""
    def __init__(self, num_classes=10):
        super().__init__()
        # Input: (1, 28, 28)

        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # → (16, 28, 28)
        self.bn1 = nn.BatchNorm2d(16)

        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # → (32, 14, 14)
        self.bn2 = nn.BatchNorm2d(32)

        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # → (64, 7, 7)
        self.bn3 = nn.BatchNorm2d(64)

        self.pool = nn.MaxPool2d(2, 2)          # halves height and width each time
        self.dropout = nn.Dropout(0.3)

        self.fc1 = nn.Linear(64 * 3 * 3, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool(x)                        # (16, 14, 14)

        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool(x)                        # (32, 7, 7)

        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)                        # (64, 3, 3)

        x = x.view(x.size(0), -1)               # flatten to (batch, 576)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)                         # raw logits; no softmax here
        return x

# Instantiate
model = CNN(num_classes=10).to(DEVICE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# ============================================================
# 3. Loss function and optimizer
# ============================================================
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

# ============================================================
# 4. Training and evaluation functions
# ============================================================
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc="train", leave=False):
        images, labels = images.to(device), labels.to(device)

        # forward
        outputs = model(images)
        loss = criterion(outputs, labels)

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # statistics (loss is a batch mean, so weight it by the batch size)
        running_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    avg_loss = running_loss / total
    accuracy = correct / total
    return avg_loss, accuracy


@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)

        running_loss += loss.item() * images.size(0)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    avg_loss = running_loss / total
    accuracy = correct / total
    return avg_loss, accuracy

# ============================================================
# 5. Training loop
# ============================================================
history = {"train_loss": [], "train_acc": [], "test_loss": [], "test_acc": []}

for epoch in range(1, EPOCHS + 1):
    print(f"\n{'='*40}\nEpoch {epoch}/{EPOCHS}")

    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)

    scheduler.step()

    history["train_loss"].append(train_loss)
    history["train_acc"].append(train_acc)
    history["test_loss"].append(test_loss)
    history["test_acc"].append(test_acc)

    print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2%}")
    print(f"Test  Loss: {test_loss:.4f} | Test  Acc: {test_acc:.2%}")

print(f"\n✅ Training finished! Final test accuracy: {test_acc:.2%}")

# ============================================================
# 6. Save the model
# ============================================================
torch.save({
    "epoch": EPOCHS,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "test_acc": test_acc,
}, "mnist_cnn.pth")
print("Model saved to mnist_cnn.pth")

# ============================================================
# 7. Predict a single sample
# ============================================================
model.eval()
sample, label = test_dataset[0]
with torch.no_grad():
    output = model(sample.unsqueeze(0).to(DEVICE))  # add the batch dimension
    prob = F.softmax(output, dim=1)
    pred = torch.argmax(prob, dim=1).item()

print(f"\nTrue digit: {label}")
print(f"Predicted digit: {pred}")
print(f"Class probabilities: {prob.cpu().numpy().round(4)}")

Expected output (Epoch 5)

Training set size: 60000, test set size: 10000
Model parameters: 98,666

Epoch 5/5
Train Loss: 0.0123 | Train Acc: 99.52%
Test  Loss: 0.0214 | Test  Acc: 99.31%

✅ Training finished! Final test accuracy: 99.31%

Training pipeline at a glance

for epoch in range(EPOCHS):
    for batch in DataLoader:
        images, labels → to(device)
        
        # ① Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # ② Backward pass
        optimizer.zero_grad()     # clear old gradients
        loss.backward()           # compute new gradients
        optimizer.step()          # update parameters
        
    # ③ Learning-rate schedule
    scheduler.step()
    
    # ④ Validation (under torch.no_grad())
    evaluate(model, test_loader)

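Key takeaways

Concept	Description
`nn.Module`	Base class for all neural network layers and models; subclasses define `forward()`
`DataLoader`	Automatic batching and shuffling, with optional multi-process loading (`num_workers`)
`transforms.Compose`	Data preprocessing pipeline
`optimizer.zero_grad()`	Required! Otherwise gradients accumulate across iterations
`model.train()` / `model.eval()`	Switches the behavior of Dropout and BatchNorm
`torch.no_grad()`	Disables gradient tracking during inference and saves memory
`state_dict`	Dictionary of the model's parameters and buffers, used for saving and loading
`torch.save(obj, path)`	General-purpose serialization (models, dictionaries, arbitrary objects)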
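
The MNIST script saves a checkpoint but does not load it again. A minimal round-trip sketch follows, assuming the same dictionary keys as above; `TinyNet` and the temporary file name are hypothetical stand-ins so the snippet runs without the dataset.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim


class TinyNet(nn.Module):
    """Hypothetical stand-in for the CNN above, kept tiny for illustration."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


model = TinyNet().eval()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Save model and optimizer state together, as in the MNIST script
path = os.path.join(tempfile.mkdtemp(), "tiny_checkpoint.pth")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Rebuild the same architecture first, then load the saved state into it
restored = TinyNet()
restored_optimizer = optim.Adam(restored.parameters(), lr=1e-3)
checkpoint = torch.load(path, map_location="cpu")
restored.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
restored.eval()  # switch Dropout/BatchNorm to inference behavior

x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))
print(same)
```

Saving the `state_dict` rather than the whole model object keeps the checkpoint independent of the class's import path; the cost is that the loading code must construct the architecture itself.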

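Tutorial

PyTorch Beginner Tutorial: From Linear Regression to Neural Networks

Chapter goals

Understand dynamic computation graphs and automatic differentiation
Go from hand-written gradient descent to PyTorch's optimizer
Learn the modular design behind nn.Module
Understand how to choose loss functions and optimizers

1. Dynamic computation graphs: why is PyTorch so flexible?

Static graph (TensorFlow 1.x): define the whole graph first → compile → run

# TF 1.x style (no longer recommended); assumes `import tensorflow as tf` with the 1.x API
x = tf.placeholder(tf.float32, shape=[None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)  # nothing runs yet; this only builds the graph
# ... the computation only happens in session.run()

Dynamic graph (PyTorch): every operation runs immediately, and the graph follows the code

import torch

x = torch.randn(10, 784)
W = torch.randn(784, 10, requires_grad=True)
y = x @ W           # computed right away; you can print and debug it
loss = y.sum()
loss.backward()     # fills in W.grad automatically

Advantage: you can use if/for/while inside forward, which is essential for variable-length sequences, conditional branches, and similar cases.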
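
To make the control-flow point concrete, here is a minimal sketch of a module whose `forward` contains an ordinary Python loop and branch. `RepeatNet` and its `steps` argument are hypothetical names chosen for illustration.

```python
import torch
import torch.nn as nn


class RepeatNet(nn.Module):
    """Hypothetical module: applies the same layer a caller-chosen number of times."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def forward(self, x, steps):
        # Plain Python loop: the recorded graph is as deep as `steps` for this call
        for _ in range(steps):
            x = torch.tanh(self.fc(x))
        # Plain Python branch on a value computed at run time
        if x.mean() > 0:
            return x.sum()
        return -x.sum()


torch.manual_seed(0)
net = RepeatNet()
loss = net(torch.randn(2, 4), steps=3)
loss.backward()  # autograd follows whichever path was actually executed
print(loss.dim(), tuple(net.fc.weight.grad.shape))
```

Calling `net` again with a different `steps` value simply records a different graph; nothing needs to be recompiled.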

2. From hand-written gradient descent to PyTorch

Stage one: manual derivatives in pure NumPy

import numpy as np

# Data: y = 3x + 2 + noise
np.random.seed(42)
X = np.random.randn(100, 1)
y = 3 * X + 2 + np.random.randn(100, 1) * 0.3

w, b = np.random.randn(1), np.random.randn(1)
lr = 0.01

for epoch in range(1000):
    y_pred = X * w + b
    loss = ((y_pred - y) ** 2).mean()

    # Gradients of the MSE derived by hand (error-prone!)
    dw = (2 / len(y)) * (X * (y_pred - y)).sum()
    db = (2 / len(y)) * (y_pred - y).sum()

    w -= lr * dw
    b -= lr * db

Stage two: automatic derivatives with PyTorch autograd

import torch

# Reuses X, y, and lr from stage one
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

for epoch in range(1000):
    y_pred = X_t * w + b
    loss = ((y_pred - y_t) ** 2).mean()

    loss.backward()  # computes w.grad and b.grad automatically

    with torch.no_grad():  # the update itself must not be tracked
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()      # reset, otherwise gradients accumulate
        b.grad.zero_()

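Stage three: PyTorch optimizer + nn.Module

import torch.nn as nn
import torch.optim as optim

# Reuses X_t and y_t from stage two
model = nn.Linear(1, 1)           # one line defines w and b
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    y_pred = model(X_t)
    loss = criterion(y_pred, y_t)

    optimizer.zero_grad()          # one line clears all gradients
    loss.backward()
    optimizer.step()               # one line updates all parameters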
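
As a sanity check on the three stages, the sketch below runs the stage-three loop end to end on synthetic data and prints the learned parameters, which should land near the true slope 3 and intercept 2. It regenerates the data with `torch.randn` instead of reusing the NumPy arrays, so the exact values are an assumption of this sketch rather than a reproduction of the run above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Synthetic data: y = 3x + 2 + noise (same form as in stage one)
X_t = torch.randn(100, 1)
y_t = 3 * X_t + 2 + 0.3 * torch.randn(100, 1)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    loss = criterion(model(X_t), y_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

w, b = model.weight.item(), model.bias.item()
print(f"w = {w:.2f}, b = {b:.2f}, final loss = {loss.item():.4f}")
```

The fitted values differ slightly from 3 and 2 because of the added noise; the final loss should sit close to the noise variance (0.3² = 0.09).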

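3. Modular design with nn.Module

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Multi-layer perceptron: built like stacking LEGO bricks"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        # Define the "layers" (submodules)
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        # One BatchNorm per hidden layer, so each keeps its own running statistics
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.bn2 = nn.BatchNorm1d(hidden_dim)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        # Define how the layers are connected
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# PyTorch tracks all submodules automatically
model = MLP(784, 256, 10)
print(list(model.children()))       # iterate over submodules
print(sum(p.numel() for p in model.parameters()))  # total parameter count

# All parameters can be moved or cast in one call
if torch.cuda.is_available():
    model = model.to("cuda")
model = model.float()               # cast to float32
model = model.half()                # cast to float16 (pure half precision; for mixed precision use AMP)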

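4. Loss function cheat sheet

Task	Loss function	PyTorch
Regression	Mean squared error	`nn.MSELoss()`
Regression	Mean absolute error	`nn.L1Loss()`
Binary classification	Binary cross-entropy	`nn.BCEWithLogitsLoss()`
Multi-class classification	Cross-entropy (softmax built in)	`nn.CrossEntropyLoss()`
Imbalanced classification	Focal Loss	Custom implementation
Similarity learning	Triplet Margin Loss	`nn.TripletMarginLoss()`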
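
The table lists Focal Loss as a custom implementation. One possible multi-class sketch is shown below, assuming the common form `(1 - p_t)^gamma * CE` without the optional per-class `alpha` weights; the function name `focal_loss` and its signature are illustrative, not a PyTorch API.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss sketch: mean of (1 - p_t)^gamma * cross-entropy."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log(p_t)
    p_t = torch.exp(-ce)                                     # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()                # down-weight easy samples


torch.manual_seed(0)
logits = torch.randn(8, 5)            # raw scores for 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))   # class indices

plain_ce = F.cross_entropy(logits, targets)
fl_gamma0 = focal_loss(logits, targets, gamma=0.0)  # gamma = 0 reduces to cross-entropy
fl_gamma2 = focal_loss(logits, targets, gamma=2.0)
print(plain_ce.item(), fl_gamma0.item(), fl_gamma2.item())
```

Because the modulating factor `(1 - p_t)^gamma` lies in [0, 1], the focal loss never exceeds plain cross-entropy; larger `gamma` shifts more of the loss onto hard, misclassified samples.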

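Key point: CrossEntropyLoss already applies softmax internally!

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # raw model outputs (example values)
labels = torch.randint(0, 10, (4,))   # class indices

# ❌ Wrong: softmax is applied twice
output = F.softmax(logits, dim=1)
loss = nn.CrossEntropyLoss()(output, labels)

# ✅ Correct: CrossEntropyLoss applies LogSoftmax internally
loss = nn.CrossEntropyLoss()(logits, labels)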
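
The claim can be verified numerically. The sketch below compares `nn.CrossEntropyLoss` on raw logits against an explicit `log_softmax` followed by `nll_loss`, and shows that feeding softmax outputs gives a different value without raising any error. The tensor values are arbitrary examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)              # raw scores for 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])

ce = nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)         # same computation, spelled out
double = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), labels)  # softmax applied twice

print(ce.item(), manual.item(), double.item())
```

The double-softmax variant is dangerous precisely because it runs silently: the loss is still finite and training still progresses, only with flattened gradients.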

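5. Optimizer comparison

Optimizer	Characteristics	Typical use
`SGD`	Classic; the learning rate must be tuned by hand	When strong generalization matters
`SGD + Momentum`	Faster convergence and less oscillation	Common in CV tasks
`Adam`	Adaptive learning rates	Default first choice; common in NLP
`AdamW`	Adam with decoupled weight decay	Large models / Transformers
`RMSprop`	Suits non-stationary objectives	RNNs / reinforcement learning

6. Practical tip: gradient clipping

# Guards against exploding gradients (a staple of RNN/Transformer training)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# A complete training step (model, criterion, optimizer, and dataloader are defined elsewhere)
for inputs, labels in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    # Clip after backward() and before step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()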
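
A small sketch of what clipping does to the gradient norm follows. The inputs are deliberately scaled up so that the gradients are large; the model and data are toy assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
inputs = torch.randn(16, 10) * 100      # large inputs → large gradients
loss = model(inputs).pow(2).mean()
loss.backward()

# clip_grad_norm_ rescales all gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(norm_before.item(), norm_after.item())
```

Logging the returned pre-clipping norm during training is a cheap way to see how often clipping actually kicks in.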

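Questions for reflection

Does a dynamic computation graph cost performance compared with a static graph? How does torch.compile in PyTorch 2.0 address this?
What happens if you forget to call optimizer.zero_grad()?
Which layers change behavior between model.train() and model.eval(), and how?
Why must the input to CrossEntropyLoss not be passed through softmax first?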

PyTorch in Practice: Transfer Learning, Mixed Precision, and Model Deployment

Chapter goals

Use pretrained models for transfer learning (fine-tuning)
Speed up training with automatic mixed precision (AMP)
Export models: TorchScript → ONNX
Quantize models and optimize inference

1. Transfer learning (fine-tuning)

1.1 Using pretrained torchvision models

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Load a pretrained ResNet50 (the weights are downloaded on first use)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all parameters (feature-extraction mode)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match your own number of classes
num_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, 10),  # assuming 10 classes
)
# Newly created layers have requires_grad=True by default

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"Frozen parameters: {sum(p.numel() for p in model.parameters() if not p.requires_grad):,}")

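1.2 Progressive unfreezing

def progressive_unfreeze(model, epoch, total_epochs):
    """Unfreeze more layers over time; return True if anything was unfrozen this epoch."""
    unfrozen = False
    if epoch == total_epochs // 2:
        # Second half of training: unfreeze layer4
        for param in model.layer4.parameters():
            param.requires_grad = True
        unfrozen = True
    if epoch == (total_epochs * 3) // 4:
        # Last 25%: unfreeze layer3
        for param in model.layer3.parameters():
            param.requires_grad = True
        unfrozen = True
    return unfrozen

# Use inside the training loop (optim, LR, and total_epochs are defined elsewhere)
for epoch in range(total_epochs):
    # Rebuild the optimizer at every unfreezing step, before training this epoch,
    # so the newly trainable parameters are actually updated
    if progressive_unfreeze(model, epoch, total_epochs):
        optimizer = optim.Adam(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=LR * 0.1  # lower learning rate for fine-tuning
        )
    # ... training code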

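1.3 Different learning rates for different layers

# Higher learning rate for the new layers, lower for the pretrained ones
# (layer3 and layer4 must have requires_grad=True, otherwise their groups are never updated)
optimizer = optim.SGD([
    {"params": model.fc.parameters(), "lr": 1e-3},        # new head: high lr
    {"params": model.layer4.parameters(), "lr": 1e-4},    # top block: medium lr
    {"params": model.layer3.parameters(), "lr": 1e-5},    # lower block: low lr
], momentum=0.9, weight_decay=1e-4)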
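
How per-group options are resolved can be inspected through `optimizer.param_groups`. The sketch below uses a hypothetical two-part toy model ("backbone" standing in for pretrained layers, "head" for the new classifier) so it runs without downloading ResNet weights.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy model; the names "backbone" and "head" are illustrative only
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "head": nn.Linear(8, 2),
})

optimizer = optim.SGD(
    [
        {"params": model["head"].parameters(), "lr": 1e-3},   # explicit per-group lr
        {"params": model["backbone"].parameters()},           # falls back to the default lr below
    ],
    lr=1e-5,        # default for groups that do not set their own lr
    momentum=0.9,   # shared by all groups
)

for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"], group["momentum"], len(group["params"]))
```

Options given inside a group override the keyword defaults; anything a group leaves out (here `momentum`, and `lr` for the backbone) is filled in from the keyword arguments.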

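2. Mixed-precision training (AMP)

# Newer PyTorch releases deprecate these aliases in favor of
# torch.amp.GradScaler("cuda") and torch.amp.autocast("cuda")
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # gradient scaler

# Reuses model, criterion, optimizer, train_loader, EPOCHS, and DEVICE from the MNIST example
for epoch in range(EPOCHS):
    for images, labels in train_loader:
        images, labels = images.to(DEVICE), labels.to(DEVICE)

        optimizer.zero_grad()

        # Forward: autocast picks float16 or float32 per operation
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        # Backward: the scaler multiplies the loss to keep gradients from underflowing
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # unscales the gradients, skips the step if they contain inf/NaN
        scaler.update()         # adjusts the scale factor dynamically

print("AMP training finished; gains of roughly 40% less GPU memory and 1.5-2x speed are typical but depend on model and hardware")

How AMP works, in brief

Plain FP32 training:
  every value is float32 (4 bytes) → more GPU memory, slower

AMP mixed precision:
  most operations run in float16 (fast, memory-efficient)
  precision-sensitive operations (softmax, loss) stay in float32
  GradScaler keeps small gradients from becoming 0 in float16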
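
The underflow that GradScaler guards against can be reproduced on the CPU with plain tensor casts. The sketch below is an illustration of the idea only, not of GradScaler's internals; the scale of 65536 (2^16) is chosen here as an example value.

```python
import torch

small_grad = torch.tensor(1e-8, dtype=torch.float32)  # a gradient too small for float16
scale = 65536.0                                       # example scale factor (2**16)

lost = small_grad.half()                # cast directly: underflows to zero
kept = (small_grad * scale).half()      # scale first, then cast: stays representable
recovered = kept.float() / scale        # unscale in float32 before the optimizer step

print(lost.item(), kept.item(), recovered.item())
```

Scaling the loss scales every gradient by the same factor (by the chain rule), which is why a single multiplication before `backward()` is enough to lift all of them into float16's representable range.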

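3. Model export and deployment

3.1 TorchScript (recommended)

# Option 1: trace (suits models without data-dependent control flow)
device = next(model.parameters()).device  # keep the example input on the model's device
example_input = torch.randn(1, 3, 224, 224).to(device)
traced_model = torch.jit.trace(model.eval(), example_input)
traced_model.save("resnet50_traced.pt")

# Option 2: script (supports control flow); compile the module itself
scripted_model = torch.jit.script(model.eval())
scripted_model.save("resnet50_scripted.pt")

# Load and run inference
loaded_model = torch.jit.load("resnet50_traced.pt")
with torch.no_grad():
    output = loaded_model(example_input)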
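
The difference between the two options shows up as soon as `forward` branches on the data. In the sketch below, `SignFlip` is a hypothetical module written for this illustration: tracing records only the branch taken by the example input, while scripting compiles both branches.

```python
import torch
import torch.nn as nn


class SignFlip(nn.Module):
    """Hypothetical module with data-dependent control flow."""
    def forward(self, x):
        if x.sum() > 0:
            out = x * 2
        else:
            out = -x
        return out


module = SignFlip().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

traced = torch.jit.trace(module, positive)  # emits a TracerWarning about the branch
scripted = torch.jit.script(module)         # compiles the if/else itself

eager_out = module(negative)      # takes the else branch
traced_out = traced(negative)     # replays the "x * 2" branch recorded during tracing
scripted_out = scripted(negative) # takes the else branch, like eager mode

print(eager_out, traced_out, scripted_out)
```

For this reason, a traced model should be checked against the eager model on inputs that exercise every branch, or scripted instead.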

3.2 ONNX export

# Export to ONNX
torch.onnx.export(
    model.eval(),
    example_input,                           # example input
    "resnet50.onnx",                         # output path
    export_params=True,
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},          # dynamic batch dimension
        "output": {0: "batch_size"},
    },
)

# Validate the ONNX model
import onnx
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print("✅ ONNX model check passed")

# Inference with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
# .cpu() is required when example_input lives on the GPU
outputs = session.run(None, {"input": example_input.cpu().numpy()})

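3.3 Model quantization

# Post-training dynamic quantization (suits LSTM / Transformer / Linear layers)
quantized_model = torch.quantization.quantize_dynamic(
    model.cpu().eval(),                 # original model; dynamic quantization runs on the CPU
    {nn.Linear, nn.LSTM},              # layer types to quantize
    dtype=torch.qint8                   # int8 quantization
)
# The quantized layers shrink by about 75% (int8 vs float32); a 2-3x inference speedup is
# typical for Linear/LSTM-heavy models, while conv-heavy models such as ResNet gain little

# Save
torch.save(quantized_model.state_dict(), "model_quantized.pth")

4. Inference optimization tools compared

Method	Use case	Speedup (rough, workload-dependent)	Model size
TorchScript	PyTorch deployment	1.2-1.5x	unchanged
ONNX Runtime	Cross-framework deployment	2-3x	unchanged
Dynamic quantization	CPU inference	2-3x	-75%
TensorRT	NVIDIA GPU inference	3-5x	-50%
Core ML	Apple ecosystem	2-3x	-50%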
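5. Common training and debugging techniques

5.1 Gradient check

import torch

def check_gradients(model):
    """Report the global gradient norm and flag NaN gradients (call after loss.backward())"""
    total_norm = 0
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.norm(2)
            total_norm += param_norm.item() ** 2
            if torch.isnan(param.grad).any():
                print(f"⚠️  NaN gradient: {name}")
    total_norm = total_norm ** 0.5
    print(f"Gradient norm: {total_norm:.4f}")
    return total_norm

5.2 Overfitting check

def check_overfit(train_acc, val_acc, threshold=0.05):
    gap = train_acc - val_acc
    if gap > threshold:
        print(f"⚠️ Possible overfitting! Train-validation gap: {gap:.2%}")
        print("Suggestions: more Dropout / data augmentation / weight decay")

Questions for reflection

In transfer learning, at which stages are "freezing" and "unfreezing" appropriate? What is the trade-off between full fine-tuning and training only the classification head?
What does GradScaler do in AMP mixed-precision training, and what goes wrong without it?
How do TorchScript trace and script differ, and what are the limitations of each?
Does quantization always reduce accuracy? How would you measure the accuracy loss after quantization?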

