MLflow

Tags: mlops, experiment-tracking, model-registry, deployment, reproducibility

Overview

MLflow

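MLflow is an open-source platform from Databricks for managing the machine learning lifecycle, created by a team led by Matei Zaharia (the original creator of Apache Spark). It covers experiment tracking, code packaging, model registration, and deployment, and is widely used in MLOps workflows.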

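Core components:

MLflow Tracking: records parameters, metrics, and models; mlflow.autolog() enables automatic logging with a single line
MLflow Projects: defines reproducible training environments with conda or Docker
MLflow Models: a uniform model packaging format, with APIs for Python, R, and Java
MLflow Model Registry: model version management and stage transitions (Staging/Production)
Framework support: integrations for PyTorch, TensorFlow, XGBoost, spaCy, and others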

Use cases: experiment management, tracking model iterations, team collaboration, and CI/CD for ML.

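Installation

Prerequisites

Python: >= 3.8 (3.10 recommended); newer MLflow releases may require a more recent Python, so check the release notes for the version you install
Storage: the local file system is enough to get started; production deployments typically use S3, Azure Blob Storage, or DBFS
Database: a local store (file-based or SQLite, depending on version and configuration) for local use; PostgreSQL or MySQL is recommended in production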

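Install commands

Minimal install

pip install mlflow

Install with optional extras

pip install "mlflow[extras]"  # extra optional dependencies; the Tracking UI already ships with the base mlflow package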

Start the Tracking UI

mlflow ui --host 0.0.0.0 --port 5000
# Open http://localhost:5000 in a browser

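Remote Tracking Server (production)

# Backend store: PostgreSQL; artifact store: S3
# (the server needs a PostgreSQL driver such as psycopg2, and boto3 for S3)
mlflow server \
    --backend-store-uri postgresql://user:pass@host/mlflow \
    --default-artifact-root s3://mlflow-artifacts \
    --host 0.0.0.0 --port 5000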

Verify the installation

import mlflow
print(mlflow.__version__)
mlflow.set_tracking_uri("http://localhost:5000")

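Common installation problems

Q1: The page is blank after starting `mlflow ui`

With the default local store, mlflow ui reads data relative to the directory it is started from. Start it from the directory where the runs were recorded, or pass --backend-store-uri explicitly. For anything beyond local use, run mlflow server instead of mlflow ui.

Q2: Artifacts are not visible

Check that the artifact root is reachable from where you are reading it. When the tracking server is configured to proxy artifacts, clients fetch them through the server; in local mode they are read directly from the file system.

Q3: autolog recorded nothing

Autologging must be enabled before training starts. Make sure mlflow.autolog() is called before model.fit().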

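Example

MLflow autolog: track experiments automatically with one line of code

Goal

Use autologging (mlflow.xgboost.autolog() here; mlflow.autolog() also works) to record the parameters, metrics, and model of each XGBoost training run with a single line of code, then compare several runs in the UI.

Complete code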

import mlflow
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ─── 1. Configure MLflow ───
mlflow.set_tracking_uri("http://localhost:5000")  # or delete this line to log to the local default store
mlflow.set_experiment("xgboost-breast-cancer")

# ─── 2. Enable autologging ───
mlflow.xgboost.autolog()  # 👈 one line: parameters, metrics, and the model are logged automatically

# ─── 3. Prepare the data ───
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ─── 4. Several runs with different hyperparameters ───
experiments = [
    {"n_estimators": 50,  "max_depth": 3, "learning_rate": 0.1},
    {"n_estimators": 100, "max_depth": 5, "learning_rate": 0.05},
    {"n_estimators": 200, "max_depth": 7, "learning_rate": 0.01},
]

for params in experiments:
    with mlflow.start_run(run_name=f"xgb_d{params['max_depth']}_lr{params['learning_rate']}"):
        model = xgb.XGBClassifier(**params, eval_metric="logloss", random_state=42)
        model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        mlflow.log_metric("test_accuracy", acc)

        print(f"✓ {params} → Accuracy: {acc:.4f}")

print("\nView all runs: mlflow ui --port 5000")
print(f"Total runs: {len(mlflow.search_runs())}")

How to run

# Terminal 1: start the MLflow UI
pip install mlflow xgboost scikit-learn
mlflow ui --port 5000

# Terminal 2: run the experiment script
python mlflow_autolog.py

Open http://localhost:5000 to compare the 3 runs.

Expected output (first execution against an empty experiment; exact accuracies may vary with library versions)

✓ {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1} → Accuracy: 0.9737
✓ {'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.05} → Accuracy: 0.9825
✓ {'n_estimators': 200, 'max_depth': 7, 'learning_rate': 0.01} → Accuracy: 0.9825

View all runs: mlflow ui --port 5000
Total runs: 3

In the UI you can compare the parameter table, metric curves, and run durations, and download the logged models.

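Tutorial

Getting started with MLflow: the four components in practice

1. The four pillars of MLflow

Tracking ──── records parameters, metrics, and model artifacts
Projects ──── packages training code as a reproducible unit
Models   ──── a uniform model format across deployment targets
Registry ──── model version management and promotion workflow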

2. Tracking: manual vs. automatic

autolog (simplest)

import mlflow

mlflow.autolog()  # covers scikit-learn, XGBoost, LightGBM, TensorFlow/Keras, PyTorch Lightning, and more

Manual logging (more flexible)

with mlflow.start_run(run_name="my_run"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("accuracy", 0.95, step=1)
    mlflow.log_metric("accuracy", 0.96, step=2)  # step lets the UI draw a curve
    mlflow.log_artifact("confusion_matrix.png")   # attach any existing local file

Nested runs (hyperparameter search)

import mlflow

with mlflow.start_run(run_name="grid_search"):
    for lr in [0.1, 0.01, 0.001]:
        with mlflow.start_run(run_name=f"lr_{lr}", nested=True):
            mlflow.log_param("lr", lr)
            # train and log metrics here

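3. Models: saving and loading

import mlflow
from mlflow.models import infer_signature

# Assumes `model` is a fitted scikit-learn estimator and `X_train` its training data.
# Log the model; the signature is inferred from sample inputs and predictions.
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, "model", signature=signature)

# Load the model back from a run ("abc123" is a placeholder for a real run ID)
loaded_model = mlflow.sklearn.load_model("runs:/abc123/model")

# Or load a model from the registry
prod_model = mlflow.pyfunc.load_model("models:/my_model/Production")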

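4. Registry: version management

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model ("abc123" is a placeholder for a real run ID)
client.create_registered_model("breast_cancer_classifier")
client.create_model_version(
    name="breast_cancer_classifier",
    source="runs:/abc123/model",
    run_id="abc123",
)

# Promote the model version
# (stage transitions are deprecated since MLflow 2.9 in favor of aliases)
client.transition_model_version_stage(
    name="breast_cancer_classifier",
    version="1",
    stage="Production",
    archive_existing_versions=True,  # archive whatever is currently in Production
)

Stage flow: None → Staging → Production → Archived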

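5. Model signature

import mlflow
from mlflow.models.signature import infer_signature

# X_sample: a small batch of representative inputs; model: a fitted scikit-learn estimator
signature = infer_signature(X_sample, model.predict(X_sample))
mlflow.sklearn.log_model(model, "model", signature=signature)

# With a signature, the input schema is validated automatically at inference time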

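6. Searching runs programmatically

import mlflow
import pandas as pd

runs = mlflow.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.accuracy > 0.9",
    order_by=["metrics.accuracy DESC"],
)
# Params are stored as strings, so the numeric comparison on max_depth is done in pandas
runs = runs[pd.to_numeric(runs["params.max_depth"]) < 10]
print(runs[["params.max_depth", "metrics.accuracy"]])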

7. MLflow deployment options

# Local REST API
mlflow models serve -m models:/my_model/1 -p 1234

# Docker image
mlflow models build-docker -m models:/my_model/1 -n my-model

# Native integrations also exist for Databricks, SageMaker, and Azure ML
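A served model is queried by POSTing JSON to its /invocations endpoint. The sketch below uses only the standard library and assumes the local server from the first command is listening on port 1234 and accepts the MLflow 2.x "dataframe_split" request format; the column names f0 and f1 are made up, and the helper names build_payload and score are hypothetical.

```python
import json
import urllib.request


def build_payload(columns, rows):
    """Build a request body in pandas 'split' orientation."""
    return {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }


def score(payload, url="http://localhost:1234/invocations"):
    """POST the payload to a running model server and return its JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical two-feature input; real column names must match the model signature.
payload = build_payload(["f0", "f1"], [[1.0, 2.0], [3.0, 4.0]])
print(json.dumps(payload))
# score(payload) needs the server from the command above to be running.
```

Keeping payload construction separate from the HTTP call makes it possible to validate requests in unit tests without a running server.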

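Questions to think about

What are the core differences between MLflow Tracking and Weights & Biases (wandb)?
Which validation steps should gate the transition from Staging to Production in the Model Registry?
Why are model signatures needed, and which kinds of errors do they prevent at inference time?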

