LightGBM Classification with Native Categorical Features

Goal

Demonstrate two distinctive LightGBM features: native categorical feature support (no one-hot encoding required) and the early stopping callback.
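To make the contrast with one-hot encoding concrete, here is a minimal pandas-only sketch. It shows what the two representations look like for a single city column; the sample values are made up for illustration. It relies on the assumption that LightGBM reads a pandas `category` column through its integer codes rather than through expanded indicator columns.

```python
import pandas as pd

# Illustrative sample values, not taken from the tutorial's dataset.
cities = pd.Series(["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Beijing"])

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(cities, prefix="city")

# Native route: the column stays single, with one integer code per row.
as_category = cities.astype("category")

print(one_hot.shape[1])                    # → 5
print(list(as_category.cat.categories))    # categories are sorted alphabetically
print(as_category.cat.codes.tolist())      # → [0, 3, 1, 4, 2, 0]
```

With five cities the difference is small, but a column with hundreds of levels would expand into hundreds of sparse indicator columns, while the `category` column stays a single feature.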

Full code

import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ─── 1. Build a dataset with categorical features ───
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    "age": np.random.randint(18, 70, n),
    "city": np.random.choice(["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"], n),
    "education": np.random.choice(["High School", "Bachelor", "Master", "PhD"], n),
    "income": np.random.normal(15000, 5000, n),
    "experience": np.random.randint(0, 30, n),
})
df["target"] = (
    (df["income"] > 15000).astype(int)
    + (df["education"].isin(["Master", "PhD"])).astype(int)
    + (df["experience"] > 10).astype(int)
)
df["target"] = (df["target"] >= 2).astype(int)  # binary label: at least two conditions hold

print(f"Positive rate: {df['target'].mean():.2%}")

# ─── 2. Feature preparation (declare categorical columns with 'category' dtype) ───
cat_cols = ["city", "education"]
for col in cat_cols:
    df[col] = df[col].astype("category")

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ─── 3. Training + early stopping ───
callbacks = [
    lgb.early_stopping(stopping_rounds=20, verbose=True),
    lgb.log_evaluation(period=50),
]

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=7,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,  # row subsampling only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    min_child_samples=20,
    random_state=42,
    verbose=-1,
)

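# Note: to keep the example short, the test split doubles as the early-stopping
# validation set, so the test metrics reported below are optimistic.
model.fit(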
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="auc",
    callbacks=callbacks,
)

# ─── 4. Evaluation ───
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"\nTest accuracy: {acc:.4f}")
print(f"Best iteration: {model.best_iteration_}")
print(f"Best AUC: {model.best_score_['valid_0']['auc']:.4f}")

# ─── 5. Feature importance ───
importance = pd.DataFrame({
    "feature": X.columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
print("\nFeature importances:")
print(importance.to_string(index=False))

How to run

pip install lightgbm pandas scikit-learn
python lgb_categorical.py
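As a sanity check on the first printed line, the positive rate implied by the label construction can be worked out by hand. The label is 1 when at least two of three independent conditions hold: income above 15000 (probability 1/2 for a normal centred at 15000), education in Master/PhD (2 of 4 equally likely levels), and experience above 10 (19 of the 30 integer values 0..29). The sketch below computes that probability exactly and compares it with a simulation; the sample size and seed are arbitrary choices.

```python
from fractions import Fraction

import numpy as np

p_income = Fraction(1, 2)    # P(income > 15000) for Normal(15000, 5000)
p_edu = Fraction(2, 4)       # Master or PhD out of four equally likely levels
p_exp = Fraction(19, 30)     # experience drawn from 0..29; values 11..29 exceed 10


def at_least_two(a, b, c):
    """Probability that at least two of three independent events occur."""
    return a * b * (1 - c) + a * (1 - b) * c + (1 - a) * b * c + a * b * c


expected = at_least_two(p_income, p_edu, p_exp)
print(expected)              # → 17/30, about 0.567

# Monte Carlo check using the same three conditions.
rng = np.random.default_rng(0)
n = 200_000
income_hit = rng.normal(15000, 5000, n) > 15000
edu_hit = rng.integers(0, 4, n) >= 2          # two of the four levels
exp_hit = rng.integers(0, 30, n) > 10
simulated = ((income_hit.astype(int) + edu_hit + exp_hit) >= 2).mean()
print(f"simulated positive rate: {simulated:.4f}")
```

A run of the script should therefore print a positive rate close to 57%.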

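Expected output

Treat the figures below as illustrative: they are not consistent with the data construction above. The label is 1 with probability 17/30 (about 57%), so the positive rate should land near that value, and because the label is a deterministic function of income, education, and experience, accuracy and AUC should come out well above the values shown.

Positive rate: 35.62%
Training until validation scores don't improve for 20 rounds
[50]  valid_0's auc: 0.8234
[100] valid_0's auc: 0.8456
Early stopping, best iteration is [87]
Test accuracy: 0.7823
Best iteration: 87
Best AUC: 0.8489

Feature importances: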
     feature  importance
     income         450
 experience         380
        age         120
       city          80
  education          50

Info

Path: /tech-stacks/lightgbm/examples/原生类别特征 + early stopping 分类实战.md
Last updated: 2026/5/31