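Getting Started with spaCy: Pipeline Components and Custom NER
1. spaCy's Pipeline Architecture
spaCy uses a modular pipeline design. The tokenizer turns raw text into a Doc, and every later component takes that Doc, adds its own annotations, and passes it on: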
Text → tokenizer → tagger → parser → ner → ... → Doc
To inspect the components of a loaded pipeline:
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
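To make the "Doc in, Doc out" contract concrete, here is a minimal sketch of a custom component. It uses a blank English pipeline so that it runs without downloading a model. The component name `alpha_counter` and the extension attribute `n_alpha` are hypothetical, chosen only for this illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension attribute, registered only for this illustration.
if not Doc.has_extension("n_alpha"):
    Doc.set_extension("n_alpha", default=0)


@Language.component("alpha_counter")  # hypothetical component name
def alpha_counter(doc):
    # A component receives a Doc, annotates it, and returns the same Doc.
    doc._.n_alpha = sum(1 for token in doc if token.is_alpha)
    return doc


nlp = spacy.blank("en")          # tokenizer only, no trained components
nlp.add_pipe("sentencizer")      # built-in rule-based sentence splitter
nlp.add_pipe("alpha_counter", last=True)

doc = nlp("spaCy is fast. It is also modular.")
print(nlp.pipe_names)
print(doc._.n_alpha)
print([sent.text for sent in doc.sents])
```

Because each component only sees what earlier components have already written to the Doc, the position you choose in `add_pipe` (`first`, `last`, `before`, `after`) decides which annotations are available to it.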
2. Components in Detail
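| Component | Function | Output |
| --- | --- | --- |
| tokenizer | Tokenization | Doc (a sequence of Token objects) |
| tok2vec | Maps tokens to vectors | Token vectors (doc.tensor) |
| tagger | Part-of-speech tagging | token.tag_ (in the English pipelines, token.pos_ is then mapped from the tag by attribute_ruler) |
| parser | Dependency parsing | token.dep_, token.head |
| ner | Named entity recognition | doc.ents |
| lemmatizer | Lemmatization | token.lemma_ |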
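The attributes in the table can be explored without a trained model by building a Doc by hand. The sketch below assumes spaCy v3, where the `Doc` constructor accepts `tags`, `pos`, `lemmas`, `heads` (absolute token indices), `deps`, and `ents` (IOB strings). The annotations are written manually for illustration and are not model predictions.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written annotations standing in for what each component would predict.
doc = Doc(
    Vocab(),
    words=["Apple", "makes", "phones"],
    tags=["NNP", "VBZ", "NNS"],         # fine-grained tags   -> token.tag_
    pos=["PROPN", "VERB", "NOUN"],      # coarse-grained tags -> token.pos_
    lemmas=["Apple", "make", "phone"],  # lemmas              -> token.lemma_
    heads=[1, 1, 1],                    # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj"],     # dependency labels   -> token.dep_
    ents=["B-ORG", "O", "O"],           # IOB entity tags     -> doc.ents
)

for token in doc:
    print(token.text, token.tag_, token.pos_, token.dep_, token.head.text, token.lemma_)

print([(ent.text, ent.label_) for ent in doc.ents])
```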
3. Matcher: Rule-Based Matching
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Pattern: an adjective immediately followed by a noun
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
matcher.add("AdjNoun", [pattern])
doc = nlp("The quick brown fox jumps over the lazy dog")
matches = matcher(doc)
for match_id, start, end in matches:
    # Typically prints: brown fox, lazy dog. Matched tokens must be adjacent,
    # so "quick fox" cannot match; exact results depend on the model's tags.
    print(doc[start:end])
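Patterns based on POS need a tagger, but patterns over lexical attributes such as LOWER work on a blank pipeline too. The following sketch, with made-up rule names `ML` and `DL`, also shows the `OP` quantifier for an optional token and how to recover the rule name from `match_id`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no tagger is needed for LOWER patterns
matcher = Matcher(nlp.vocab)

# "ML" and "DL" are hypothetical rule names for this illustration.
matcher.add("ML", [[{"LOWER": "machine"}, {"LOWER": "learning"}]])
matcher.add(
    "DL",
    [[{"LOWER": "deep"}, {"LOWER": "reinforcement", "OP": "?"}, {"LOWER": "learning"}]],
)

doc = nlp("Machine learning and deep reinforcement learning overlap.")

# match_id is a hash; nlp.vocab.strings maps it back to the rule name.
found = sorted(
    (start, nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
)
for start, rule, text in found:
    print(start, rule, text)
```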
4. EntityRuler: Custom Entity Rules
import spacy
nlp = spacy.load("en_core_web_sm")
# In spaCy v3, add the component by its string name. Placing it before "ner"
# lets the rule-based entities take precedence over the statistical model.
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "PRODUCT", "pattern": "iPhone 15"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "macbook"}, {"LOWER": "pro"}]},
    {"label": "SKILL", "pattern": [{"LOWER": {"IN": ["python", "java", "rust"]}}]},
]
ruler.add_patterns(patterns)
doc = nlp("I use Python and Rust on my MacBook Pro.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected: [('Python', 'SKILL'), ('Rust', 'SKILL'), ('MacBook Pro', 'PRODUCT')]
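One detail worth checking is case sensitivity. A string pattern such as "iPhone 15" is matched on the exact token text by default, while a token pattern using LOWER ignores case. The sketch below uses a blank pipeline so that only the ruler produces entities; the example sentences are invented for illustration.

```python
import spacy

nlp = spacy.blank("en")  # no statistical ner, so every entity comes from the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # String pattern: matched on exact token text (case-sensitive by default).
        {"label": "PRODUCT", "pattern": "iPhone 15"},
        # Token pattern on LOWER: case-insensitive.
        {"label": "PRODUCT", "pattern": [{"LOWER": "macbook"}, {"LOWER": "pro"}]},
    ]
)

doc_exact = nlp("She bought an iPhone 15 and a macbook pro.")
doc_lower = nlp("She bought an iphone 15.")

ents_exact = [(ent.text, ent.label_) for ent in doc_exact.ents]
ents_lower = [(ent.text, ent.label_) for ent in doc_lower.ents]
print(ents_exact)
print(ents_lower)
```

If case-insensitive matching is wanted for products like this, writing the pattern as a token list with LOWER is the more predictable choice.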
5. Training a Custom NER Model
import random
import spacy
from spacy.training import Example
# Start from a blank English pipeline
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TECH")
# Training data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("I love PyTorch", {"entities": [(7, 14, "TECH")]}),
    ("TensorFlow is from Google", {"entities": [(0, 10, "TECH")]}),
    ("FastAPI makes APIs easy", {"entities": [(0, 7, "TECH")]}),
]
# Initialize the model weights. In spaCy v3, nlp.initialize() replaces
# the deprecated nlp.begin_training().
optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annots in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annots)
        nlp.update([example], sgd=optimizer, drop=0.5, losses=losses)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss {losses['ner']:.4f}")
# Test. Three examples are only enough for a demonstration, so the
# prediction on unseen text is not guaranteed to be correct.
doc = nlp("I started learning PyTorch yesterday")
print([(ent.text, ent.label_) for ent in doc.ents])
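Character offsets that do not line up with token boundaries are a common source of silent training problems. A small check before training can catch them. The helper name `find_misaligned` is hypothetical; it relies on `Doc.char_span`, which returns `None` when the offsets do not map to whole tokens.

```python
import spacy

nlp = spacy.blank("en")

TRAIN_DATA = [
    ("I love PyTorch", {"entities": [(7, 14, "TECH")]}),
    ("TensorFlow is from Google", {"entities": [(0, 10, "TECH")]}),
    ("FastAPI makes APIs easy", {"entities": [(0, 7, "TECH")]}),
]


def find_misaligned(nlp, data):
    """Return (text, start, end, label) for annotations that cut through a token."""
    bad = []
    for text, annots in data:
        doc = nlp.make_doc(text)
        for start, end, label in annots["entities"]:
            # char_span returns None if the offsets are not on token boundaries.
            if doc.char_span(start, end, label=label) is None:
                bad.append((text, start, end, label))
    return bad


print(find_misaligned(nlp, TRAIN_DATA))

# A deliberately wrong offset: (7, 13) covers "PyTorc", not the whole token.
broken = [("I love PyTorch", {"entities": [(7, 13, "TECH")]})]
print(find_misaligned(nlp, broken))
```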
6. spaCy vs NLTK vs Stanza
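| Dimension | spaCy | NLTK | Stanza |
| --- | --- | --- | --- |
| Speed | ⭐⭐⭐ Very fast | ⭐ Slow | ⭐⭐ Moderate |
| Ease of use | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| Academic support | Limited | Extensive (textbooks) | Extensive (Stanford) |
| Production readiness | ⭐⭐⭐ | ⭐ | ⭐⭐ |
| Languages supported | 75+ | Limited | 70+ |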
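Questions for Reflection
- Both doc.ents and Matcher pull spans of interest out of text. When should you use each?
- Why does pipeline order matter? What happens if you put ner before parser?
- When training a custom NER model, how many annotated examples are needed to reach usable accuracy?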