NLTK


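Overview

NLTK (Natural Language Toolkit) is a classic NLP teaching library created in 2001 by Steven Bird and Edward Loper. It is used in many NLP textbooks and university courses. It provides more than 50 corpora and lexical resources, plus a full set of basic NLP tools: tokenization, part-of-speech (POS) tagging, parsing, semantic reasoning, and more.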

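Core strengths:

  • Teaching favorite: accompanies the O'Reilly textbook "Natural Language Processing with Python"
  • Rich corpora: Brown Corpus, Gutenberg, WordNet, Twitter samples and others are ready to use after a one-time nltk.download()
  • Transparent algorithms: most algorithms are implemented in pure Python, which makes them easy to read and study
  • Broad coverage: tokenization → stemming → POS tagging → chunking → parse trees → semantic reasoning
  • Still relevant: many university NLP courses start with NLTK

Use cases: NLP teaching, algorithm prototyping, linguistics research, and learning how the basic algorithms work.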

Installation

Prerequisites

  • Python: >= 3.8 (3.10 recommended)
  • Disk: ~500 MB (corpora and model data)

Install commands

Install the core library

pip install nltk

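Download corpora (interactively or by name)

import nltk
nltk.download()              # open the interactive downloader
# or download individual packages by name
nltk.download("punkt")       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("wordnet")     # WordNet lexical database
nltk.download("stopwords")   # stop word lists
nltk.download("brown")       # Brown corpus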

Download everything in one step

import nltk
nltk.download("all")  # about 3 GB; this takes a while

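Recommended minimal set

import nltk
nltk.download([
    "punkt",                           # tokenizer models
    "punkt_tab",                       # tokenizer data loaded by newer NLTK releases
    "averaged_perceptron_tagger",      # POS tagger
    "averaged_perceptron_tagger_eng",  # POS tagger data loaded by NLTK 3.9+
    "maxent_ne_chunker",               # NER chunker
    "maxent_ne_chunker_tab",           # NER chunker data loaded by NLTK 3.9+
    "words",                           # English word list
    "stopwords",                       # stop words
    "wordnet",                         # WordNet
])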
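
To avoid downloading again on every run, one option is to check whether each resource is already installed and fetch only what is missing. The sketch below is illustrative: the helper name ensure_resources and the package-to-path mapping are assumptions, not part of NLTK. It relies only on nltk.data.find, which raises LookupError for a missing resource, and on nltk.download.

```python
import nltk

# Assumed mapping from download package id to the path nltk.data.find() expects.
# Adjust it to the resources your NLTK version actually needs.
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
}

def ensure_resources(resources, download=nltk.download):
    """Download only the packages whose data cannot be found locally.

    Returns the list of package ids that were requested for download.
    """
    requested = []
    for package, path in resources.items():
        try:
            nltk.data.find(path)      # raises LookupError when the resource is missing
        except LookupError:
            download(package)
            requested.append(package)
    return requested

# Typical call (needs network access the first time):
# ensure_resources(RESOURCES)
```

Passing the download function as a parameter keeps the helper easy to try out without network access, for example by handing it a stub that only records the requested package names.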

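Common installation problems

Q1: Resource punkt not found

The tokenizer models have not been downloaded. Run nltk.download("punkt"). Newer NLTK releases (3.9 and later) look for punkt_tab instead, so also run nltk.download("punkt_tab").

Q2: Download times out

NLTK data is downloaded from GitHub, which can be slow on some networks (for example in mainland China). Configure a proxy, or download the zip files manually and place them under ~/nltk_data/ in the matching subdirectory (for example tokenizers/ or corpora/).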
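
When the default download location is not usable, or the zip files were fetched by hand, NLTK can be pointed at a custom data directory. A minimal sketch, assuming a hypothetical directory name my_nltk_data: nltk.data.path is the list of directories NLTK searches, and the NLTK_DATA environment variable is an alternative way to add one.

```python
import os
import nltk

# Hypothetical location for manually downloaded data; any directory you control works.
data_dir = os.path.join(os.path.expanduser("~"), "my_nltk_data")

# Tell NLTK to search this directory in addition to the defaults.
if data_dir not in nltk.data.path:
    nltk.data.path.append(data_dir)

# Packages can also be downloaded straight into it (needs network access):
# nltk.download("punkt_tab", download_dir=data_dir)
print(nltk.data.path[-1])
```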

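Q3: What is the difference between punkt and punkt_tab?

punkt is the original tokenizer model package; punkt_tab is the replacement data format that recent NLTK releases load instead. Which one you need depends on your NLTK version, so downloading both is the simplest option.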

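Example

A classic NLP pipeline with NLTK

Goal

Walk through NLTK's standard processing flow: tokenization → stop word removal → POS tagging → named entity recognition → stemming and lemmatization → word frequency counting → WordNet lookup.

Full code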

import nltk

# On first run, uncomment to download the required data.
# The *_eng / *_tab packages are the ones NLTK 3.9+ looks for.
# nltk.download(["punkt", "punkt_tab",
#                "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
#                "maxent_ne_chunker", "maxent_ne_chunker_tab",
#                "words", "stopwords", "wordnet"])
#
# If the download fails with an SSL certificate error, this workaround disables
# certificate verification (use it only on a network you trust):
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.probability import FreqDist

text = """
Elon Musk announced on Monday that Tesla's new factory in Shanghai 
will produce over 500,000 electric vehicles annually. 
The groundbreaking ceremony was attended by local officials and Tesla executives.
Tim Cook, the CEO of Apple, also visited Beijing last week.
"""

# ─── 1. Tokenization ───
tokens = word_tokenize(text)
print(f"Tokens: {tokens[:10]}...")

# ─── 2. Remove stop words, keep alphabetic tokens only ───
stop_words = set(stopwords.words("english"))
clean_tokens = [w.lower() for w in tokens if w.isalpha() and w.lower() not in stop_words]
print(f"\nCleaned: {clean_tokens}")

# ─── 3. POS tagging (on the original tokens, since the tagger uses case and context) ───
pos_tags = pos_tag(tokens)
print("\nPOS tags:")
for word, tag in pos_tags:
    if word.isalpha():
        print(f"  {word:<20} → {tag}")

# ─── 4. Named entity recognition ───
ner_tree = ne_chunk(pos_tags)
print("\nNamed entities:")
for subtree in ner_tree:
    if hasattr(subtree, "label"):  # entity spans are subtrees; plain tokens are tuples
        entity = " ".join([leaf[0] for leaf in subtree.leaves()])
        print(f"  {subtree.label():<10} | {entity}")

# ─── 5. Stemming vs lemmatization ───
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is given,
# which is why "running" comes back unchanged.
print("\nStem / lemma comparison:")
for word in ["running", "vehicles", "factories", "better", "attended"]:
    print(f"  {word:<12} → stem: {stemmer.stem(word):<12} | lemma: {lemmatizer.lemmatize(word)}")

# ─── 6. Word frequencies ───
fdist = FreqDist(clean_tokens)
print("\nTop 10 words:")
for word, freq in fdist.most_common(10):
    print(f"  {word:<15} {freq}")

# ─── 7. WordNet semantics ───
for synset in wordnet.synsets("vehicle", pos=wordnet.NOUN):
    print(f"\nSynset: {synset.name()}")
    print(f"  Definition: {synset.definition()}")
    print(f"  Examples: {synset.examples()}")

How to run

pip install nltk
python nltk_pipeline.py

Expected output

(Abridged. Exact tags and entities depend on the installed model versions.)

Tokens: ['Elon', 'Musk', 'announced', 'on', 'Monday', ...]

POS tags:
  Elon                 → NNP
  Musk                 → NNP
  announced            → VBD
  Tesla                → NNP
  ...

(NNP = proper noun, VBD = past-tense verb)

Named entities:
  PERSON     | Elon Musk
  GPE        | Shanghai
  PERSON     | Tim Cook
  ORGANIZATION | Apple
  GPE        | Beijing

Stem / lemma comparison:
  running      → stem: run          | lemma: running
  vehicles     → stem: vehicl       | lemma: vehicle
  factories    → stem: factori      | lemma: factory
  ...

Top 10 words:
  tesla            2
  elon             1
  musk             1
  ...
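
ne_chunk returns an nltk.Tree in which each named entity is a subtree labelled with the entity type, which is what the loop in step 4 walks over. The sketch below builds such a tree by hand, so it runs without any downloaded models, and collects the entities into a list. The tree contents and the helper name extract_entities are illustrative assumptions, not real model output.

```python
from nltk import Tree

# Hand-built stand-in for ne_chunk() output: leaves are (word, tag) pairs,
# entity spans are subtrees labelled with the entity type.
tree = Tree("S", [
    Tree("PERSON", [("Tim", "NNP"), ("Cook", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Beijing", "NNP")]),
    ("last", "JJ"),
    ("week", "NN"),
])

def extract_entities(chunk_tree):
    """Return (label, text) pairs for every entity subtree directly under the root."""
    entities = []
    for node in chunk_tree:
        if isinstance(node, Tree):          # plain tokens are tuples, entities are Trees
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(extract_entities(tree))
# [('PERSON', 'Tim Cook'), ('GPE', 'Beijing')]
```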

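Tutorial

Getting started with NLTK: corpora, parse trees, and classifiers

1. Where NLTK fits

NLTK is not an industrial-strength tool (that is spaCy's job); it is the Swiss Army knife of NLP teaching. Its value lies in helping you understand what tokenization actually does and how a parse tree is built, rather than calling a black box.

"If you only want the result, use spaCy. If you want to understand why, use NLTK."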
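
As one way to see what tokenization is actually doing, the sketch below uses NLTK's RegexpTokenizer with a hand-written pattern. The pattern is an illustrative assumption, much simpler than the rules behind word_tokenize, and it needs no downloaded data.

```python
from nltk.tokenize import RegexpTokenizer

# A token is either a run of word characters (optionally followed by ",ddd" or
# ".ddd" groups, so "500,000" stays together) or a single punctuation character.
tokenizer = RegexpTokenizer(r"\w+(?:[,.]\d+)*|[^\w\s]")

sentence = "Tesla's new factory will produce over 500,000 vehicles."
tokens = tokenizer.tokenize(sentence)
print(tokens)
# ['Tesla', "'", 's', 'new', 'factory', 'will', 'produce', 'over', '500,000', 'vehicles', '.']
```

This pattern splits the apostrophe off as its own token, whereas word_tokenize keeps "'s" together. Writing the rule yourself makes that kind of decision visible.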

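2. Built-in corpora

import nltk
from nltk.corpus import brown, gutenberg, reuters, inaugural

# Each corpus must be downloaded once, e.g. nltk.download(["brown", "gutenberg"])

# Brown corpus: about 1 million words of tagged text in 15 categories (news, fiction, government, ...)
print(brown.categories())
# ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', ...]

# Gutenberg: classic literature
print(gutenberg.fileids())
# ['austen-emma.txt', 'shakespeare-hamlet.txt', ...]

# Explore a specific text
emma = nltk.Text(gutenberg.words("austen-emma.txt"))
emma.concordance("love")  # show "love" with its surrounding context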
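
A concordance is essentially an index from each word to the positions where it occurs, plus a window of context around each position. The sketch below shows the idea on a tiny hand-written token list using nltk.text.ConcordanceIndex, so it runs without downloading a corpus. The sample sentence and the window size are assumptions for illustration.

```python
from nltk.text import ConcordanceIndex

tokens = "I love NLTK because I love understanding how tools work".split()

# Index every token by its lowercased form, so lookups ignore case.
index = ConcordanceIndex(tokens, key=lambda tok: tok.lower())

offsets = index.offsets("love")   # positions of the word in the token list
print(offsets)                    # [1, 5]

# Show two tokens of context on each side of every hit.
window = 2
for pos in offsets:
    left = " ".join(tokens[max(0, pos - window):pos])
    right = " ".join(tokens[pos + 1:pos + 1 + window])
    print(f"{left:>15} [{tokens[pos]}] {right}")
```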

3. Conditional frequency distributions

from nltk.corpus import brown
from nltk.probability import ConditionalFreqDist

# One frequency distribution per genre: condition = genre, sample = word
cfd = ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)

# Counts of common modal verbs in selected genres
genres = ["news", "romance", "science_fiction"]
modals = ["can", "could", "may", "might", "must", "will"]
cfd.tabulate(conditions=genres, samples=modals)
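
ConditionalFreqDist is built from (condition, sample) pairs and behaves like a dictionary of FreqDist objects, one per condition. A minimal sketch on made-up pairs (the genres and words below are illustrative, not Brown corpus counts), runnable without any corpus download:

```python
from nltk.probability import ConditionalFreqDist

# Made-up (genre, word) pairs standing in for a real corpus.
pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]
cfd = ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))   # ['news', 'romance']
print(cfd["news"]["will"])        # 2
print(cfd["romance"].max())       # could
print(cfd["news"]["might"])       # 0 (unseen samples count as zero)
```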

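4. Context-free grammars (CFG)

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat' | 'park'
    V -> 'saw' | 'chased'
    P -> 'in' | 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog saw a cat in the park".split()
for tree in parser.parse(sentence):
    tree.pretty_print()

A CFG helps you understand syntactic ambiguity: the same sentence can have more than one parse tree. Here "in the park" can attach either to the noun phrase "a cat" or to the verb phrase, so the parser yields two trees.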
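
To make the ambiguity concrete, the sketch below re-declares the same grammar, collects all parses of the sentence, and reports where the prepositional phrase attaches in each. The helper name pp_attachment is an assumption introduced for illustration.

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat' | 'park'
    V -> 'saw' | 'chased'
    P -> 'in' | 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog saw a cat in the park".split()
trees = list(parser.parse(sentence))
print(len(trees))  # 2

def pp_attachment(tree):
    """Say whether the PP hangs off the verb phrase or the object noun phrase."""
    vp = tree[1]                                   # S -> NP VP, so child 1 is the VP
    child_labels = [child.label() for child in vp]
    return "verb phrase" if child_labels == ["VP", "PP"] else "noun phrase"

for tree in trees:
    print(pp_attachment(tree))
```

One parse reads "saw [a cat in the park]" and the other "[saw a cat] in the park". The grammar alone cannot say which reading is intended.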

5. Naive Bayes text classification

from nltk.classify import NaiveBayesClassifier

# Build features: a bag-of-words dict that marks each word as present
def extract_features(words):
    return {word: True for word in words}

# Training data
pos_sents = [("I love this movie".split(), "positive"),
             ("Great film amazing acting".split(), "positive")]
neg_sents = [("Terrible movie waste of time".split(), "negative"),
             ("Boring and slow".split(), "negative")]

train_data = [(extract_features(words), label) for words, label in pos_sents + neg_sents]
classifier = NaiveBayesClassifier.train(train_data)

# Predict
test = extract_features("Great acting".split())
print(classifier.classify(test))  # positive
classifier.show_most_informative_features(5)  # prints its own table and returns None
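
Beyond the single predicted label, the classifier can report a probability for each class through prob_classify. The sketch below repeats the toy training set with one change that is an assumption of this example rather than part of the code above: features are lowercased so that "Great" and "great" share evidence.

```python
from nltk.classify import NaiveBayesClassifier

def extract_features(words):
    # Bag-of-words presence features, lowercased (a choice made for this sketch).
    return {word.lower(): True for word in words}

labeled = [
    ("I love this movie", "positive"),
    ("Great film amazing acting", "positive"),
    ("Terrible movie waste of time", "negative"),
    ("Boring and slow", "negative"),
]
train_data = [(extract_features(text.split()), label) for text, label in labeled]
classifier = NaiveBayesClassifier.train(train_data)

# prob_classify returns a probability distribution over the labels.
dist = classifier.prob_classify(extract_features("great acting".split()))
for label in sorted(dist.samples()):
    print(f"{label}: {dist.prob(label):.2f}")
print("predicted:", dist.max())
```

With only four training sentences the probabilities carry little meaning; the point is the shape of the API.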

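6. NLTK vs spaCy: textbook vs industry

| Task | NLTK | spaCy |
|---|---|---|
| Teaching and learning | ⭐⭐⭐ | |
| Tokenization | Several algorithms to choose from | One tuned default |
| NER | Pretrained MaxEnt chunker (older, less accurate) | Statistical models (more accurate) |
| Speed | Slow (pure Python) | Very fast (Cython) |
| Parse trees | ⭐⭐⭐ (CFG) | ⭐ (dependency parsing only) |
| Corpora | ⭐⭐⭐ rich | ⭐ none built in |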

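7. Suggested learning path

Start with NLTK (understand the principles)
  → spaCy in practice (efficient production use)
    → HuggingFace Transformers (state-of-the-art accuracy)
      → LangChain (LLM applications)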

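Questions to think about

  1. What are the strengths and weaknesses of NLTK's Porter stemmer and WordNet lemmatizer?
  2. Why does industry tend to choose spaCy over NLTK? What does NLTK offer that is hard to replace?
  3. What challenges would you run into when parsing Chinese sentences with an NLTK CFG?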
