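Scrapy Crawlers: From the Basics to Countering Anti-Scraping Measures
Background
Web data is often described as the oil of the new era. With Scrapy, a few dozen lines of Python are enough to build a production-grade crawler: the framework takes care of request scheduling, retries, and rate limiting, and its item pipelines handle data storage.
Chapter 1: Scrapy Architecture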
[Spider] → [Engine] → [Scheduler] → [Downloader] → [Spider] → [Item Pipeline]
↓
[Middleware]
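- Spider: defines the crawling logic and extracts data from responses
- Engine: coordinates all the other components
- Scheduler: queues requests and filters out duplicate URLs
- Downloader: sends the HTTP requests
- Item Pipeline: cleans, validates, and stores the extracted data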
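To make the division of labor concrete, here is a toy model of the crawl loop in plain Python. It is a sketch, not Scrapy's implementation: the `PAGES` dict stands in for the Downloader, the `queue` and `seen` set for the Scheduler, `parse` for the Spider callback, and `pipeline` for the Item Pipeline. All of these names are invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
PAGES = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b", "/"]),
    "/b": ("Page B", ["/a"]),
}


def parse(url, page):
    """Spider role: turn a downloaded page into an item plus follow-up URLs."""
    title, links = page
    return {"url": url, "title": title}, links


def pipeline(item):
    """Item Pipeline role: clean the item before it is stored."""
    item["title"] = item["title"].upper()
    return item


def crawl(start_url):
    """Engine role: move requests and items between the other components."""
    queue = deque([start_url])  # Scheduler: pending requests
    seen = {start_url}          # Scheduler: duplicate filter
    items = []
    while queue:
        url = queue.popleft()
        page = PAGES[url]       # Downloader: "fetch" the page
        item, links = parse(url, page)
        items.append(pipeline(item))
        for link in links:
            if link not in seen:  # each URL is scheduled only once
                seen.add(link)
                queue.append(link)
    return items


results = crawl("/")
```

Each page is fetched exactly once even though the links form a cycle, which is the job the Scheduler's duplicate filter performs in a real crawl.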
Chapter 2: Item Pipelines for Data Cleaning and Storage
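# items.py
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()      # raw price string, e.g. "£51.77"
    # Filled in by PriceCleanPipeline; scrapy.Item raises KeyError
    # when an undeclared field is assigned, so it must be declared here
    price_gbp = scrapy.Field()
    rating = scrapy.Field()


# pipelines.py
import json

from scrapy.exceptions import DropItem


class PriceCleanPipeline:
    def process_item(self, item, spider):
        # "£51.77" -> 51.77; a missing or empty price is treated as £0
        raw = item.get("price") or "£0"
        try:
            item["price_gbp"] = float(raw.replace("£", "").strip())
        except ValueError:
            raise DropItem(f"Invalid price: {raw!r}")
        return item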
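class JsonWriterPipeline:
    def open_spider(self, spider):
        # ensure_ascii=False writes non-ASCII text as-is, so pin the encoding
        self.file = open("output.json", "w", encoding="utf-8")
        self.file.write("[\n")
        self.first_item = True

    def process_item(self, item, spider):
        # Write the separator before every item except the first, so the
        # array has no trailing comma and the file stays valid JSON
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(f"  {line}")
        return item

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()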
Enable both pipelines in settings.py (pipelines with lower numbers run first):
ITEM_PIPELINES = {
    "bookscraper.pipelines.PriceCleanPipeline": 300,
    "bookscraper.pipelines.JsonWriterPipeline": 800,
}
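If the output does not have to be a single JSON array, JSON Lines (one object per line) is simpler to write incrementally and to append to. The following is a minimal sketch under that assumption; `JsonLinesWriterPipeline` and the `output.jsonl` path are hypothetical names, not part of the project above. Scrapy's built-in feed exports can also produce this format without a custom pipeline.

```python
import json


class JsonLinesWriterPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="output.jsonl"):
        self.path = path  # assumed output location

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item instances and for plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

Because every line is a complete JSON document, a crawl that stops halfway still leaves a readable file, and there is no separator bookkeeping.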
Chapter 3: Middleware for Countering Anti-Scraping Measures
# middlewares.py: rotate the User-Agent header on every request
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy keep processing the request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)


# Automatic throttling (enable it in settings.py; no custom code needed)
# AUTOTHROTTLE_ENABLED = True
# AUTOTHROTTLE_START_DELAY = 1
# DOWNLOAD_DELAY = 0.5  # AutoThrottle never goes below this delay
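A custom downloader middleware only runs once it is registered in settings.py. A sketch of the entries, assuming the project package is `bookscraper` (as in the ITEM_PIPELINES example) and the class lives in `bookscraper/middlewares.py`; the priority 400 is an arbitrary choice that places it ahead of the built-in User-Agent middleware:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Optional: switch off the built-in middleware that applies USER_AGENT
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Register the custom middleware (the module path is an assumption)
    "bookscraper.middlewares.RandomUserAgentMiddleware": 400,
}
```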
Chapter 4: CrawlSpider and Rule-Based Crawling
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog"]

    rules = (
        # Follow article links and parse each article page
        Rule(LinkExtractor(restrict_css="article h2 a"), callback="parse_article"),
        # Follow pagination links and apply the rules again on each page
        Rule(LinkExtractor(restrict_css="a.next"), follow=True),
    )

    def parse_article(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "date": response.css("time::attr(datetime)").get(),
            "content": " ".join(response.css("article p::text").getall()),
        }
Chapter 5: Deployment and Monitoring
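# Deploy with Scrapyd
pip install scrapyd scrapyd-client
scrapyd  # start the service (listens on port 6800 by default)
# Deploy the project (run from the project directory; the target is read from scrapy.cfg)
scrapyd-deploy default
# Schedule a run through the HTTP API
curl http://localhost:6800/schedule.json -d project=bookscraper -d spider=books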
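The same scheduling call can be made from Python with the standard library only. This is a minimal sketch that mirrors the curl command above; the helper names `build_schedule_request` and `schedule` are invented for illustration, and the base URL assumes Scrapyd runs locally on its default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, base_url="http://localhost:6800"):
    """Build the POST request that asks Scrapyd to start one spider run."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule(project, spider, base_url="http://localhost:6800"):
    """Send the request and return Scrapyd's decoded JSON reply."""
    request = build_schedule_request(project, spider, base_url)
    with urlopen(request, timeout=10) as response:
        return json.load(response)


# Building the request does not need a running server; sending it does.
request = build_schedule_request("bookscraper", "books")
print(request.full_url, request.data)
```

Calling `schedule("bookscraper", "books")` requires a running Scrapyd instance with the project already deployed.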
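Questions for Reflection
- How can DUPEFILTER_CLASS be used to customize Scrapy's URL deduplication logic?
- When the target site renders its content with JavaScript (an SPA), what integration options does Scrapy offer?
- How can you gracefully resolve the conflict between Scrapy's allowed_domains restriction and calls to third-party APIs?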