Scrapy


Overview

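Scrapy is one of the most widely used asynchronous web crawling frameworks for Python, created by Pablo Hoffman and Shane Evans in 2008. It is a complete crawling ecosystem rather than just an HTTP request library.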

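What problems it solves

  • Writing high-performance, scalable web crawlers quickly
  • Providing a complete fetch → parse → store pipeline
  • Handling common crawling challenges such as anti-bot measures, rate limiting, and proxy rotation, largely through middleware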

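Key features

  • Built on the Twisted asynchronous engine; a single machine can reach on the order of thousands of requests per second under favorable conditions, though actual throughput depends on the target site, the network, and your settings
  • Several ways to extract data: CSS and XPath selectors plus regular expressions
  • Middleware architecture: Spider Middleware and Downloader Middleware
  • Built-in Item Pipeline for cleaning, validating, and persisting data
  • Exports to several storage backends, including FTP, S3, and GCS
  • Media pipelines (ImagesPipeline) that download and process images automatically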
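
As a rough illustration of the storage backends listed above, feed exports are configured through the FEEDS setting in settings.py. This is a sketch only: the bucket name, FTP host, credentials, and paths are hypothetical placeholders, and the S3 backend needs an extra client library and credentials that are not shown here.

```python
# settings.py (sketch): each key is an output URI, each value its export options.
# "example-bucket" and "ftp.example.com" are hypothetical placeholders.
FEEDS = {
    # Local file, one JSON object per line.
    "output/%(name)s.jsonl": {"format": "jsonlines", "encoding": "utf8"},
    # Amazon S3; %(name)s and %(time)s are filled in by Scrapy at export time.
    "s3://example-bucket/%(name)s/%(time)s.json": {"format": "json"},
    # FTP server.
    "ftp://user:password@ftp.example.com/exports/%(name)s.csv": {"format": "csv"},
}
```

With a setting like this in place, `scrapy crawl <spider>` writes to every listed destination without needing the -o flag.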

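Installation

Prerequisites

  • Operating system: Linux or macOS preferred; Windows works, but watch for text-encoding issues
  • Python version: 3.8 or later (3.10+ recommended); newer Scrapy releases may require a more recent minimum, so check the release notes for the version you install
  • Core dependencies: Twisted, lxml, parsel, cryptography (installed automatically)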

Install commands

# Basic install
pip install scrapy

# With commonly used helper libraries
pip install scrapy scrapy-splash selenium-wire fake-useragent

# Verify the installation
scrapy version

Create your first project

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

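Common installation problems

Q: On Windows, the install reports that Microsoft Visual C++ Build Tools are missing
Download Visual Studio Build Tools, select the "C++ build tools" workload, and install it. Alternatively, tell pip to use prebuilt wheels only:

pip install scrapy --only-binary :all:

Q: lxml fails to compile (macOS)

xcode-select --install
pip install lxml --no-cache-dir

Q: Twisted-related errors at runtime

pip install --upgrade twisted pyopenssl

Q: The scrapy command is not found inside a virtual environment

python -m scrapy version
# or re-activate the venv
source venv/bin/activate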
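
When the scrapy command goes missing, it often helps to confirm which interpreter is active and whether Scrapy is installed for it. The following is a small diagnostic sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# sys.prefix differs from sys.base_prefix when a virtual environment is active.
print("interpreter :", sys.executable)
print("in a venv   :", sys.prefix != sys.base_prefix)
for name in ("scrapy", "twisted", "lxml"):
    print(f"{name:<12}:", installed_version(name) or "not installed")
```

If the interpreter path points outside your venv, the environment is probably not activated in the current shell.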

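Examples

Scrapy Hello World: scraping book information

Goal

Scrape the title, price, and rating of each book on books.toscrape.com and save the results as JSON.

Full code

1. Create the project

scrapy startproject bookscraper
cd bookscraper
scrapy genspider books books.toscrape.com

2. Spider code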

# bookscraper/spiders/books.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

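    def parse(self, response):
        """Parse a book listing page."""
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                # The class attribute looks like "star-rating Three"; re_first
                # returns None instead of raising if the element is missing
                "rating": book.css("p.star-rating::attr(class)").re_first(r"star-rating\s+(\w+)"),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow pagination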
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

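3. Run the spider

# Write items to a JSON file (-o appends to an existing file; use -O to overwrite it)
scrapy crawl books -o books.json

# Watch the scraped items in the console log
scrapy crawl books

Expected output (excerpt from books.json)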

[
  {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."},
  {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One", "url": "..."},
  {"title": "Soumission", "price": "£50.10", "rating": "One", "url": "..."}
]
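
Because the spider stores price and rating as strings, a short post-processing step is often useful. The sketch below converts one record from the expected output above into numeric values. The `normalize` helper and the `RATING_WORDS` mapping are illustrative; the mapping assumes the site encodes ratings as the words One through Five.

```python
import json

# Assumption: ratings appear as the words One..Five.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def normalize(book):
    """Return a copy of one scraped record with numeric price and rating."""
    return {
        **book,
        "price": float(book["price"].lstrip("£")),
        "rating": RATING_WORDS.get(book["rating"]),
    }


# Sample taken from the expected output; in practice, read books.json instead:
#   with open("books.json", encoding="utf-8") as f: raw = json.load(f)
sample = '[{"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."}]'
books = [normalize(b) for b in json.loads(sample)]
print(books[0]["price"], books[0]["rating"])  # → 51.77 3
```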

Equivalent version using XPath

def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            "title": book.xpath('.//h3/a/@title').get(),
            "price": book.xpath('.//p[@class="price_color"]/text()').get(),
        }

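Tutorial

Scrapy crawling: from first steps to countering anti-bot measures

Background

Web data is often called the oil of the new era. With a few dozen lines of Python, Scrapy lets you build a production-grade crawler that handles request scheduling, retries, rate limiting, and data storage for you.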


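Chapter 1: Scrapy architecture

[Spider] → [Engine] → [Scheduler] → [Downloader] → [Spider] → [Item Pipeline]
                ↓
           [Middleware]

  • Spider: defines the crawl logic and the data extraction
  • Engine: coordinates all the components
  • Scheduler: filters duplicate requests and manages the scheduling queue
  • Downloader: sends HTTP requests and returns the responses
  • Item Pipeline: cleans, validates, and stores the data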
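
To make the Engine and Scheduler roles above more concrete, here is a toy model in plain Python. It is a conceptual sketch only, not Scrapy's implementation: Scrapy works on Request objects, request fingerprints, and priority queues, whereas `ToyScheduler`, `crawl`, and the three-page `SITE` are invented for illustration.

```python
from collections import deque


class ToyScheduler:
    """Illustrative FIFO queue with duplicate filtering (not Scrapy's real class)."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        """Queue `url` unless it was seen before; return True if it was queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to download, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def crawl(start_urls, fetch_links):
    """Engine loop: pull from the scheduler, 'download', feed new links back."""
    scheduler = ToyScheduler()
    for url in start_urls:
        scheduler.enqueue(url)
    visited = []
    while (url := scheduler.next_url()) is not None:
        visited.append(url)
        for link in fetch_links(url):  # stands in for Downloader + Spider
            scheduler.enqueue(link)
    return visited


# Hypothetical three-page site with a link cycle between /a and /b.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/a"]}
print(crawl(["/"], lambda url: SITE.get(url, [])))  # → ['/', '/a', '/b']
```

The duplicate filter is what keeps the loop from running forever on the /a and /b cycle.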

Chapter 2: Item Pipeline (data cleaning and storage)

# items.py
import scrapy

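class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()
    # Filled in by PriceCleanPipeline; an Item rejects fields that are not declared
    price_gbp = scrapy.Field()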


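# pipelines.py
import json
from scrapy.exceptions import DropItem

class PriceCleanPipeline:
    """Convert the raw price string into a float stored under price_gbp."""

    def process_item(self, item, spider):
        # "£51.77" → 51.77; a missing or None price falls back to "£0"
        raw = item.get("price") or "£0"
        try:
            item["price_gbp"] = float(raw.replace("£", ""))
        except ValueError:
            raise DropItem(f"Invalid price: {raw!r}")
        return item

class JsonWriterPipeline:
    """Write all items to output.json as a single valid JSON array."""

    def open_spider(self, spider):
        # Explicit UTF-8 so non-ASCII characters such as "£" are written reliably
        self.file = open("output.json", "w", encoding="utf-8")
        self.file.write("[\n")
        self.first_item = True

    def process_item(self, item, spider):
        # Write the comma before every item except the first, so the array
        # never ends with a trailing comma (which would be invalid JSON)
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write("  " + json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()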

Enable both in settings.py (pipelines run in ascending order of their number):

ITEM_PIPELINES = {
    "bookscraper.pipelines.PriceCleanPipeline": 300,
    "bookscraper.pipelines.JsonWriterPipeline": 800,
}
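
The numbers in ITEM_PIPELINES decide the order: lower values run first, and an item dropped by an earlier stage never reaches a later one. The sketch below imitates that behavior with plain functions so it runs without Scrapy. `DropItem` here is a local stand-in for scrapy.exceptions.DropItem, and `run_pipelines`, `clean_price`, and `collect` are hypothetical helpers, not Scrapy APIs.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


def clean_price(item):
    """Same idea as PriceCleanPipeline: parse "£51.77" into a float."""
    raw = item.get("price") or "£0"
    try:
        item["price_gbp"] = float(raw.replace("£", ""))
    except ValueError:
        raise DropItem(f"Invalid price: {raw!r}")
    return item


def collect(store):
    """Return a stage that appends each surviving item to `store`."""
    def stage(item):
        store.append(item)
        return item
    return stage


def run_pipelines(item, stages):
    """Pass `item` through the stages in ascending priority; return None if dropped."""
    for stage, _priority in sorted(stages.items(), key=lambda pair: pair[1]):
        try:
            item = stage(item)
        except DropItem:
            return None
    return item


saved = []
stages = {collect(saved): 800, clean_price: 300}  # declared out of order on purpose
run_pipelines({"title": "A", "price": "£51.77"}, stages)
run_pipelines({"title": "B", "price": "free"}, stages)  # dropped by clean_price
print(saved)  # → [{'title': 'A', 'price': '£51.77', 'price_gbp': 51.77}]
```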

Chapter 3: Middleware (countering anti-bot measures)

# middlewares.py: rotate the User-Agent header on every request
import random

# The strings below are truncated placeholders; use complete User-Agent values
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy keep processing the request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# Automatic throttling is built in: enable it in settings.py, no custom code needed
# AUTOTHROTTLE_ENABLED = True
# AUTOTHROTTLE_START_DELAY = 1
# DOWNLOAD_DELAY = 0.5
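
A downloader middleware only takes effect once it is registered in settings.py. The fragment below is a minimal sketch of that registration together with the throttling settings mentioned above. The module path assumes the project is named bookscraper and the class lives in middlewares.py, and the priority 400 and delay values are illustrative choices rather than recommendations.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middleware that applies a single fixed User-Agent
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Assumed module path: project "bookscraper", class defined in middlewares.py
    "bookscraper.middlewares.RandomUserAgentMiddleware": 400,
}

# AutoThrottle adapts the delay between requests to the observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30    # upper bound when latency is high
DOWNLOAD_DELAY = 0.5           # lower bound; AutoThrottle does not go below it
```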

Chapter 4: CrawlSpider (rule-based crawling)

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog"]

    rules = (
        # Follow article links and parse each article
        Rule(LinkExtractor(restrict_css="article h2 a"), callback="parse_article"),
        # Follow pagination links
        Rule(LinkExtractor(restrict_css="a.next"), follow=True),
    )

    def parse_article(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "date": response.css("time::attr(datetime)").get(),
            "content": " ".join(response.css("article p::text").getall()),
        }

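Chapter 5: Deployment and monitoring

# Deploy with Scrapyd
pip install scrapyd scrapyd-client
scrapyd  # start the service (listens on port 6800 by default)

# Deploy the project (the [deploy] section of scrapy.cfg must set url, e.g. http://localhost:6800/)
scrapyd-deploy default

# Schedule a run through the HTTP API
curl http://localhost:6800/schedule.json -d project=bookscraper -d spider=books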
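
The same scheduling call can be made from Python, which is handy when runs are triggered from another program. This is a sketch using only the standard library. The helper names `build_schedule_request` and `schedule` are made up for this example, and the base URL assumes a local Scrapyd on its default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, base_url="http://localhost:6800"):
    """Build the POST request that the curl command above sends."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    # Supplying a body makes urllib issue a POST, matching curl -d.
    return Request(f"{base_url}/schedule.json", data=body)


def schedule(project, spider, base_url="http://localhost:6800"):
    """Send the request and return Scrapyd's decoded JSON reply."""
    request = build_schedule_request(project, spider, base_url)
    with urlopen(request, timeout=10) as response:
        return json.load(response)


request = build_schedule_request("bookscraper", "books")
print(request.get_method(), request.full_url)
# schedule("bookscraper", "books")  # needs a running Scrapyd with the project deployed
```

Building the request separately from sending it makes the call easy to inspect before any network traffic happens.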

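Questions to consider

  1. How can DUPEFILTER_CLASS be used to customize Scrapy's request deduplication logic?
  2. When the target site renders its content with JavaScript (an SPA), what integration options does Scrapy offer?
  3. How can you cleanly reconcile Scrapy's allowed_domains restriction with calls to third-party APIs?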