
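Scrapy Hello World: Scraping Book Information

Goal

Scrape the title, price, rating, and detail-page URL of every book listed on books.toscrape.com, and save the results as JSON.

Full code

1. Create the project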

scrapy startproject bookscraper
cd bookscraper
scrapy genspider books books.toscrape.com

2. Spider code

# bookscraper/spiders/books.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        """Parse a book listing page and follow pagination."""
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                # The class attribute looks like "star-rating Three"; re_first
                # returns None instead of raising if the element is missing.
                "rating": book.css("p.star-rating::attr(class)").re_first(r"star-rating\s+(\w+)"),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Follow the "next" link until the last page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
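
The spider yields the price and rating as raw strings (for example "£51.77" and "Three"). As a minimal sketch, assuming you would rather store numbers, the helpers below show one way to normalize them. `parse_price`, `parse_rating`, and `RATING_WORDS` are hypothetical names introduced here for illustration; they are not part of Scrapy or of the spider above.

```python
import re

# Assumed mapping from the word in the star-rating class to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(raw):
    """Return the numeric part of a price string such as '£51.77' as a float, or None."""
    if not raw:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def parse_rating(raw):
    """Map a rating word such as 'Three' to 3; unknown or missing values give None."""
    return RATING_WORDS.get(raw)


print(parse_price("£51.77"))
print(parse_rating("Three"))
```

These functions could be called inside `parse` before yielding each item, or from an item pipeline if you prefer to keep the spider itself minimal.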

3. Run the spider

# Export the scraped items to a JSON file
scrapy crawl books -o books.json

# Watch the output live in the console (nothing is saved)
scrapy crawl books
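
As an alternative to passing -o on every run, the export target can be declared in the project settings. The following is a minimal sketch, assuming a Scrapy version recent enough to support the `FEEDS` setting and its `overwrite` option; the file name and option values are illustrative choices, not something the original project prescribes.

```python
# bookscraper/settings.py (excerpt)
FEEDS = {
    "books.json": {
        "format": "json",     # same exporter that -o books.json selects
        "encoding": "utf8",   # write "£" as-is rather than as an escape sequence
        "overwrite": True,    # replace the file on each run instead of appending
        "indent": 2,          # pretty-print for easier reading
    },
}
```

With a setting like this in place, a plain `scrapy crawl books` should write books.json without any extra command-line option.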

Expected output (excerpt from books.json)

[
  {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."},
  {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One", "url": "..."},
  {"title": "Soumission", "price": "£50.10", "rating": "One", "url": "..."}
]
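
Once the file exists, a quick sanity check can confirm the export looks reasonable. The sketch below is an illustration only: `summarize` is a hypothetical helper, and the sample data is copied from the excerpt above so the snippet runs without the real file.

```python
import json
from collections import Counter


def summarize(books):
    """Return (item count, average price, rating counts) for scraped items."""
    prices = [float(b["price"].lstrip("£")) for b in books if b.get("price")]
    average = round(sum(prices) / len(prices), 2) if prices else None
    ratings = Counter(b.get("rating") for b in books)
    return len(books), average, ratings


# Sample data taken from the excerpt; for a real run, load the file instead:
#     with open("books.json", encoding="utf-8") as f:
#         books = json.load(f)
sample = json.loads("""
[
  {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."},
  {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One", "url": "..."},
  {"title": "Soumission", "price": "£50.10", "rating": "One", "url": "..."}
]
""")

count, average, ratings = summarize(sample)
print(count, average, dict(ratings))
```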

Equivalent version using XPath

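def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            "title": book.xpath('.//h3/a/@title').get(),
            "price": book.xpath('.//p[@class="price_color"]/text()').get(),
            # The class value is e.g. "star-rating Three", so match with contains()
            "rating": book.xpath('.//p[contains(@class, "star-rating")]/@class').re_first(r"star-rating\s+(\w+)"),
            "url": response.urljoin(book.xpath('.//h3/a/@href').get()),
        }
    next_page = response.xpath('//li[@class="next"]/a/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)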

Info

Path: /tech-stacks/scrapy/examples/Hello World — 爬取图书信息.md
Last updated: 2026/5/30