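Scrapy Hello World: Scraping Book Information
Goal
Scrape the title, price, rating, and detail-page URL of each book on books.toscrape.com and save the results as JSON.
Complete Code
1. Create the project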
scrapy startproject bookscraper
cd bookscraper
scrapy genspider books books.toscrape.com
2. Spider code
# bookscraper/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        """Parse a book listing page."""
        for book in response.css("article.product_pod"):
            # The class attribute looks like "star-rating Three". Default to ""
            # so a missing element does not raise AttributeError on None.
            rating_class = book.css("p.star-rating::attr(class)").get(default="")
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "rating": rating_class.split()[-1] if rating_class else None,
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }

        # Pagination: follow the "next" link until there is none left.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
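The spider yields the price and rating as raw strings (for example "£51.77" and "Three"). If numeric values are needed downstream, a cleaning step along the following lines could be applied. This is only a sketch: the helper names `parse_price` and `parse_rating` and the `RATING_WORDS` table are illustrative and not part of Scrapy, and the mapping assumes the site only uses the words One through Five.

```python
import re
from typing import Optional

# Assumption: ratings appear only as the words One..Five.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Extract the numeric part of a price string such as '£51.77'."""
    if not raw:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def parse_rating(word: Optional[str]) -> Optional[int]:
    """Map a rating word such as 'Three' to an integer, or None if unknown."""
    return RATING_WORDS.get(word) if word else None


# Example item in the shape the spider yields.
item = {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three"}
cleaned = {
    **item,
    "price": parse_price(item["price"]),
    "rating": parse_rating(item["rating"]),
}
print(cleaned)
```

In a real project this kind of conversion would more likely live in an item pipeline than in the spider itself, so that parsing and cleaning stay separate.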
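3. Run the spider
# Write the output to JSON (-o appends to an existing file; use -O to overwrite it)
scrapy crawl books -o books.json
# Watch the scraped items live in the console log
scrapy crawl books
Expected output (excerpt from books.json)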
[
{"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."},
{"title": "Tipping the Velvet", "price": "£53.74", "rating": "One", "url": "..."},
{"title": "Soumission", "price": "£50.10", "rating": "One", "url": "..."}
]
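Once the file exists, it can be checked with a few lines of standard-library Python. The sketch below uses the three records from the excerpt above as inline sample data so that it runs without the crawl. Reading the real `books.json` is shown as a comment, and the variable names are illustrative.

```python
import json
from statistics import mean

# Sample data copied from the excerpt above. For the real file, use instead:
#   with open("books.json", encoding="utf-8") as f:
#       books = json.load(f)
sample = """
[
  {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..."},
  {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One", "url": "..."},
  {"title": "Soumission", "price": "£50.10", "rating": "One", "url": "..."}
]
"""
books = json.loads(sample)

# Strip the currency symbol before converting the price to a number.
prices = [float(book["price"].lstrip("£")) for book in books]
average_price = round(mean(prices), 2)

print(f"{len(books)} books, average price £{average_price}")
```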
Equivalent using XPath
def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            "title": book.xpath('.//h3/a/@title').get(),
            "price": book.xpath('.//p[@class="price_color"]/text()').get(),
        }
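The XPath version above omits the rating. A test such as `@class="product_pod"` compares the whole attribute string, so it only matches elements whose class attribute is exactly that value. The rating element's class attribute carries two tokens (the CSS spider takes the last one with `split()[-1]`), so an exact match on `star-rating` would not find it. One common workaround is a token-aware predicate, sketched below. The helper name `has_class` is hypothetical and not part of Scrapy.

```python
def has_class(name: str) -> str:
    """Return an XPath predicate that matches `name` as one class token."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


# XPath that selects the class attribute of the rating paragraph.
rating_xpath = f".//p[{has_class('star-rating')}]/@class"
print(rating_xpath)

# Inside parse(), this could be used roughly as follows (untested sketch):
#   rating_class = book.xpath(rating_xpath).get(default="")
#   rating = rating_class.split()[-1] if rating_class else None
```

Padding the attribute with spaces before the `contains` test keeps a class such as `star-rating-large` from matching by accident.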