[Learning Web Scraping from Scratch] Chapter 5: Scrapy Data Parsing in Practice (Part 2)

I. Project Setup

1. Create the project

scrapy startproject qiubaiPro

2. Create the spider file

Goal: scrape the "Text" (jokes) section of Qiushibaike at https://www.qiushibaike.com/text/ and parse each post's author name and content.

cd qiubaiPro

scrapy genspider qiubai www.qiushibaike.com

3. Edit the settings file

ROBOTSTXT_OBEY = False

# Only show log messages of the specified level
LOG_LEVEL = 'ERROR'

# Set a browser User-Agent (UA spoofing)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'

II. Writing the Spider

1. Write the code

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse the author name and the post content.
        # response.xpath() is not the same method as etree.xpath(), but it is used almost identically.
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        for div in div_list:
            # xpath() returns a list whose elements are always Selector objects.
            # extract() pulls out the string stored in a Selector's data attribute.
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Alternatively, extract_first() runs extract() on element 0 of the list.
            # It is convenient when the list holds only one element.
            # author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            # Calling extract() on the list extracts the data string of every Selector in it.
            content = div.xpath('./a[1]/div/span//text()').extract()   # the text contains <br> tags, so use //
            # Join the list into a single string.
            content = ''.join(content)
            print(author, content)
            break  # stop after the first post

Note: in Scrapy, xpath() returns a list of Selector objects. To get a string, call extract() to take out the string stored in the Selector's data.
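To see why the content XPath yields several strings that then need joining, here is a small stand-alone sketch. It uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and the markup is made up for illustration. Here itertext() plays roughly the role that //text() followed by extract() plays in the spider.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating a post body whose text is split by <br/> tags.
snippet = "<span>line one<br/>line two<br/>line three</span>"

span = ET.fromstring(snippet)

# itertext() walks every text node under the element, much like
# collecting //text() in the spider: one string per fragment.
fragments = list(span.itertext())

# Join the fragments into a single string, as the spider does.
content = "".join(fragments)

print(fragments)
print(content)
```

A single /text() step would only see the text directly inside the span and before the first <br/>, which is why the spider uses // for the content.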

2. Run the spider

scrapy crawl qiubai

III. Persistent Storage

1. Storage via the command line

① Building on the code above, store each record in a dictionary.

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # Storage via the command line

    def parse(self, response):
        # Parse the author name and the post content.
        # response.xpath() is not the same method as etree.xpath(), but it is used almost identically.
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        all_data = []
        for div in div_list:
            # xpath() returns a list whose elements are always Selector objects.
            # extract() pulls out the string stored in a Selector's data attribute.
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            # Calling extract() on the list extracts the data string of every Selector in it.
            content = div.xpath('./a[1]/div/span//text()').extract()   # the text contains <br> tags, so use //
            # Join the list into a single string.
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)
        return all_data

② Run the export command in the terminal

scrapy crawl qiubai -o ./qiubai.csv

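③ Notes

- Requirement: only the return value of the parse method can be saved to a local file this way

- Note: the output file type can only be one of: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

- Command: scrapy crawl xxx -o filePath

- Pros: concise, efficient, and convenient

- Cons: quite limited (data can only be stored in files with the extensions listed above)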
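As a rough picture of what the -o export produces for a list of dictionaries, the sketch below writes similar records to CSV with the standard library. It is not Scrapy's exporter, and the sample records are invented. It only shows that each returned dictionary becomes one row and its keys become the header.

```python
import csv
import io

# Invented records shaped like the dictionaries returned by parse().
all_data = [
    {"author": "user_a", "content": "first joke"},
    {"author": "user_b", "content": "second joke"},
]

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "content"], lineterminator="\n")
writer.writeheader()
writer.writerows(all_data)

csv_text = buffer.getvalue()
print(csv_text)
```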

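2. Storage via the pipeline

① Workflow:

- Parse the data

- Define the corresponding fields in the item class (edit items.py)

- Wrap the parsed data in an item object

- Submit the item object to the pipeline for persistent storage

- In the pipeline class's process_item, persist the data carried by the item object it receives

- Enable the pipeline in the settings file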

② Edit items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

③ Edit pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class QiubaiproPipeline:
    fp = None

    # Hook that Scrapy calls only once, when the spider starts.
    def open_spider(self, spider):
        print("Spider started...")
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # Handles item objects:
    # receives each item submitted by the spider file,
    # and is called once for every item received.
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')

        return item

    # Hook that Scrapy calls only once, when the spider finishes.
    def close_spider(self, spider):
        print("Spider finished!")
        self.fp.close()
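The call order Scrapy follows for a pipeline can be tried without running a crawl. The following is a minimal sketch, assuming a simplified pipeline: the class name DemoPipeline, the temporary file, and the sample items are all hypothetical. It calls the three hooks by hand: open_spider once, process_item once per item, close_spider once.

```python
import os
import tempfile


class DemoPipeline:
    """Hypothetical stand-in for the pipeline above, writing to a given path."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["author"] + ":" + item["content"] + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.fp.close()


# Drive the hooks manually; spider=None is enough for this sketch.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
pipeline = DemoPipeline(path)
pipeline.open_spider(None)
for item in [{"author": "user_a", "content": "first"},
             {"author": "user_b", "content": "second"}]:
    pipeline.process_item(item, None)
pipeline.close_spider(None)

with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the file once in open_spider and closing it in close_spider avoids reopening the file for every item.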

④ Edit settings.py

# Enable the pipeline by uncommenting this setting in the file
ITEM_PIPELINES = {
   'qiubaiPro.pipelines.QiubaiproPipeline': 300,
}
# 300 is the priority: the lower the number, the higher the priority
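To make the priority rule concrete, the sketch below sorts a pipeline mapping by its numeric value, which gives the order in which items would pass through the pipelines. The MysqlPipeline entry is a hypothetical name added only for illustration, and the sort is a stand-in for what Scrapy does internally.

```python
# Hypothetical settings: MysqlPipeline does not exist in this project.
ITEM_PIPELINES = {
    "qiubaiPro.pipelines.MysqlPipeline": 301,
    "qiubaiPro.pipelines.QiubaiproPipeline": 300,
}

# Lower numbers run first, so sort ascending by value.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

With more than one pipeline enabled, each process_item has to return the item so that the next pipeline in this order receives it.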

⑤ Complete spider code

import scrapy
from qiubaiPro.items import QiubaiproItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']


    # Storage via the pipeline

    def parse(self, response):
        # Parse the author name and the post content.
        # response.xpath() is not the same method as etree.xpath(), but it is used almost identically.
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        for div in div_list:
            # xpath() returns a list whose elements are always Selector objects.
            # extract() pulls out the string stored in a Selector's data attribute.
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            # Calling extract() on the list extracts the data string of every Selector in it.
            content = div.xpath('./a[1]/div/span//text()').extract()   # the text contains <br> tags, so use //
            # Join the list into a single string.
            content = ''.join(content)

            # Wrap the parsed data in an item object.
            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            # Submit the item to the pipeline.
            yield item
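
Switching from return all_data to yield item turns parse into a generator, so each item is handed over as soon as it is built rather than after the whole page has been processed. Below is a minimal stand-alone sketch of that behaviour in plain Python with invented data. The function parse_rows and its input are hypothetical and not part of the project.

```python
def parse_rows(rows):
    """Hypothetical stand-in for parse(): yields one dict per (author, content) pair."""
    for author, content in rows:
        yield {"author": author, "content": content}


rows = [("user_a", "first"), ("user_b", "second")]

gen = parse_rows(rows)      # nothing has run yet; gen is a generator object
first_item = next(gen)      # the body runs only up to the first yield
remaining = list(gen)       # consuming the rest drains the generator

print(first_item)
print(remaining)
```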