scrapy爬虫框架介绍

By admin in 4858.com on 2019年8月31日

scrapy,

 一、介绍

   Scrapy贰个开源和搭档的框架,其初期是为着页面抓取 (更合适来讲,
互联网抓取
)所设计的,使用它能够以异常快、轻巧、可扩充的点子从网址中领取所需的数额。但眼下Scrapy的用处丰富常见,可用来如数据发掘、监测和自动化测验等世界,也足以采纳在猎取API所再次回到的数据(比如亚马逊(Amazon) Associates Web Services ) 也许通用的互连网爬虫。

    Scrapy
是依附twisted框架开采而来,twisted是八个流行的事件驱动的python互连网框架。因而Scrapy使用了一种非阻塞(又名异步)的代码来落到实处产出。全体架构大概如下

4858.com 1

The data flow in Scrapy is controlled by the execution engine, and
goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the
    next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through
    the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a
    Response (with that page) and sends it to the Engine, passing
    through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to
    the Spider for processing, passing through the Spider
    Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new
    Requests (to follow) to the Engine, passing through the Spider
    Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then send
    processed Requests to the Scheduler and asks for possible next
    Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests
    from the Scheduler.

 Components:

身处EGINE和SPIDE福特ExplorerS之间,主要办事是管理SPIDE奥迪Q3S的输入(即responses)和输出(即requests)

官方网址链接:

二、安装

 

#Windows平台
    1、pip3 install wheel #安装后,便支持通过wheel文件安装软件,wheel文件官网:https://www.lfd.uci.edu/~gohlke/pythonlibs
    3、pip3 install lxml
    4、pip3 install pyopenssl
    5、下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/pywin32/
    6、下载twisted的wheel文件:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    7、执行pip3 install 下载目录\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    8、pip3 install scrapy

#Linux平台
    1、pip3 install scrapy

 

三、命令行工具

#1 查看帮助
    scrapy -h
    scrapy <command> -h

#2 有两种命令:其中Project-only必须切到项目文件夹下才能执行,而Global的命令则不需要
    Global commands:
        startproject #创建项目
        genspider    #创建爬虫程序
        settings     #如果是在项目目录下,则得到的是该项目的配置
        runspider    #运行一个独立的python文件,不必创建项目
        shell        #scrapy shell url地址  在交互式调试,如选择器规则正确与否
        fetch        #独立于程单纯地爬取一个页面,可以拿到请求头
        view         #下载完毕后直接弹出浏览器,以此可以分辨出哪些数据是ajax请求
        version      #scrapy version 查看scrapy的版本,scrapy version -v查看scrapy依赖库的版本
    Project-only commands:
        crawl        #运行爬虫,必须创建项目才行,确保配置文件中ROBOTSTXT_OBEY = False
        check        #检测项目中有无语法错误
        list         #列出项目中所包含的爬虫名
        edit         #编辑器,一般不用
        parse        #scrapy parse url地址 --callback 回调函数  #以此可以验证我们的回调函数是否正确
        bench        #scrapy bentch压力测试

#3 官网链接
    https://docs.scrapy.org/en/latest/topics/commands.html

4858.com 2

#1、执行全局命令:请确保不在某个项目的目录下,排除受该项目配置的影响
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX #如果切换到项目目录下,看到的则是该项目的配置

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com #如果页面显示内容不全,不全的内容则是ajax请求实现的,以此快速定位问题

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version #scrapy的版本

scrapy version -v #依赖库的版本


#2、执行项目命令:切到项目目录下
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench

示范用法

四、项目布局以及爬虫应用简要介绍

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           爬虫1.py
           爬虫2.py
           爬虫3.py

文本表明:

  • scrapy.cfg
     项目标主配置新闻,用来铺排scrapy时采纳,爬虫相关的布置音讯在settings.py文件中。
  • items.py    设置数据存储模板,用于结构化数据,如:Django的Model
  • pipelines    数据管理作为,如:一般结构化的数量长久化
  • settings.py
    配置文件,如:递归的层数、并发数,延迟下载等。重申:配置文件的选项必得大写不然正是无效,正确写法USE奇骏_AGENT=’xxxx’
  • spiders      爬虫目录,如:创设文件,编写爬虫法规

注意:一般创制爬虫文件时,以网址域名命名

4858.com 3

#在项目目录下新建:entrypoint.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])

暗中同意只好在cmd中实施爬虫,假诺想在pycharm中试行供给做
4858.com 4

import sys,os
sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

关于windows编码

五、Spiders

1、介绍

#1、Spiders是由一系列类(定义了一个网址或一组网址将被爬取)组成,具体包括如何执行爬取任务并且如何从页面中提取结构化的数据。

#2、换句话说,Spiders是你为了一个特定的网址或一组网址自定义爬取和解析页面行为的地方

2、Spiders会循环做如下事情

#1、生成初始的Requests来爬取第一个URLS,并且标识一个回调函数
第一个请求定义在start_requests()方法内默认从start_urls列表中获得url地址来生成Request请求,默认的回调函数是parse方法。回调函数在下载完成返回response时自动触发

#2、在回调函数中,解析response并且返回值
返回值可以4种:
        包含解析数据的字典
        Item对象
        新的Request对象(新的Requests也需要指定一个回调函数)
        或者是可迭代对象(包含Items或Request)

#3、在回调函数中解析页面内容
通常使用Scrapy自带的Selectors,但很明显你也可以使用Beutifulsoup,lxml或其他你爱用啥用啥。

#4、最后,针对返回的Items对象将会被持久化到数据库
通过Item Pipeline组件存到数据库:https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
或者导出到不同的文件(通过Feed exports:https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

3、Spiders总共提供了五连串

#1、scrapy.spiders.Spider #scrapy.Spider等同于scrapy.spiders.Spider
#2、scrapy.spiders.CrawlSpider
#3、scrapy.spiders.XMLFeedSpider
#4、scrapy.spiders.CSVFeedSpider
#5、scrapy.spiders.SitemapSpider

4、导入使用

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): #自定义类,继承Spiders提供的基类
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass

5、class scrapy.spiders.spider

那是最简便的spider类,任何其余的spider类都急需持续它(包括你本身定义的)。

该类不提供别的例外的功效,它仅提供了三个暗中同意的start_requests方法暗许从start_urls中读取url地址发送requests央浼,並且私下认可parse作为回调函数

class AmazonSpider(scrapy.Spider):
    name = 'amazon' 

    allowed_domains = ['www.amazon.cn'] 

    start_urls = ['http://www.amazon.cn/']

    custom_settings = {
        'BOT_NAME' : 'Egon_Spider_Amazon',
        'REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }

    def parse(self, response):
        pass

4858.com 5

#1、name = 'amazon' 
定义爬虫名,scrapy会根据该值定位爬虫程序
所以它必须要有且必须唯一(In Python 2 this must be ASCII only.)

#2、allowed_domains = ['www.amazon.cn'] 
定义允许爬取的域名,如果OffsiteMiddleware启动(默认就启动),
那么不属于该列表的域名及其子域名都不允许爬取
如果爬取的网址为:https://www.example.com/1.html,那就添加'example.com'到列表.

#3、start_urls = ['http://www.amazon.cn/']
如果没有指定url,就从该列表中读取url来生成第一个请求

#4、custom_settings
值为一个字典,定义一些配置信息,在运行爬虫程序时,这些配置会覆盖项目级别的配置
所以custom_settings必须被定义成一个类属性,由于settings会在类实例化前被加载

#5、settings
通过self.settings['配置项的名字']可以访问settings.py中的配置,如果自己定义了custom_settings还是以自己的为准

#6、logger
日志名默认为spider的名字
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#5、crawler:了解
该属性必须被定义到类方法from_crawler中

#6、from_crawler(crawler, *args, **kwargs):了解
You probably won’t need to override this directly  because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#7、start_requests()
该方法用来发起第一个Requests请求,且必须返回一个可迭代的对象。它在爬虫程序打开时就被Scrapy调用,Scrapy只调用它一次。
默认从start_urls里取出每个url来生成Request(url, dont_filter=True)

#针对参数dont_filter,请看自定义去重规则

如果你想要改变起始爬取的Requests,你就需要覆盖这个方法,例如你想要起始发送一个POST请求,如下
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

#8、parse(response)
这是默认的回调函数,所有的回调函数必须返回an iterable of Request and/or dicts or Item objects.

#9、log(message[, level, component]):了解
Wrapper that sends a log message through the Spider’s logger, kept for backwards compatibility. For more information see Logging from Spiders.

#10、closed(reason)
爬虫程序结束时自动触发

定制scrapy.spider属性与措施详解
4858.com 6

去重规则应该多个爬虫共享的,但凡一个爬虫爬取了,其他都不要爬了,实现方式如下

#方法一:
1、新增类属性
visited=set() #类属性

2、回调函数parse方法内:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url) 

#方法一改进:针对url可能过长,所以我们存放url的hash值
def parse(self, response):
        url=md5(response.request.url)
    if url in self.visited:
        return None
    .......

    self.visited.add(url) 

#方法二:Scrapy自带去重功能
配置文件:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默认的去重规则帮我们去重,去重规则在内存中
DUPEFILTER_DEBUG = False
JOBDIR = "保存范文记录的日志路径,如:/root/"  # 最终路径为 /root/requests.seen,去重规则放文件中

scrapy自带去重规则默认为RFPDupeFilter,只需要我们指定
Request(...,dont_filter=False) ,如果dont_filter=True则告诉Scrapy这个URL不参与去重。

#方法三:
我们也可以仿照RFPDupeFilter自定义去重规则,

from scrapy.dupefilter import RFPDupeFilter,看源码,仿照BaseDupeFilter

#步骤一:在项目目录下自定义去重文件dup.py
class UrlFilter(object):
    def __init__(self):
        self.visited = set() #或者放到数据库

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

#步骤二:配置文件settings.py:
DUPEFILTER_CLASS = '项目名.dup.UrlFilter'


# 源码分析:
from scrapy.core.scheduler import Scheduler
见Scheduler下的enqueue_request方法:self.df.request_seen(request)

去重准绳:去除重复的url
4858.com 7

#例一:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)


#例二:一个回调函数返回多个Requests和Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)


#例三:在start_requests()内直接指定起始爬取的urls,start_urls就没有用了,

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

例子
4858.com 8

我们可能需要在命令行为爬虫程序传递参数,比如传递初始的url,像这样
#命令行执行
scrapy crawl myspider -a category=electronics

#在__init__方法中可以接收外部传进来的参数
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...


#注意接收的参数全都是字符串,如果想要结构化的数据,你需要用类似json.loads的方法

参数字传送递

6、别的通用Spiders:

六、Selectors

#1 //与/
#2 text
#3、extract与extract_first:从selector对象中解出内容
#4、属性:xpath的属性加前缀@
#4、嵌套查找
#5、设置默认值
#4、按照属性查找
#5、按照属性模糊查找
#6、正则表达式
#7、xpath相对路径
#8、带变量的xpath

4858.com 9

response.selector.css()
response.selector.xpath()
可简写为
response.css()
response.xpath()

#1 //与/
response.xpath('//body/a/')#
response.css('div a::text')

>>> response.xpath('//body/a') #开头的//代表从整篇文档中寻找,body之后的/代表body的儿子
[]
>>> response.xpath('//body//a') #开头的//代表从整篇文档中寻找,body之后的//代表body的子子孙孙
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="
image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3、extract与extract_first:从selector对象中解出内容
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4、属性:xpath的属性加前缀@
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#4、嵌套查找
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#5、设置默认值
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#4、按照属性查找
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[@href="image3.html"]/text()').extract()

#5、按照属性模糊查找
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#6、正则表达式
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#7、xpath相对路径
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') #这就是从头开始扫描
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpa
th='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#8、带变量的xpath
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #求有5个a标签的div的id
'images'

View Code

七、Items

 

 

 

一、介绍
Scrapy一个开源和搭档的框架,其早期是为了页面抓取 (更贴切来讲, 互联网抓取
)所设计的,使用它能够以飞快、轻松、可扩…

scrapy爬虫框架介绍,scrapy爬虫框架

一 介绍

开卷目录

一 介绍

    Scrapy三个开源和合营的框架,其开始的一段时代是为着页面抓取 (更适合来讲,
互联网抓取
)所设计的,使用它能够以高速、简单、可增添的措施从网址中领取所需的数量。但近来Scrapy的用处丰裕大面积,可用以如数据开采、监测和自动化测量试验等世界,也能够动用在取得API所重回的多少(比如亚马逊 Associates Web 瑟维斯s ) 恐怕通用的网络爬虫。

    Scrapy
是依靠twisted框架开垦而来,twisted是二个盛行的事件驱动的python网络框架。由此Scrapy使用了一种非阻塞(又名异步)的代码来达成产出。全部架构大约如下
4858.com 10

The data flow in Scrapy is controlled by the execution engine, and
goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the
    next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through
    the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a
    Response (with that page) and sends it to the Engine, passing
    through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to
    the Spider for processing, passing through the Spider
    Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new
    Requests (to follow) to the Engine, passing through the Spider
    Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then send
    processed Requests to the Scheduler and asks for possible next
    Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests
    from the Scheduler.

 

Components:

爬虫中间件(Spider Middlewares)
位居EGINE和SPIDE本田CR-VS之间,首要事业是拍卖SPIDE奥德赛S的输入(即responses)和出口(即requests)

scrapy爬虫框架介绍。官网链接:

二 安装


  • 介绍

  • 安装

  • 命令行工具

  • 项目布局以及爬虫应用简要介绍 

  • Spiders

  • Selectors

  • Items
  • 八 Item
    Pipeline

  • Dowloader
    Middeware

  • Spider
    Middleware
  • 十一
    自定义扩大
  • 十二
    settings.py
  • 十三
    爬取亚马逊(亚马逊)商品消息

二 安装

#Windows平台
    1、pip3 install wheel #安装后,便支持通过wheel文件安装软件,wheel文件官网:https://www.lfd.uci.edu/~gohlke/pythonlibs
    3、pip3 install lxml
    4、pip3 install pyopenssl
    5、下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/pywin32/
    6、下载twisted的wheel文件:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    7、执行pip3 install 下载目录\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    8、pip3 install scrapy

#Linux平台
    1、pip3 install scrapy

三 命令行工具

一 介绍

    Scrapy二个开源和搭档的框架,个中期是为了页面抓取 (更合适来讲,
互连网抓取
)所安顿的,使用它可以以便捷、不难、可扩展的主意从网址中领到所需的数量。但眼下Scrapy的用途丰裕大范围,可用来如数据发掘、监测和自动化测试等世界,也足以采用在收获API所再次来到的数额(比方亚马逊(Amazon) Associates Web Services ) 可能通用的网络爬虫。

    Scrapy
是依照twisted框架开垦而来,twisted是七个风靡的事件驱动的python互连网框架。由此Scrapy使用了一种非阻塞(又名异步)的代码来落成产出。全体架构大概如下
4858.com 11

The data flow in Scrapy is controlled by
the execution engine, and goes like this:

  1. The Engine gets
    the initial Requests to crawl from
    the Spider.
  2. The Engine schedules
    the Requests in
    the Scheduler and
    asks for the next Requests to crawl.
  3. The Scheduler returns
    the next Requests to
    the Engine.
  4. The Engine sends
    the Requests to
    the Downloader,
    passing through the Downloader
    Middlewares (see process_request()).
  5. Once the page finishes downloading
    the Downloader generates
    a Response (with that page) and sends it to the Engine, passing
    through the Downloader
    Middlewares (see process_response()).
  6. The Engine receives
    the Response from
    the Downloader and
    sends it to
    the Spider for
    processing, passing through the Spider
    Middleware (see process_spider_input()).
  7. The Spider processes
    the Response and returns scraped items and new Requests (to follow)
    to
    the Engine,
    passing through the Spider
    Middleware (see process_spider_output()).
  8. The Engine sends
    processed items to Item
    Pipelines,
    then send processed Requests to
    the Scheduler and
    asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests
    from
    the Scheduler.

 

Components:

  1. 引擎(EGINE)
    引擎肩负控制种类具备组件之间的数据流,并在好几动作发生时接触事件。有关详细新闻,请参见上面的数据流部分。

  2. 调度器(SCHEDULER)
    用来接受引擎发过来的乞请, 压入队列中, 并在斯特林发动机再一次恳请的时候再次来到.
    能够想像成一个UOdysseyL的优先级队列, 由它来调节下四个要抓取的网站是什么,
    同期去除重复的网站

  3. 下载器(DOWLOADER)
    用于下载网页内容,
    并将网页内容再次回到给EGINE,下载器是起家在twisted那一个高速的异步模型上的
  4. 爬虫(SPIDERS)
    SPIDE安德拉S是开辟职员自定义的类,用来深入分析responses,何况提取items,或然发送新的诉求
  5. 花色管道(ITEM PIPLINES)
    在items被提取后负担管理它们,首要包涵清理、验证、持久化(举个例子存到数据库)等操作
  6. 下载器中间件(Downloader Middlewares)
    座落Scrapy引擎和下载器之间,主要用于管理从EGINE传到DOWLOADEENVISION的呼吁request,已经从DOWNLOADE奥迪Q3传到EGINE的响应response,你可用该中间件做以下几件事

    1. process a request just before it is sent to the Downloader (i.e.
      right before Scrapy sends the request to the website);
    2. change received response before passing it to a spider;
    3. send a new Request instead of passing received response to a
      spider;
    4. pass response to a spider without fetching a web page;
    5. silently drop some requests.
  7. 爬虫中间件(Spider Middlewares)
    身处EGINE和SPIDE奔驰M级S之间,首要职业是管理SPIDE传祺S的输入(即responses)和出口(即requests)

官方网站链接:https://docs.scrapy.org/en/latest/topics/architecture.html

三 命令行工具

#1 查看帮助
    scrapy -h
    scrapy <command> -h

#2 有两种命令:其中Project-only必须切到项目文件夹下才能执行,而Global的命令则不需要
    Global commands:
        startproject #创建项目
        genspider    #创建爬虫程序
        settings     #如果是在项目目录下,则得到的是该项目的配置
        runspider    #运行一个独立的python文件,不必创建项目
        shell        #scrapy shell url地址  在交互式调试,如选择器规则正确与否
        fetch        #独立于程单纯地爬取一个页面,可以拿到请求头
        view         #下载完毕后直接弹出浏览器,以此可以分辨出哪些数据是ajax请求
        version      #scrapy version 查看scrapy的版本,scrapy version -v查看scrapy依赖库的版本
    Project-only commands:
        crawl        #运行爬虫,必须创建项目才行,确保配置文件中ROBOTSTXT_OBEY = False
        check        #检测项目中有无语法错误
        list         #列出项目中所包含的爬虫名
        edit         #编辑器,一般不用
        parse        #scrapy parse url地址 --callback 回调函数  #以此可以验证我们的回调函数是否正确
        bench        #scrapy bentch压力测试

#3 官网链接
    https://docs.scrapy.org/en/latest/topics/commands.html

4858.com 12

#1、执行全局命令:请确保不在某个项目的目录下,排除受该项目配置的影响
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX #如果切换到项目目录下,看到的则是该项目的配置

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com #如果页面显示内容不全,不全的内容则是ajax请求实现的,以此快速定位问题

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version #scrapy的版本

scrapy version -v #依赖库的版本


#2、执行项目命令:切到项目目录下
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench

演示用法


项目结构以及爬虫应用简单介绍

二 安装

4858.com 13

#Windows平台
    1、pip3 install wheel #安装后,便支持通过wheel文件安装软件,wheel文件官网:https://www.lfd.uci.edu/~gohlke/pythonlibs
    3、pip3 install lxml
    4、pip3 install pyopenssl
    5、下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/pywin32/
    6、下载twisted的wheel文件:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    7、执行pip3 install 下载目录\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    8、pip3 install scrapy

#Linux平台
    1、pip3 install scrapy

4858.com 14

四 项目结构以及爬虫应用简要介绍 

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           爬虫1.py
           爬虫2.py
           爬虫3.py

文本表达:

  • scrapy.cfg
     项目标主配置音讯,用来布置scrapy时使用,爬虫相关的配置新闻在settings.py文件中。
  • items.py    设置数据存储模板,用于结构化数据,如:Django的Model
  • pipelines    数据管理作为,如:一般结构化的数码持久化
  • settings.py
    配置文件,如:递归的层数、并发数,延迟下载等。强调:配置文件的选项必得大写不然就是无效,正确写法USEWrangler_AGENT=’xxxx’
  • spiders      爬虫目录,如:创建文件,编写爬虫法规

注意:一般创设爬虫文件时,以网址域名命名

4858.com 15

#在项目目录下新建:entrypoint.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])

默许只好在cmd中实施爬虫,假设想在pycharm中举行要求做
4858.com 16

import sys,os
sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

关于windows编码

五 Spiders

三 命令行工具

4858.com 17

#1 查看帮助
    scrapy -h
    scrapy <command> -h

#2 有两种命令:其中Project-only必须切到项目文件夹下才能执行,而Global的命令则不需要
    Global commands:
        startproject #创建项目
        genspider    #创建爬虫程序
        settings     #如果是在项目目录下,则得到的是该项目的配置
        runspider    #运行一个独立的python文件,不必创建项目
        shell        #scrapy shell url地址  在交互式调试,如选择器规则正确与否
        fetch        #独立于程单纯地爬取一个页面,可以拿到请求头
        view         #下载完毕后直接弹出浏览器,以此可以分辨出哪些数据是ajax请求
        version      #scrapy version 查看scrapy的版本,scrapy version -v查看scrapy依赖库的版本
    Project-only commands:
        crawl        #运行爬虫,必须创建项目才行,确保配置文件中ROBOTSTXT_OBEY = False
        check        #检测项目中有无语法错误
        list         #列出项目中所包含的爬虫名
        edit         #编辑器,一般不用
        parse        #scrapy parse url地址 --callback 回调函数  #以此可以验证我们的回调函数是否正确
        bench        #scrapy bentch压力测试

#3 官网链接
    https://docs.scrapy.org/en/latest/topics/commands.html

4858.com 18

4858.com 194858.com 20

#1、执行全局命令:请确保不在某个项目的目录下,排除受该项目配置的影响
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX #如果切换到项目目录下,看到的则是该项目的配置

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com #如果页面显示内容不全,不全的内容则是ajax请求实现的,以此快速定位问题

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version #scrapy的版本

scrapy version -v #依赖库的版本


#2、执行项目命令:切到项目目录下
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench

示范用法

五 Spiders

1、介绍

#1、Spiders是由一系列类(定义了一个网址或一组网址将被爬取)组成,具体包括如何执行爬取任务并且如何从页面中提取结构化的数据。

#2、换句话说,Spiders是你为了一个特定的网址或一组网址自定义爬取和解析页面行为的地方

2、Spiders会循环做如下事情

#1、生成初始的Requests来爬取第一个URLS,并且标识一个回调函数
第一个请求定义在start_requests()方法内默认从start_urls列表中获得url地址来生成Request请求,默认的回调函数是parse方法。回调函数在下载完成返回response时自动触发

#2、在回调函数中,解析response并且返回值
返回值可以4种:
        包含解析数据的字典
        Item对象
        新的Request对象(新的Requests也需要指定一个回调函数)
        或者是可迭代对象(包含Items或Request)

#3、在回调函数中解析页面内容
通常使用Scrapy自带的Selectors,但很明显你也可以使用Beutifulsoup,lxml或其他你爱用啥用啥。

#4、最后,针对返回的Items对象将会被持久化到数据库
通过Item Pipeline组件存到数据库:https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
或者导出到不同的文件(通过Feed exports:https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

3、Spiders总共提供了五门类:

#1、scrapy.spiders.Spider #scrapy.Spider等同于scrapy.spiders.Spider
#2、scrapy.spiders.CrawlSpider
#3、scrapy.spiders.XMLFeedSpider
#4、scrapy.spiders.CSVFeedSpider
#5、scrapy.spiders.SitemapSpider

4、导入使用

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): #自定义类,继承Spiders提供的基类
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass

5、class scrapy.spiders.Spider

那是最简易的spider类,任何另外的spider类都亟需一而再它(包蕴你和睦定义的)。

此类不提供任何异样的意义,它仅提供了三个默许的start_requests方法私下认可从start_urls中读取url地址发送requests央求,並且暗许parse作为回调函数

class AmazonSpider(scrapy.Spider):
    name = 'amazon' 

    allowed_domains = ['www.amazon.cn'] 

    start_urls = ['http://www.amazon.cn/']

    custom_settings = {
        'BOT_NAME' : 'Egon_Spider_Amazon',
        'REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }

    def parse(self, response):
        pass

4858.com 21

#1、name = 'amazon' 
定义爬虫名,scrapy会根据该值定位爬虫程序
所以它必须要有且必须唯一(In Python 2 this must be ASCII only.)

#2、allowed_domains = ['www.amazon.cn'] 
定义允许爬取的域名,如果OffsiteMiddleware启动(默认就启动),
那么不属于该列表的域名及其子域名都不允许爬取
如果爬取的网址为:https://www.example.com/1.html,那就添加'example.com'到列表.

#3、start_urls = ['http://www.amazon.cn/']
如果没有指定url,就从该列表中读取url来生成第一个请求

#4、custom_settings
值为一个字典,定义一些配置信息,在运行爬虫程序时,这些配置会覆盖项目级别的配置
所以custom_settings必须被定义成一个类属性,由于settings会在类实例化前被加载

#5、settings
通过self.settings['配置项的名字']可以访问settings.py中的配置,如果自己定义了custom_settings还是以自己的为准

#6、logger
日志名默认为spider的名字
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#5、crawler:了解
该属性必须被定义到类方法from_crawler中

#6、from_crawler(crawler, *args, **kwargs):了解
You probably won’t need to override this directly  because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#7、start_requests()
该方法用来发起第一个Requests请求,且必须返回一个可迭代的对象。它在爬虫程序打开时就被Scrapy调用,Scrapy只调用它一次。
默认从start_urls里取出每个url来生成Request(url, dont_filter=True)

#针对参数dont_filter,请看自定义去重规则

如果你想要改变起始爬取的Requests,你就需要覆盖这个方法,例如你想要起始发送一个POST请求,如下
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

#8、parse(response)
这是默认的回调函数,所有的回调函数必须返回an iterable of Request and/or dicts or Item objects.

#9、log(message[, level, component]):了解
Wrapper that sends a log message through the Spider’s logger, kept for backwards compatibility. For more information see Logging from Spiders.

#10、closed(reason)
爬虫程序结束时自动触发

定制scrapy.spider属性与格局详解
4858.com 22

去重规则应该多个爬虫共享的,但凡一个爬虫爬取了,其他都不要爬了,实现方式如下

#方法一:
1、新增类属性
visited=set() #类属性

2、回调函数parse方法内:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url) 

#方法一改进:针对url可能过长,所以我们存放url的hash值
def parse(self, response):
        url=md5(response.request.url)
    if url in self.visited:
        return None
    .......

    self.visited.add(url) 

#方法二:Scrapy自带去重功能
配置文件:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默认的去重规则帮我们去重,去重规则在内存中
DUPEFILTER_DEBUG = False
JOBDIR = "保存范文记录的日志路径,如:/root/"  # 最终路径为 /root/requests.seen,去重规则放文件中

scrapy自带去重规则默认为RFPDupeFilter,只需要我们指定
Request(...,dont_filter=False) ,如果dont_filter=True则告诉Scrapy这个URL不参与去重。

#方法三:
我们也可以仿照RFPDupeFilter自定义去重规则,

from scrapy.dupefilter import RFPDupeFilter,看源码,仿照BaseDupeFilter

#步骤一:在项目目录下自定义去重文件dup.py
class UrlFilter(object):
    def __init__(self):
        self.visited = set() #或者放到数据库

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

#步骤二:配置文件settings.py:
DUPEFILTER_CLASS = '项目名.dup.UrlFilter'


# 源码分析:
from scrapy.core.scheduler import Scheduler
见Scheduler下的enqueue_request方法:self.df.request_seen(request)

去重准绳:去除重复的url
4858.com 23

#例一:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)


#例二:一个回调函数返回多个Requests和Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)


#例三:在start_requests()内直接指定起始爬取的urls,start_urls就没有用了,

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

例子
4858.com 24

我们可能需要在命令行为爬虫程序传递参数,比如传递初始的url,像这样
#命令行执行
scrapy crawl myspider -a category=electronics

#在__init__方法中可以接收外部传进来的参数
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...


#注意接收的参数全都是字符串,如果想要结构化的数据,你需要用类似json.loads的方法

参数字传送递

6、别的通用Spiders:

六 Selectors

四 项目组织以及爬虫应用简要介绍 

4858.com 25

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           爬虫1.py
           爬虫2.py
           爬虫3.py

4858.com 26

文件表明:

  • scrapy.cfg
     项指标主配置音信,用来布局scrapy时利用,爬虫相关的配置音讯在settings.py文件中。
  • items.py    设置数据存款和储蓄模板,用于结构化数据,如:Django的Model
  • pipelines    数据管理作为,如:一般结构化的数量长久化
  • settings.py 配置文件,如:递归的层数、并发数,延迟下载等。重申:配置文件的选项必需大写不然正是无效,正确写法USECR-V_AGENT=’xxxx’
  • spiders      爬虫目录,如:创立文件,编写爬虫法则

静心:一般创制爬虫文件时,以网址域名命名

4858.com 274858.com 28

#在项目目录下新建:entrypoint.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])

暗中认可只可以在cmd中实行爬虫,若是想在pycharm中实施必要做

4858.com 294858.com 30

import sys,os
sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

关于windows编码

六 Selectors

#1 //与/
#2 text
#3、extract与extract_first:从selector对象中解出内容
#4、属性:xpath的属性加前缀@
#4、嵌套查找
#5、设置默认值
#4、按照属性查找
#5、按照属性模糊查找
#6、正则表达式
#7、xpath相对路径
#8、带变量的xpath

4858.com 31

response.selector.css()
response.selector.xpath()
可简写为
response.css()
response.xpath()

#1 //与/
response.xpath('//body/a/')#
response.css('div a::text')

>>> response.xpath('//body/a') #开头的//代表从整篇文档中寻找,body之后的/代表body的儿子
[]
>>> response.xpath('//body//a') #开头的//代表从整篇文档中寻找,body之后的//代表body的子子孙孙
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="
image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3、extract与extract_first:从selector对象中解出内容
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4、属性:xpath的属性加前缀@
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#4、嵌套查找
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#5、设置默认值
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#4、按照属性查找
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[@href="image3.html"]/text()').extract()

#5、按照属性模糊查找
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#6、正则表达式
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#7、xpath相对路径
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') #这就是从头开始扫描
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpa
th='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#8、带变量的xpath
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #求有5个a标签的div的id
'images'

View Code

七 Items

五 Spiders

1、介绍

#1、Spiders是由一系列类(定义了一个网址或一组网址将被爬取)组成,具体包括如何执行爬取任务并且如何从页面中提取结构化的数据。

#2、换句话说,Spiders是你为了一个特定的网址或一组网址自定义爬取和解析页面行为的地方

2、Spiders会循环做如下事情

4858.com 32

#1、生成初始的Requests来爬取第一个URLS,并且标识一个回调函数
第一个请求定义在start_requests()方法内默认从start_urls列表中获得url地址来生成Request请求,默认的回调函数是parse方法。回调函数在下载完成返回response时自动触发

#2、在回调函数中,解析response并且返回值
返回值可以4种:
        包含解析数据的字典
        Item对象
        新的Request对象(新的Requests也需要指定一个回调函数)
        或者是可迭代对象(包含Items或Request)

#3、在回调函数中解析页面内容
通常使用Scrapy自带的Selectors,但很明显你也可以使用Beutifulsoup,lxml或其他你爱用啥用啥。

#4、最后,针对返回的Items对象将会被持久化到数据库
通过Item Pipeline组件存到数据库:https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
或者导出到不同的文件(通过Feed exports:https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

4858.com 33

3、Spiders总共提供了五档案的次序:

#1、scrapy.spiders.Spider #scrapy.Spider等同于scrapy.spiders.Spider
#2、scrapy.spiders.CrawlSpider
#3、scrapy.spiders.XMLFeedSpider
#4、scrapy.spiders.CSVFeedSpider
#5、scrapy.spiders.SitemapSpider

4、导入使用

4858.com 34

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): #自定义类,继承Spiders提供的基类
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass

4858.com 35

5、class scrapy.spiders.Spider

那是最简易的spider类,任何别的的spider类都亟需三番五次它(富含你自个儿定义的)。

此类不提供别的至极的效劳,它仅提供了一个暗中认可的start_requests方法暗许从start_urls中读取url地址发送requests诉求,并且默许parse作为回调函数

4858.com 36

class AmazonSpider(scrapy.Spider):
    name = 'amazon' 

    allowed_domains = ['www.amazon.cn'] 

    start_urls = ['http://www.amazon.cn/']

    custom_settings = {
        'BOT_NAME' : 'Egon_Spider_Amazon',
        'REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }

    def parse(self, response):
        pass

4858.com 37

4858.com 384858.com 39

#1、name = 'amazon' 
定义爬虫名,scrapy会根据该值定位爬虫程序
所以它必须要有且必须唯一(In Python 2 this must be ASCII only.)

#2、allowed_domains = ['www.amazon.cn'] 
定义允许爬取的域名,如果OffsiteMiddleware启动(默认就启动),
那么不属于该列表的域名及其子域名都不允许爬取
如果爬取的网址为:https://www.example.com/1.html,那就添加'example.com'到列表.

#3、start_urls = ['http://www.amazon.cn/']
如果没有指定url,就从该列表中读取url来生成第一个请求

#4、custom_settings
值为一个字典,定义一些配置信息,在运行爬虫程序时,这些配置会覆盖项目级别的配置
所以custom_settings必须被定义成一个类属性,由于settings会在类实例化前被加载

#5、settings
通过self.settings['配置项的名字']可以访问settings.py中的配置,如果自己定义了custom_settings还是以自己的为准

#6、logger
日志名默认为spider的名字
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#5、crawler:了解
该属性必须被定义到类方法from_crawler中

#6、from_crawler(crawler, *args, **kwargs):了解
You probably won’t need to override this directly  because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#7、start_requests()
该方法用来发起第一个Requests请求,且必须返回一个可迭代的对象。它在爬虫程序打开时就被Scrapy调用,Scrapy只调用它一次。
默认从start_urls里取出每个url来生成Request(url, dont_filter=True)

#针对参数dont_filter,请看自定义去重规则

如果你想要改变起始爬取的Requests,你就需要覆盖这个方法,例如你想要起始发送一个POST请求,如下
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

#8、parse(response)
这是默认的回调函数,所有的回调函数必须返回an iterable of Request and/or dicts or Item objects.

#9、log(message[, level, component]):了解
Wrapper that sends a log message through the Spider’s logger, kept for backwards compatibility. For more information see Logging from Spiders.

#10、closed(reason)
爬虫程序结束时自动触发

定制scrapy.spider属性与艺术详解

4858.com 404858.com 41

去重规则应该多个爬虫共享的,但凡一个爬虫爬取了,其他都不要爬了,实现方式如下

#方法一:
1、新增类属性
visited=set() #类属性

2、回调函数parse方法内:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url) 

#方法一改进:针对url可能过长,所以我们存放url的hash值
def parse(self, response):
        url=md5(response.request.url)
    if url in self.visited:
        return None
    .......

    self.visited.add(url) 

#方法二:Scrapy自带去重功能
配置文件:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默认的去重规则帮我们去重,去重规则在内存中
DUPEFILTER_DEBUG = False
JOBDIR = "保存范文记录的日志路径,如:/root/"  # 最终路径为 /root/requests.seen,去重规则放文件中

scrapy自带去重规则默认为RFPDupeFilter,只需要我们指定
Request(...,dont_filter=False) ,如果dont_filter=True则告诉Scrapy这个URL不参与去重。

#方法三:
我们也可以仿照RFPDupeFilter自定义去重规则,

from scrapy.dupefilter import RFPDupeFilter,看源码,仿照BaseDupeFilter

#步骤一:在项目目录下自定义去重文件dup.py
class UrlFilter(object):
    def __init__(self):
        self.visited = set() #或者放到数据库

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

#步骤二:配置文件settings.py:
DUPEFILTER_CLASS = '项目名.dup.UrlFilter'


# 源码分析:
from scrapy.core.scheduler import Scheduler
见Scheduler下的enqueue_request方法:self.df.request_seen(request)

去重法则:去除重复的url

4858.com 424858.com 43

#例一:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)


#例二:一个回调函数返回多个Requests和Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)


#例三:在start_requests()内直接指定起始爬取的urls,start_urls就没有用了,

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

例子

4858.com 444858.com 45

我们可能需要在命令行为爬虫程序传递参数,比如传递初始的url,像这样
#命令行执行
scrapy crawl myspider -a category=electronics

#在__init__方法中可以接收外部传进来的参数
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...


#注意接收的参数全都是字符串,如果想要结构化的数据,你需要用类似json.loads的方法

参数字传送递

6、其他通用Spiders:https://docs.scrapy.org/en/latest/topics/spiders.html\#generic-spiders

七 Items

八 Item
Pipeline

六 Selectors

4858.com 46

#1 //与/
#2 text
#3、extract与extract_first:从selector对象中解出内容
#4、属性:xpath的属性加前缀@
#4、嵌套查找
#5、设置默认值
#4、按照属性查找
#5、按照属性模糊查找
#6、正则表达式
#7、xpath相对路径
#8、带变量的xpath

4858.com 47

4858.com 484858.com 49

response.selector.css()
response.selector.xpath()
可简写为
response.css()
response.xpath()

#1 //与/
response.xpath('//body/a/')#
response.css('div a::text')

>>> response.xpath('//body/a') #开头的//代表从整篇文档中寻找,body之后的/代表body的儿子
[]
>>> response.xpath('//body//a') #开头的//代表从整篇文档中寻找,body之后的//代表body的子子孙孙
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="
image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3、extract与extract_first:从selector对象中解出内容
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4、属性:xpath的属性加前缀@
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#4、嵌套查找
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#5、设置默认值
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#4、按照属性查找
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[@href="image3.html"]/text()').extract()

#5、按照属性模糊查找
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#6、正则表达式
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#7、xpath相对路径
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') #这就是从头开始扫描
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpa
th='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#8、带变量的xpath
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #求有5个a标签的div的id
'images'

View Code

八 Item Pipeline

4858.com 50

#一:可以写多个Pipeline类
#1、如果优先级高的Pipeline的process_item返回一个值或者None,会自动传给下一个pipline的process_item,
#2、如果只想让第一个Pipeline执行,那得让第一个pipline的process_item抛出异常raise DropItem()

#3、可以用spider.name == '爬虫名' 来控制哪些爬虫用哪些pipeline

二:示范
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        print('000000')

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        print('111111')


    def process_item(self, item, spider):
        # 操作并进行持久化

        # return表示会被后续的pipeline继续处理
        return item

        # 表示将item丢弃,不会被后续pipeline处理
        # raise DropItem()

自定义pipeline
4858.com 51

#1、settings.py
HOST="127.0.0.1"
PORT=27017
USER="root"
PWD="123"
DB="amazon"
TABLE="goods"



ITEM_PIPELINES = {
   'Amazon.pipelines.CustomPipeline': 200,
}

#2、pipelines.py
class CustomPipeline(object):
    def __init__(self,host,port,user,pwd,db,table):
        self.host=host
        self.port=port
        self.user=user
        self.pwd=pwd
        self.db=db
        self.table=table

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST,PORT,USER,PWD,DB,TABLE)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        self.client = MongoClient('mongodb://%s:%[email protected]%s:%s' %(self.user,self.pwd,self.host,self.port))

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        self.client.close()


    def process_item(self, item, spider):
        # 操作并进行持久化

        self.client[self.db][self.table].save(dict(item))

示范

九 Dowloader
Middeware

七 Items

九 Dowloader Middeware

下载中间件的用途
    1、在process——request内,自定义下载,不用scrapy的下载
    2、对请求进行二次加工,比如
        设置请求头
        设置cookie
        添加代理
            scrapy自带的代理组件:
                from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
                from urllib.request import getproxies

4858.com 52

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request: 
        :param spider: 
        :return:  
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        pass



    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return: 
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        return None

下载器中间件
4858.com 53

#1、与middlewares.py同级目录下新建proxy_handle.py
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))



#2、middlewares.py
from Amazon.proxy_handle import get_proxy,delete_proxy

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request:
        :param spider:
        :return:
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        proxy="http://" + get_proxy()
        request.meta['download_timeout']=20
        request.meta["proxy"] = proxy
        print('为%s 添加代理%s ' % (request.url, proxy),end='')
        print('元数据为',request.meta)

    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return:
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('返回状态吗',response.status)
        return response


    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return:
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        print('代理%s,访问%s出现异常:%s' %(request.meta['proxy'],request.url,exception))
        import time
        time.sleep(5)
        delete_proxy(request.meta['proxy'].split("//")[-1])
        request.meta['proxy']='http://'+get_proxy()

        return request

配备代理

十 Spider
Middleware

八 Item Pipeline

4858.com 544858.com 55

#一:可以写多个Pipeline类
#1、如果优先级高的Pipeline的process_item返回一个值或者None,会自动传给下一个pipline的process_item,
#2、如果只想让第一个Pipeline执行,那得让第一个pipline的process_item抛出异常raise DropItem()

#3、可以用spider.name == '爬虫名' 来控制哪些爬虫用哪些pipeline

二:示范
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        print('000000')

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        print('111111')


    def process_item(self, item, spider):
        # 操作并进行持久化

        # return表示会被后续的pipeline继续处理
        return item

        # 表示将item丢弃,不会被后续pipeline处理
        # raise DropItem()

自定义pipeline

4858.com 564858.com 57

#1、settings.py
HOST="127.0.0.1"
PORT=27017
USER="root"
PWD="123"
DB="amazon"
TABLE="goods"



ITEM_PIPELINES = {
   'Amazon.pipelines.CustomPipeline': 200,
}

#2、pipelines.py
class CustomPipeline(object):
    def __init__(self,host,port,user,pwd,db,table):
        self.host=host
        self.port=port
        self.user=user
        self.pwd=pwd
        self.db=db
        self.table=table

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST,PORT,USER,PWD,DB,TABLE)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        self.client = MongoClient('mongodb://%s:%s@%s:%s' %(self.user,self.pwd,self.host,self.port))

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        self.client.close()


    def process_item(self, item, spider):
        # 操作并进行持久化

        self.client[self.db][self.table].save(dict(item))

示范

十 Spider Middleware

4858.com 58

class SpiderMiddleware(object):

    def process_spider_input(self,response, spider):
        """
        下载完成,执行,然后交给parse处理
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self,response, result, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
        """
        return result

    def process_spider_exception(self,response, exception, spider):
        """
        异常调用
        :param response:
        :param exception:
        :param spider:
        :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
        """
        return None


    def process_start_requests(self,start_requests, spider):
        """
        爬虫启动时调用
        :param start_requests:
        :param spider:
        :return: 包含 Request 对象的可迭代对象
        """
        return start_requests

爬虫中间件

十一
settings.py

九 Dowloader Middeware

4858.com 59

下载中间件的用途
    1、在process——request内,自定义下载,不用scrapy的下载
    2、对请求进行二次加工,比如
        设置请求头
        设置cookie
        添加代理
            scrapy自带的代理组件:
                from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
                from urllib.request import getproxies

4858.com 60

4858.com 614858.com 62

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request: 
        :param spider: 
        :return:  
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        pass



    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return: 
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        return None

下载器中间件

4858.com 634858.com 64

#1、与middlewares.py同级目录下新建proxy_handle.py
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))



#2、middlewares.py
from Amazon.proxy_handle import get_proxy,delete_proxy

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request:
        :param spider:
        :return:
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        proxy="http://" + get_proxy()
        request.meta['download_timeout']=20
        request.meta["proxy"] = proxy
        print('为%s 添加代理%s ' % (request.url, proxy),end='')
        print('元数据为',request.meta)

    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return:
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('返回状态吗',response.status)
        return response


    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return:
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        print('代理%s,访问%s出现异常:%s' %(request.meta['proxy'],request.url,exception))
        import time
        time.sleep(5)
        delete_proxy(request.meta['proxy'].split("//")[-1])
        request.meta['proxy']='http://'+get_proxy()

        return request

布署代理

十一 自定义扩大

自定义扩展(与django的信号类似)
    1、django的信号是django是预留的扩展,信号一旦被触发,相应的功能就会执行
    2、scrapy自定义扩展的好处是可以在任意我们想要的位置添加功能,而其他组件中提供的功能只能在规定的位置执行

4858.com 65

#1、在与settings同级目录下新建一个文件,文件名可以为extentions.py,内容如下
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        obj = cls(val)

        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)

        return obj

    def spider_opened(self, spider):
        print('=============>open')

    def spider_closed(self, spider):
        print('=============>close')

#2、配置生效
EXTENSIONS = {
    "Amazon.extentions.MyExtension":200
}

View Code

十二
爬取亚马逊商品新闻

十 Spider Middleware

4858.com 664858.com 67

class SpiderMiddleware(object):

    def process_spider_input(self,response, spider):
        """
        下载完成,执行,然后交给parse处理
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self,response, result, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
        """
        return result

    def process_spider_exception(self,response, exception, spider):
        """
        异常调用:在process_spider_input抛出异常时调用
        :param response:
        :param exception:
        :param spider:
        :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
        """
        return None


    def process_start_requests(self,start_requests, spider):
        """
        爬虫启动时调用
        :param start_requests:
        :param spider:
        :return: 包含 Request 对象的可迭代对象
        """
        return start_requests

爬虫中间件

十二 settings.py

4858.com 68

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. 爬虫名称
BOT_NAME = 'step8_king'

# 2. 爬虫应用路径
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. 客户端 user-agent请求头
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. 禁止爬虫配置
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. 并发请求数
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. 延迟下载秒数
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. 是否支持cookie,cookiejar进行操作cookie
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. Telnet用于查看当前爬虫的信息,操作爬虫等...
#    使用telnet ip port ,然后通过命令操作
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. 默认请求头
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. 定义pipeline处理请求
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. 自定义扩展,基于信号进行调用
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度
# DEPTH_LIMIT = 3

# 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo

# 后进先出,深度优先
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# 先进先出,广度优先

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. 调度器队列
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. 访问URL去重
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
17. 自动限速算法
    from scrapy.contrib.throttle import AutoThrottle
    自动限速设置
    1. 获取最小延迟 DOWNLOAD_DELAY
    2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
    3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
    4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
    5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# 开始自动限速
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# 初始下载延迟
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# 最大下载延迟
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# 平均每秒并发数
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# 是否显示
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. 启用缓存
    目的用于将已经发送的请求或相应缓存下来,以便以后使用

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# 是否启用缓存策略
# HTTPCACHE_ENABLED = True

# 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# 缓存超时时间
# HTTPCACHE_EXPIRATION_SECS = 0

# 缓存保存路径
# HTTPCACHE_DIR = 'httpcache'

# 缓存忽略的Http状态码
# HTTPCACHE_IGNORE_HTTP_CODES = []

# 缓存存储的插件
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. 代理,需要在环境变量中设置
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    方式一:使用默认
        os.environ
        {
            http_proxy:http://root:[email protected]:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    方式二:使用自定义下载中间件

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }

"""

"""
20. Https访问
    Https访问时有两种情况:
    1. 要爬取网站使用的可信任证书(默认支持)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. 要爬取网站使用的自定义证书
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # pKey对象
                    certificate=v2,  # X509对象
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    其他:
        相关类
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        相关配置
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""



"""
21. 爬虫中间件
    class SpiderMiddleware(object):

        def process_spider_input(self,response, spider):
            '''
            下载完成,执行,然后交给parse处理
            :param response: 
            :param spider: 
            :return: 
            '''
            pass

        def process_spider_output(self,response, result, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
            '''
            return result

        def process_spider_exception(self,response, exception, spider):
            '''
            异常调用
            :param response:
            :param exception:
            :param spider:
            :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
            '''
            return None


        def process_start_requests(self,start_requests, spider):
            '''
            爬虫启动时调用
            :param start_requests:
            :param spider:
            :return: 包含 Request 对象的可迭代对象
            '''
            return start_requests

    内置爬虫中间件:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}


"""
22. 下载中间件
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            请求需要被下载时,经过所有下载器中间件的process_request调用
            :param request:
            :param spider:
            :return:
                None,继续后续中间件去下载;
                Response对象,停止process_request的执行,开始执行process_response
                Request对象,停止中间件的执行,将Request重新调度器
                raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
            '''
            pass



        def process_response(self, request, response, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return:
                Response 对象:转交给其他中间件process_response
                Request 对象:停止中间件,request会被重新调度下载
                raise IgnoreRequest 异常:调用Request.errback
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
            :param response:
            :param exception:
            :param spider:
            :return:
                None:继续交给后续中间件处理异常;
                Response对象:停止后续process_exception方法
                Request对象:停止中间件,request将会被重新调用下载
            '''
            return None


    默认下载中间件
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

settings.py

一 介绍
Scrapy贰个开源和搭档的框架,其刚开始阶段是为了页面抓取 (更适用来讲, 网络抓取
)所安顿的,使用它…

一介绍

十一 自定义扩充

自定义扩展(与django的信号类似)
    1、django的信号是django是预留的扩展,信号一旦被触发,相应的功能就会执行
    2、scrapy自定义扩展的好处是可以在任意我们想要的位置添加功能,而其他组件中提供的功能只能在规定的位置执行

4858.com 694858.com 70

#1、在与settings同级目录下新建一个文件,文件名可以为extentions.py,内容如下
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        obj = cls(val)

        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)

        return obj

    def spider_opened(self, spider):
        print('=============>open')

    def spider_closed(self, spider):
        print('=============>close')

#2、配置生效
EXTENSIONS = {
    "Amazon.extentions.MyExtension":200
}

View Code

Scrapy二个开源和同盟的框架,其开始时代是为了页面抓取 (更伏贴来说, 互连网抓取
)所设计的,使用它能够以便捷、简单、可扩张的不二法门从网址中提取所需的多少。但当下Scrapy的用处丰硕普及,可用以如数据开采、监测和自动化测验等世界,也得以动用在获取API所重回的数量(举例亚马逊 Associates Web Services ) 可能通用的网络爬虫。

十二 settings.py

4858.com 714858.com 72

#==>第一部分:基本配置<===
#1、项目名称,默认的USER_AGENT由它来构成,也作为日志记录的日志名
BOT_NAME = 'Amazon'

#2、爬虫应用路径
SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders'

#3、客户端User-Agent请求头
#USER_AGENT = 'Amazon (+http://www.yourdomain.com)'

#4、是否遵循爬虫协议
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

#5、是否支持cookie,cookiejar进行操作cookie,默认开启
#COOKIES_ENABLED = False

#6、Telnet用于查看当前爬虫的信息,操作爬虫等...使用telnet ip port ,然后通过命令操作
#TELNETCONSOLE_ENABLED = False
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]

#7、Scrapy发送HTTP请求默认使用的请求头
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}



#===>第二部分:并发与延迟<===
#1、下载器总共最大处理的并发请求数,默认值16
#CONCURRENT_REQUESTS = 32

#2、每个域名能够被执行的最大并发请求数目,默认值8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#3、能够被单个IP处理的并发请求数,默认值0,代表无限制,需要注意两点
#I、如果不为零,那CONCURRENT_REQUESTS_PER_DOMAIN将被忽略,即并发数的限制是按照每个IP来计算,而不是每个域名
#II、该设置也影响DOWNLOAD_DELAY,如果该值不为零,那么DOWNLOAD_DELAY下载延迟是限制每个IP而不是每个域
#CONCURRENT_REQUESTS_PER_IP = 16

#4、如果没有开启智能限速,这个值就代表一个规定死的值,代表对同一网址延迟请求的秒数
#DOWNLOAD_DELAY = 3


#===>第三部分:智能限速/自动节流:AutoThrottle extension<===
#一:介绍
from scrapy.contrib.throttle import AutoThrottle #http://scrapy.readthedocs.io/en/latest/topics/autothrottle.html#topics-autothrottle
设置目标:
1、比使用默认的下载延迟对站点更好
2、自动调整scrapy到最佳的爬取速度,所以用户无需自己调整下载延迟到最佳状态。用户只需要定义允许最大并发的请求,剩下的事情由该扩展组件自动完成


#二:如何实现?
在Scrapy中,下载延迟是通过计算建立TCP连接到接收到HTTP包头(header)之间的时间来测量的。
注意,由于Scrapy可能在忙着处理spider的回调函数或者无法下载,因此在合作的多任务环境下准确测量这些延迟是十分苦难的。 不过,这些延迟仍然是对Scrapy(甚至是服务器)繁忙程度的合理测量,而这扩展就是以此为前提进行编写的。


#三:限速算法
自动限速算法基于以下规则调整下载延迟
#1、spiders开始时的下载延迟是基于AUTOTHROTTLE_START_DELAY的值
#2、当收到一个response,对目标站点的下载延迟=收到响应的延迟时间/AUTOTHROTTLE_TARGET_CONCURRENCY
#3、下一次请求的下载延迟就被设置成:对目标站点下载延迟时间和过去的下载延迟时间的平均值
#4、没有达到200个response则不允许降低延迟
#5、下载延迟不能变的比DOWNLOAD_DELAY更低或者比AUTOTHROTTLE_MAX_DELAY更高

#四:配置使用
#开启True,默认False
AUTOTHROTTLE_ENABLED = True
#起始的延迟
AUTOTHROTTLE_START_DELAY = 5
#最小延迟
DOWNLOAD_DELAY = 3
#最大延迟
AUTOTHROTTLE_MAX_DELAY = 10
#每秒并发请求数的平均值,不能高于 CONCURRENT_REQUESTS_PER_DOMAIN或CONCURRENT_REQUESTS_PER_IP,调高了则吞吐量增大强奸目标站点,调低了则对目标站点更加”礼貌“
#每个特定的时间点,scrapy并发请求的数目都可能高于或低于该值,这是爬虫视图达到的建议值而不是硬限制
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
#调试
AUTOTHROTTLE_DEBUG = True
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16



#===>第四部分:爬取深度与爬取方式<===
#1、爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度
# DEPTH_LIMIT = 3

#2、爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo

# 后进先出,深度优先
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# 先进先出,广度优先

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'


#3、调度器队列
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

#4、访问URL去重
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'



#===>第五部分:中间件、Pipelines、扩展<===
#1、Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#}

#2、Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'Amazon.middlewares.DownMiddleware1': 543,
}

#3、Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

#4、Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'Amazon.pipelines.CustomPipeline': 200,
}



#===>第六部分:缓存<===
"""
1. 启用缓存
    目的用于将已经发送的请求或相应缓存下来,以便以后使用

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# 是否启用缓存策略
# HTTPCACHE_ENABLED = True

# 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# 缓存超时时间
# HTTPCACHE_EXPIRATION_SECS = 0

# 缓存保存路径
# HTTPCACHE_DIR = 'httpcache'

# 缓存忽略的Http状态码
# HTTPCACHE_IGNORE_HTTP_CODES = []

# 缓存存储的插件
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


#===>第七部分:线程池<===
REACTOR_THREADPOOL_MAXSIZE = 10

#Default: 10
#scrapy基于twisted异步IO框架,downloader是多线程的,线程数是Twisted线程池的默认大小(The maximum limit for Twisted Reactor thread pool size.)

#关于twisted线程池:
http://twistedmatrix.com/documents/10.1.0/core/howto/threading.html

#线程池实现:twisted.python.threadpool.ThreadPool
twisted调整线程池大小:
from twisted.internet import reactor
reactor.suggestThreadPoolSize(30)

#scrapy相关源码:
D:\python3.6\Lib\site-packages\scrapy\crawler.py

#补充:
windows下查看进程内线程数的工具:
    https://docs.microsoft.com/zh-cn/sysinternals/downloads/pslist
    或
    https://pan.baidu.com/s/1jJ0pMaM

    命令为:
    pslist |findstr python

linux下:top -p 进程id


#===>第八部分:其他默认配置参考<===
D:\python3.6\Lib\site-packages\scrapy\settings\default_settings.py

settings.py

    Scrapy
是基于twisted框架开荒而来,twisted是两个风行的事件驱动的python互连网框架。因而Scrapy使用了一种非阻塞(又名异步)的代码来达成产出。全部架构大概如下:

十三 爬取亚马逊(亚马逊(Amazon))商品消息

链接:https://pan.baidu.com/s/1eSWBDh4

4858.com 73

The data flow in Scrapy is controlled by the execution engine, and
goes like this:

  1. The Engine gets
    the initial Requests to crawl from
    the Spider.
  2. The Engine schedules
    the Requests in
    the Scheduler and
    asks for the next Requests to crawl.
  3. The Scheduler returns
    the next Requests to
    the Engine.
  4. The Engine sends
    the Requests to
    the Downloader,
    passing through the Downloader
    Middlewares (see process_request()).
  5. Once the page finishes downloading
    the Downloader generates
    a Response (with that page) and sends it to the Engine, passing
    through the Downloader
    Middlewares (see process_response()).
  6. The Engine receives
    the Response from
    the Downloader and
    sends it to
    the Spider for
    processing, passing through the Spider
    Middleware (see process_spider_input()).
  7. The Spider processes
    the Response and returns scraped items and new Requests (to follow)
    to
    the Engine,
    passing through the Spider
    Middleware (see process_spider_output()).
  8. The Engine sends
    processed items to Item
    Pipelines,
    then send processed Requests to
    the Scheduler and
    asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests
    from
    the Scheduler.

 

 

Components:

引擎(EGINE)
引擎负责控制系统所有组件之间的数据流,并在某些动作发生时触发事件。有关详细信息,请参见上面的数据流部分。

调度器(SCHEDULER)
用来接受引擎发过来的请求, 压入队列中, 并在引擎再次请求的时候返回. 可以想像成一个URL的优先级队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址
下载器(DOWLOADER)
用于下载网页内容, 并将网页内容返回给EGINE,下载器是建立在twisted这个高效的异步模型上的
爬虫(SPIDERS)
SPIDERS是开发人员自定义的类,用来解析responses,并且提取items,或者发送新的请求
项目管道(ITEM PIPLINES)
在items被提取后负责处理它们,主要包括清理、验证、持久化(比如存到数据库)等操作
下载器中间件(Downloader Middlewares)
位于Scrapy引擎和下载器之间,主要用来处理从EGINE传到DOWLOADER的请求request,已经从DOWNLOADER传到EGINE的响应response,你可用该中间件做以下几件事
process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
change received response before passing it to a spider;
send a new Request instead of passing received response to a spider;
pass response to a spider without fetching a web page;
silently drop some requests.
爬虫中间件(Spider Middlewares)
位于EGINE和SPIDERS之间,主要工作是处理SPIDERS的输入(即responses)和输出(即requests)

官方网站链接:https://docs.scrapy.org/en/latest/topics/architecture.html

二安装

#Windows平台
    1、pip3 install wheel #安装后,便支持通过wheel文件安装软件,wheel文件官网:https://www.lfd.uci.edu/~gohlke/pythonlibs
    3、pip3 install lxml
    4、pip3 install pyopenssl
    5、下载并安装pywin32:https://sourceforge.net/projects/pywin32/files/pywin32/
    6、下载twisted的wheel文件:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    7、执行pip3 install 下载目录\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    8、pip3 install scrapy

#Linux平台
    1、pip3 install scrapy

 

三 命令行工具

#1 查看帮助
    scrapy -h
    scrapy <command> -h

#2 有两种命令:其中Project-only必须切到项目文件夹下才能执行,而Global的命令则不需要
    Global commands:
        startproject #创建项目
        genspider    #创建爬虫程序
        settings     #如果是在项目目录下,则得到的是该项目的配置
        runspider    #运行一个独立的python文件,不必创建项目
        shell        #scrapy shell url地址  在交互式调试,如选择器规则正确与否
        fetch        #独立于程单纯地爬取一个页面,可以拿到请求头
        view         #下载完毕后直接弹出浏览器,以此可以分辨出哪些数据是ajax请求
        version      #scrapy version 查看scrapy的版本,scrapy version -v查看scrapy依赖库的版本
    Project-only commands:
        crawl        #运行爬虫,必须创建项目才行,确保配置文件中ROBOTSTXT_OBEY = False
        check        #检测项目中有无语法错误
        list         #列出项目中所包含的爬虫名
        edit         #编辑器,一般不用
        parse        #scrapy parse url地址 --callback 回调函数  #以此可以验证我们的回调函数是否正确
        bench        #scrapy bentch压力测试

#3 官网链接
    https://docs.scrapy.org/en/latest/topics/commands.html

4858.com 744858.com 75

#1、执行全局命令:请确保不在某个项目的目录下,排除受该项目配置的影响
scrapy startproject MyProject

cd MyProject
scrapy genspider baidu www.baidu.com

scrapy settings --get XXX #如果切换到项目目录下,看到的则是该项目的配置

scrapy runspider baidu.py

scrapy shell https://www.baidu.com
    response
    response.status
    response.body
    view(response)

scrapy view https://www.taobao.com #如果页面显示内容不全,不全的内容则是ajax请求实现的,以此快速定位问题

scrapy fetch --nolog --headers https://www.taobao.com

scrapy version #scrapy的版本

scrapy version -v #依赖库的版本


#2、执行项目命令:切到项目目录下
scrapy crawl baidu
scrapy check
scrapy list
scrapy parse http://quotes.toscrape.com/ --callback parse
scrapy bench

示范用法

四系列布局以及爬虫应用简介

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           爬虫1.py
           爬虫2.py
           爬虫3.py

文本表达:

  • scrapy.cfg
     项目标主配置音讯,用来配置scrapy时利用,爬虫相关的安插信息在settings.py文件中。
  • items.py    设置数据存款和储蓄模板,用于结构化数据,如:Django的Model
  • pipelines    数据管理作为,如:一般结构化的多少持久化
  • settings.py
    配置文件,如:递归的层数、并发数,延迟下载等。重申:配置文件的选项必得大写不然视为无效,精确写法USEEvoque_AGENT=’xxxx’
  • spiders      爬虫目录,如:创立文件,编写爬虫法则

留神:一般成立爬虫文件时,以网址域名命名

4858.com 764858.com 77

#在项目目录下新建:entrypoint.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xiaohua'])

只好在cmd中进行爬虫,假若不想就依照上边包车型地铁陈设就行啊

4858.com 784858.com 79

import sys,os
sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

至于windows编码难点

五Spiders

1、介绍

#1、Spiders是由一系列类(定义了一个网址或一组网址将被爬取)组成,具体包括如何执行爬取任务并且如何从页面中提取结构化的数据。

#2、换句话说,Spiders是你为了一个特定的网址或一组网址自定义爬取和解析页面行为的地方

2、Spiders会循环做如下事情

#1、生成初始的Requests来爬取第一个URLS,并且标识一个回调函数
第一个请求定义在start_requests()方法内默认从start_urls列表中获得url地址来生成Request请求,默认的回调函数是parse方法。回调函数在下载完成返回response时自动触发

#2、在回调函数中,解析response并且返回值
返回值可以4种:
        包含解析数据的字典
        Item对象
        新的Request对象(新的Requests也需要指定一个回调函数)
        或者是可迭代对象(包含Items或Request)

#3、在回调函数中解析页面内容
通常使用Scrapy自带的Selectors,但很明显你也可以使用Beutifulsoup,lxml或其他你爱用啥用啥。

#4、最后,针对返回的Items对象将会被持久化到数据库
通过Item Pipeline组件存到数据库:https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
或者导出到不同的文件(通过Feed exports:https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

4858.com ,3、Spiders总共提供了五档案的次序:

#1、scrapy.spiders.Spider #scrapy.Spider等同于scrapy.spiders.Spider
#2、scrapy.spiders.CrawlSpider
#3、scrapy.spiders.XMLFeedSpider
#4、scrapy.spiders.CSVFeedSpider
#5、scrapy.spiders.SitemapSpider

4、导入使用

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider,CrawlSpider,XMLFeedSpider,CSVFeedSpider,SitemapSpider

class AmazonSpider(scrapy.Spider): #自定义类,继承Spiders提供的基类
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass

5、class scrapy.spiders.Spider

那是最轻易易行的spider类,任何其余的spider类都要求继续它(包含你本身定义的)。

该类不提供其余非常的效劳,它仅提供了三个暗许的start_requests方法暗中同意从start_urls中读取url地址发送requests央浼,而且暗中认可parse作为回调函数

class AmazonSpider(scrapy.Spider):
    name = 'amazon' 

    allowed_domains = ['www.amazon.cn'] 

    start_urls = ['http://www.amazon.cn/']

    custom_settings = {
        'BOT_NAME' : 'Egon_Spider_Amazon',
        'REQUEST_HEADERS' : {
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'en',
        }
    }

    def parse(self, response):
        pass

4858.com 804858.com 81

#1、name = 'amazon' 
定义爬虫名,scrapy会根据该值定位爬虫程序
所以它必须要有且必须唯一(In Python 2 this must be ASCII only.)

#2、allowed_domains = ['www.amazon.cn'] 
定义允许爬取的域名,如果OffsiteMiddleware启动(默认就启动),
那么不属于该列表的域名及其子域名都不允许爬取
如果爬取的网址为:https://www.example.com/1.html,那就添加'example.com'到列表.

#3、start_urls = ['http://www.amazon.cn/']
如果没有指定url,就从该列表中读取url来生成第一个请求

#4、custom_settings
值为一个字典,定义一些配置信息,在运行爬虫程序时,这些配置会覆盖项目级别的配置
所以custom_settings必须被定义成一个类属性,由于settings会在类实例化前被加载

#5、settings
通过self.settings['配置项的名字']可以访问settings.py中的配置,如果自己定义了custom_settings还是以自己的为准

#6、logger
日志名默认为spider的名字
self.logger.debug('=============>%s' %self.settings['BOT_NAME'])

#5、crawler:了解
该属性必须被定义到类方法from_crawler中

#6、from_crawler(crawler, *args, **kwargs):了解
You probably won’t need to override this directly  because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

#7、start_requests()
该方法用来发起第一个Requests请求,且必须返回一个可迭代的对象。它在爬虫程序打开时就被Scrapy调用,Scrapy只调用它一次。
默认从start_urls里取出每个url来生成Request(url, dont_filter=True)

#针对参数dont_filter,请看自定义去重规则

如果你想要改变起始爬取的Requests,你就需要覆盖这个方法,例如你想要起始发送一个POST请求,如下
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

#8、parse(response)
这是默认的回调函数,所有的回调函数必须返回an iterable of Request and/or dicts or Item objects.

#9、log(message[, level, component]):了解
Wrapper that sends a log message through the Spider’s logger, kept for backwards compatibility. For more information see Logging from Spiders.

#10、closed(reason)
爬虫程序结束时自动触发

定制scrapy.spider属性与方法详解

4858.com 824858.com 83

去重规则应该多个爬虫共享的,但凡一个爬虫爬取了,其他都不要爬了,实现方式如下

#方法一:
1、新增类属性
visited=set() #类属性

2、回调函数parse方法内:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......

    self.visited.add(response.url) 

#方法一改进:针对url可能过长,所以我们存放url的hash值
def parse(self, response):
        url=md5(response.request.url)
    if url in self.visited:
        return None
    .......

    self.visited.add(url) 

#方法二:Scrapy自带去重功能
配置文件:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' #默认的去重规则帮我们去重,去重规则在内存中
DUPEFILTER_DEBUG = False
JOBDIR = "保存范文记录的日志路径,如:/root/"  # 最终路径为 /root/requests.seen,去重规则放文件中

scrapy自带去重规则默认为RFPDupeFilter,只需要我们指定
Request(...,dont_filter=False) ,如果dont_filter=True则告诉Scrapy这个URL不参与去重。

#方法三:
我们也可以仿照RFPDupeFilter自定义去重规则,

from scrapy.dupefilter import RFPDupeFilter,看源码,仿照BaseDupeFilter

#步骤一:在项目目录下自定义去重文件dup.py
class UrlFilter(object):
    def __init__(self):
        self.visited = set() #或者放到数据库

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True
        self.visited.add(request.url)

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass

#步骤二:配置文件settings.py:
DUPEFILTER_CLASS = '项目名.dup.UrlFilter'


# 源码分析:
from scrapy.core.scheduler import Scheduler
见Scheduler下的enqueue_request方法:self.df.request_seen(request)

去重法则:去除重复的url

4858.com 844858.com 85

#例一:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)


#例二:一个回调函数返回多个Requests和Items
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)


#例三:在start_requests()内直接指定起始爬取的urls,start_urls就没有用了,

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

例子

4858.com 864858.com 87

我们可能需要在命令行为爬虫程序传递参数,比如传递初始的url,像这样
#命令行执行
scrapy crawl myspider -a category=electronics

#在__init__方法中可以接收外部传进来的参数
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        #...


#注意接收的参数全都是字符串,如果想要结构化的数据,你需要用类似json.loads的方法

参数字传送递

6、其余通用Spiders:https://docs.scrapy.org/en/latest/topics/spiders.html\#generic-spiders

**六Selectors**

#1 //与/    第一个是子子孙孙的意思第二个是儿子
#2 text
#3、extract与extract_first:从selector对象中解出内容
#4、属性:xpath的属性加前缀@
#4、嵌套查找
#5、设置默认值
#4、按照属性查找
#5、按照属性模糊查找
#6、正则表达式
#7、xpath相对路径
#8、带变量的xpath

4858.com 884858.com 89

response.selector.css()
response.selector.xpath()
可简写为
response.css()
response.xpath()

#1 //与/
response.xpath('//body/a/')#
response.css('div a::text')

>>> response.xpath('//body/a') #开头的//代表从整篇文档中寻找,body之后的/代表body的儿子
[]
>>> response.xpath('//body//a') #开头的//代表从整篇文档中寻找,body之后的//代表body的子子孙孙
[<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="
image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

#2 text
>>> response.xpath('//body//a/text()')
>>> response.css('body a::text')

#3、extract与extract_first:从selector对象中解出内容
>>> response.xpath('//div/a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.css('div a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

>>> response.xpath('//div/a/text()').extract_first()
'Name: My image 1 '
>>> response.css('div a::text').extract_first()
'Name: My image 1 '

#4、属性:xpath的属性加前缀@
>>> response.xpath('//div/a/@href').extract_first()
'image1.html'
>>> response.css('div a::attr(href)').extract_first()
'image1.html'

#4、嵌套查找
>>> response.xpath('//div').css('a').xpath('@href').extract_first()
'image1.html'

#5、设置默认值
>>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
'not found'

#4、按照属性查找
response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
response.css('#images a[@href="image3.html"]/text()').extract()

#5、按照属性模糊查找
response.xpath('//a[contains(@href,"image")]/@href').extract()
response.css('a[href*="image"]::attr(href)').extract()

response.xpath('//a[contains(@href,"image")]/img/@src').extract()
response.css('a[href*="imag"] img::attr(src)').extract()

response.xpath('//*[@href="image1.html"]')
response.css('*[href="image1.html"]')

#6、正则表达式
response.xpath('//a/text()').re(r'Name: (.*)')
response.xpath('//a/text()').re_first(r'Name: (.*)')

#7、xpath相对路径
>>> res=response.xpath('//a[contains(@href,"3")]')[0]
>>> res.xpath('img')
[<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('./img')
[<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
>>> res.xpath('//img') #这就是从头开始扫描
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>, <Selector xpa
th='//img' data='<img src="image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

#8、带变量的xpath
>>> response.xpath('//div[@id=$xxx]/a/text()',xxx='images').extract_first()
'Name: My image 1 '
>>> response.xpath('//div[count(a)=$yyy]/@id',yyy=5).extract_first() #求有5个a标签的div的id
'images'

View Code

七Items

八Item Pipeline

4858.com 904858.com 91

#一:可以写多个Pipeline类
#1、如果优先级高的Pipeline的process_item返回一个值或者None,会自动传给下一个pipline的process_item,
#2、如果只想让第一个Pipeline执行,那得让第一个pipline的process_item抛出异常raise DropItem()

#3、可以用spider.name == '爬虫名' 来控制哪些爬虫用哪些pipeline

二:示范
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        print('000000')

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        print('111111')


    def process_item(self, item, spider):
        # 操作并进行持久化

        # return表示会被后续的pipeline继续处理
        return item

        # 表示将item丢弃,不会被后续pipeline处理
        # raise DropItem()

自定义pipeline

4858.com 924858.com 93

#1、settings.py
HOST="127.0.0.1"
PORT=27017
USER="root"
PWD="123"
DB="amazon"
TABLE="goods"



ITEM_PIPELINES = {
   'Amazon.pipelines.CustomPipeline': 200,
}

#2、pipelines.py
class CustomPipeline(object):
    def __init__(self,host,port,user,pwd,db,table):
        self.host=host
        self.port=port
        self.user=user
        self.pwd=pwd
        self.db=db
        self.table=table

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy会先通过getattr判断我们是否自定义了from_crawler,有则调它来完
        成实例化
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST,PORT,USER,PWD,DB,TABLE)

    def open_spider(self,spider):
        """
        爬虫刚启动时执行一次
        """
        self.client = MongoClient('mongodb://%s:%s@%s:%s' %(self.user,self.pwd,self.host,self.port))

    def close_spider(self,spider):
        """
        爬虫关闭时执行一次
        """
        self.client.close()


    def process_item(self, item, spider):
        # 操作并进行持久化

        self.client[self.db][self.table].save(dict(item))

示范

九Dowloader Middeware

4858.com 944858.com 95

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request: 
        :param spider: 
        :return:  
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        pass



    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return: 
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        return None

下载器中间件

4858.com 964858.com 97

class DownMiddleware1(object):
    @staticmethod
    def get_proxy():
        return requests.get("http://127.0.0.1:5010/get/").text

    @staticmethod
    def delete_proxy(proxy):
        requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request:
        :param spider:
        :return:
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """

        if not hasattr(DownMiddleware1,'proxy_addr'):
            DownMiddleware1.proxy_addr = self.get_proxy()
        request.meta['download_timeout'] = 5
        request.meta["proxy"] = "http://" + self.proxy_addr

        print('元数据',request.meta)

        if request.meta.get('depth') == 10 or request.meta.get('retry_times') == 2:
            request.meta['depth'] = 0
            request.meta['retry_times']=0
            self.delete_proxy(self.proxy_addr)

            DownMiddleware1.proxy_addr=self.get_proxy()
            request.meta["proxy"] = "http://" + self.proxy_addr
            print('============>',request.meta)
            return request

        return None

View Code

十Spider Middleware

4858.com 984858.com 99

class SpiderMiddleware(object):

    def process_spider_input(self,response, spider):
        """
        下载完成,执行,然后交给parse处理
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self,response, result, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
        """
        return result

    def process_spider_exception(self,response, exception, spider):
        """
        异常调用
        :param response:
        :param exception:
        :param spider:
        :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
        """
        return None


    def process_start_requests(self,start_requests, spider):
        """
        爬虫启动时调用
        :param start_requests:
        :param spider:
        :return: 包含 Request 对象的可迭代对象
        """
        return start_requests

爬虫中间件

十一 settings.py

4858.com 1004858.com 101

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. 爬虫名称
BOT_NAME = 'step8_king'

# 2. 爬虫应用路径
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. 客户端 user-agent请求头
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. 禁止爬虫配置
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. 并发请求数
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. 延迟下载秒数
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. 是否支持cookie,cookiejar进行操作cookie
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. Telnet用于查看当前爬虫的信息,操作爬虫等...
#    使用telnet ip port ,然后通过命令操作
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. 默认请求头
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. 定义pipeline处理请求
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. 自定义扩展,基于信号进行调用
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度
# DEPTH_LIMIT = 3

# 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo

# 后进先出,深度优先
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# 先进先出,广度优先

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. 调度器队列
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. 访问URL去重
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
17. 自动限速算法
    from scrapy.contrib.throttle import AutoThrottle
    自动限速设置
    1. 获取最小延迟 DOWNLOAD_DELAY
    2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
    3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
    4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
    5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# 开始自动限速
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# 初始下载延迟
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# 最大下载延迟
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# 平均每秒并发数
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# 是否显示
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. 启用缓存
    目的用于将已经发送的请求或相应缓存下来,以便以后使用

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# 是否启用缓存策略
# HTTPCACHE_ENABLED = True

# 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# 缓存超时时间
# HTTPCACHE_EXPIRATION_SECS = 0

# 缓存保存路径
# HTTPCACHE_DIR = 'httpcache'

# 缓存忽略的Http状态码
# HTTPCACHE_IGNORE_HTTP_CODES = []

# 缓存存储的插件
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. 代理,需要在环境变量中设置
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    方式一:使用默认
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    方式二:使用自定义下载中间件

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }

"""

"""
20. Https访问
    Https访问时有两种情况:
    1. 要爬取网站使用的可信任证书(默认支持)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. 要爬取网站使用的自定义证书
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # pKey对象
                    certificate=v2,  # X509对象
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    其他:
        相关类
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        相关配置
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""



"""
21. 爬虫中间件
    class SpiderMiddleware(object):

        def process_spider_input(self,response, spider):
            '''
            下载完成,执行,然后交给parse处理
            :param response: 
            :param spider: 
            :return: 
            '''
            pass

        def process_spider_output(self,response, result, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
            '''
            return result

        def process_spider_exception(self,response, exception, spider):
            '''
            异常调用
            :param response:
            :param exception:
            :param spider:
            :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
            '''
            return None


        def process_start_requests(self,start_requests, spider):
            '''
            爬虫启动时调用
            :param start_requests:
            :param spider:
            :return: 包含 Request 对象的可迭代对象
            '''
            return start_requests

    内置爬虫中间件:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}


"""
22. 下载中间件
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            请求需要被下载时,经过所有下载器中间件的process_request调用
            :param request:
            :param spider:
            :return:
                None,继续后续中间件去下载;
                Response对象,停止process_request的执行,开始执行process_response
                Request对象,停止中间件的执行,将Request重新调度器
                raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
            '''
            pass



        def process_response(self, request, response, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return:
                Response 对象:转交给其他中间件process_response
                Request 对象:停止中间件,request会被重新调度下载
                raise IgnoreRequest 异常:调用Request.errback
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
            :param response:
            :param exception:
            :param spider:
            :return:
                None:继续交给后续中间件处理异常;
                Response对象:停止后续process_exception方法
                Request对象:停止中间件,request将会被重新调用下载
            '''
            return None


    默认下载中间件
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

settings.py

 

十二爬取亚马逊(Amazon)商品新闻

4858.com 1024858.com 103

1、
scrapy startproject Amazon
cd Amazon
scrapy genspider spider_goods www.amazon.cn

2、settings.py
ROBOTSTXT_OBEY = False
#请求头
DEFAULT_REQUEST_HEADERS = {
    'Referer':'https://www.amazon.cn/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
}
#打开注释
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3、items.py
class GoodsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #商品名字
    goods_name = scrapy.Field()
    #价钱
    goods_price = scrapy.Field()
    #配送方式
    delivery_method=scrapy.Field()

4、spider_goods.py
# -*- coding: utf-8 -*-
import scrapy

from Amazon.items import  GoodsItem
from scrapy.http import Request
from urllib.parse import urlencode

class SpiderGoodsSpider(scrapy.Spider):
    name = 'spider_goods'
    allowed_domains = ['www.amazon.cn']
    # start_urls = ['http://www.amazon.cn/']


    def __int__(self,keyword=None,*args,**kwargs):
        super(SpiderGoodsSpider).__init__(*args,**kwargs)
        self.keyword=keyword

    def start_requests(self):
        url='https://www.amazon.cn/s/ref=nb_sb_noss_1?'
        paramas={
            '__mk_zh_CN': '亚马逊网站',
            'url': 'search - alias = aps',
            'field-keywords': self.keyword
        }
        url=url+urlencode(paramas,encoding='utf-8')
        yield Request(url,callback=self.parse_index)


    def parse_index(self, response):
        print('解析索引页:%s' %response.url)

        urls=response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        for url in urls:
            yield Request(url,callback=self.parse_detail)

        next_url=response.urljoin(response.xpath('//*[@id="pagnNextLink"]/@href').extract_first())
        print('下一页的url',next_url)
        yield Request(next_url,callback=self.parse_index)

    def parse_detail(self,response):
        print('解析详情页:%s' %(response.url))

        item=GoodsItem()
        # 商品名字
        item['goods_name'] = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()
        # 价钱
        item['goods_price'] = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first().strip()
        # 配送方式
        item['delivery_method'] = ''.join(response.xpath('//*[@id="ddmMerchantMessage"]//text()').extract())
        return item

5、自定义pipelines
#sql.py
import pymysql
import settings


MYSQL_HOST=settings.MYSQL_HOST
MYSQL_PORT=settings.MYSQL_PORT
MYSQL_USER=settings.MYSQL_USER
MYSQL_PWD=settings.MYSQL_PWD
MYSQL_DB=settings.MYSQL_DB

conn=pymysql.connect(
    host=MYSQL_HOST,
    port=int(MYSQL_PORT),
    user=MYSQL_USER,
    password=MYSQL_PWD,
    db=MYSQL_DB,
    charset='utf8'
)
cursor=conn.cursor()

class Mysql(object):
    @staticmethod
    def insert_tables_goods(goods_name,goods_price,deliver_mode):
        sql='insert into goods(goods_name,goods_price,delivery_method) values(%s,%s,%s)'
        cursor.execute(sql,args=(goods_name,goods_price,deliver_mode))
        conn.commit()

    @staticmethod
    def is_repeat(goods_name):
        sql='select count(1) from goods where goods_name=%s'
        cursor.execute(sql,args=(goods_name,))
        if cursor.fetchone()[0] >= 1:
            return True

if __name__ == '__main__':
    cursor.execute('select * from goods;')
    print(cursor.fetchall())


#pipelines.py
from Amazon.mysqlpipelines.sql import Mysql


class AmazonPipeline(object):
    def process_item(self, item, spider):
        goods_name=item['goods_name']
        goods_price=item['goods_price']
        delivery_mode=item['delivery_method']
        if not Mysql.is_repeat(goods_name):
            Mysql.insert_table_goods(goods_name,goods_price,delivery_mode)



6、创建数据库表
create database amazon charset utf8;
create table goods(
    id int primary key auto_increment,
    goods_name char(30),
    goods_price char(20),
    delivery_method varchar(50)
);

7、settings.py
MYSQL_HOST='localhost'
MYSQL_PORT='3306'
MYSQL_USER='root'
MYSQL_PWD='123'
MYSQL_DB='amazon'


#数字代表优先级程度(1-1000随意设置,数值越低,组件的优先级越高)
ITEM_PIPELINES = {
   'Amazon.mysqlpipelines.pipelines.mazonPipeline': 1,
}


#8、在项目目录下新建:entrypoint.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'spider_goods','-a','keyword=iphone8'])

View Code

 

 

发表评论

电子邮件地址不会被公开。 必填项已用*标注

网站地图xml地图
Copyright @ 2010-2019 美高梅手机版4858 版权所有