Monday, January 15, 2018, sunny

I. Installation: pip install Scrapy. If you don't know what pip is, please search for it yourself. My install went smoothly; the environment is 64-bit Win7 with 32-bit Python 2.7.
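A quick way to confirm the install worked is to print Scrapy's version from the Python prompt; a minimal check, assuming the same Python 2.7 setup:

[code]
# Confirms Scrapy is importable and shows which version was installed.
import scrapy
print scrapy.__version__
[/code]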

II. Local usage and an example. This post uses Scrapy to crawl the meme image IDs and tags from doutula.com as an example.

  1. Create the project. At the command line, run scrapy startproject doutula, where doutula is the name of the site. The command creates a new project skeleton, and we just fill it in.
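The spider itself (dtl.py below) started life as a framework-generated spider that was then renamed. If you want Scrapy to generate that skeleton for you, here is a sketch using the same scrapy.cmdline trick as go.py, run from inside the project directory; the plain shell equivalent is scrapy genspider dtl doutula.com:

[code]
# Generates a spider named "dtl" for doutula.com from the default template.
from scrapy.cmdline import execute

execute(['scrapy', 'genspider', 'dtl', 'doutula.com'])
[/code]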

My local layout looks like this:

go.py - my own script for running the crawl locally
scrapy.cfg - configuration used for deployment
doutula/
    spiders/
        __init__.py
        dtl.py - the spider generated by the framework and then renamed; don't give it the same name as the project
    __init__.py
    items.py - generated by the framework, somewhat like models.py
    middlewares.py - generated by the framework, Scrapy's middleware
    pipelines.py - generated by the framework, the part Scrapy uses to persist data
    settings.py - generated by the framework, Scrapy's settings
    ualist.py - added by me; it is just a list of user-agent strings

Below I go through what each file does.

(1) go.py: just change the third argument, i.e. the spider's name. [code]
#! /usr/bin/env python
#coding=utf-8
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'dtl'])
[/code]

(2) scrapy.cfg: just have a look; normally there is nothing to change.
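For reference, the generated scrapy.cfg is only a few lines and should look roughly like this (the commented deploy URL varies by Scrapy version):

[code]
[settings]
default = doutula.settings

[deploy]
#url = http://localhost:6800/
project = doutula
[/code]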

(3) items.py: somewhat like models.py; just copy the pattern and declare the fields you want to crawl. [code]
import scrapy

class DoutulaItem(scrapy.Item):
    name = scrapy.Field()
    pid = scrapy.Field()
[/code]

(4) middlewares.py: Scrapy's middleware. Look through it to see what can be hooked here. I tried rotating the user agent at random in this file, and it works. [code]
def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.

    # Must return only requests (not items).
    for r in start_requests:
        # yobin added 2018-01-08: rotate the user agent; this is only for
        # reference, feel free to comment it out
        r.headers["User-Agent"] = getRandomUA()
        #print r.headers["User-Agent"]
        yield r
[/code]
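getRandomUA() is not a Scrapy function; it comes from the ualist.py listed in the layout above, which this post does not show. A minimal sketch of what that file might contain: the function name is taken from the snippet above, and the list entries are placeholders of my own.

[code]
#coding=utf-8
# ualist.py: a plain list of user-agent strings and a helper to pick one at random.
import random

UA_LIST = [
    'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
]

def getRandomUA():
    # Return one randomly chosen user-agent string.
    return random.choice(UA_LIST)
[/code]

The middleware then needs an import along the lines of from doutula.ualist import getRandomUA.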

(5) settings.py: the crawler's configuration file, and a very important one. By the way, I installed scrapy-redis (search GitHub for it); scrapy-redis gives you request deduplication and other goodies.
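If you want to follow along with that part, scrapy-redis installs like any other package (and it needs a running Redis server):

[code]
pip install scrapy-redis
[/code]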

My settings file is as follows: [code]
BOT_NAME = 'doutula'

SPIDER_MODULES = ['doutula.spiders']
NEWSPIDER_MODULE = 'doutula.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# scrapy-redis begin
# Enables scheduling storing requests queue in redis.
#SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # if enabled, pages crawled in earlier runs are not crawled again

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True

#REDIS_URL = 'redis://127.0.0.1:6379'
# scrapy-redis end

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'doutula.middlewares.DoutulaSpiderMiddleware': 543,
}
# scrapy-redis ships a middleware as well; 543 is the priority, and the lower the number, the higher the priority

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'doutula.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'doutula.pipelines.DoutulaPipeline': 300,
    #'scrapy_redis.pipelines.RedisPipeline': 400,
}
# pipelines have priorities too

# Below is the MongoDB configuration; these settings can be read from your own code
MONGO_URI = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "yydb"
MONGODB_COLLECTION = "pidc"

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# To crawl faster and put less load on the site, it is best to uncomment the HTTPCACHE settings above.

MONGO_URI = 'localhost'
MONGO_DATABASE = ''
[/code]
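The MongoDB values above are ordinary module-level settings. Inside a pipeline they are read through crawler.settings (shown in pipelines.py below); outside a crawl, a small sketch like this works too, assuming it runs from the project directory:

[code]
# Read the project settings from a standalone script.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print settings.get('MONGODB_DB')       # 'yydb'
print settings.getint('MONGODB_PORT')  # 27017
[/code]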

(6) dtl.py: the spider's file name must never be the same as the project's name, otherwise the import may fail.

This is the spider I renamed. Its main job is to generate the pages to crawl and then process them, which happens through asynchronous callbacks. Pages can be parsed in several ways, e.g. re or BeautifulSoup; here I use XPath. I'm still more used to regular expressions and haven't found XPath to offer many advantages.

I won't walk through the XPath in the example; that is something to study on your own.
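If you do want to experiment with the XPath before putting it in the spider, Scrapy's Selector works on its own; the HTML snippet below is a made-up stand-in for the real doutula markup. You can also run scrapy shell "https://www.doutula.com/article/list/?page=1" and try the expressions against the live page.

[code]
#coding=utf-8
# Try the XPath expressions from dtl.py against a toy HTML fragment.
from scrapy.selector import Selector

html = ('<img class="lazy image_dtb img-responsive" '
        'data-original="http://img.example.com/abc123def.jpg" alt="example tag">')
sel = Selector(text=html)
print sel.xpath('//img[@class="lazy image_dtb img-responsive"]/@data-original').extract()
print sel.xpath('//img[@class="lazy image_dtb img-responsive"]/@alt').extract()
[/code]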

[code]

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from doutula.items import DoutulaItem


class doutulaSpider(scrapy.Spider):
    name = 'dtl'
    allowed_domains = ['doutula.com']  # domains the spider is allowed to crawl
    #start_urls = ['https://www.doutula.com/article/list/?page=1']
    base_url = 'https://www.doutula.com/article/list/?page=%d'

    def start_requests(self):
        # Override this method and yield the URLs to crawl
        for i in range(1, 541):
            url = self.base_url % i
            yield Request(url, self.parse)

    def parse(self, response):
        pids = response.xpath('//img[@class="lazy image_dtb img-responsive"]/@data-original').extract()
        alts = response.xpath('//img[@class="lazy image_dtb img-responsive"]/@alt').extract()
        for loop, pid in enumerate(pids):
            item = DoutulaItem()  # a fresh item for each image
            item['pid'] = pid[pid.rfind('/') + 1:pid.rfind('.')]
            item['name'] = alts[loop]
            yield item  # each yielded item is handed on to the pipelines

[/code]
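For comparison, here is a rough regex version of the same extraction, only a sketch: it assumes data-original comes before alt inside each img tag, which may not match the real page exactly.

[code]
#coding=utf-8
# Regex-based counterpart of the XPath extraction in parse(); returns (pid, name) pairs.
import re

IMG_RE = re.compile(
    r'<img[^>]+class="lazy image_dtb img-responsive"[^>]*'
    r'data-original="([^"]+)"[^>]*alt="([^"]*)"')

def extract_pairs(html):
    pairs = []
    for url, alt in IMG_RE.findall(html):
        pid = url[url.rfind('/') + 1:url.rfind('.')]  # same slicing as parse()
        pairs.append((pid, alt))
    return pairs
[/code]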

(7) pipelines.py: this file is mainly for persisting the crawled data; you can store it in whatever form you like, for example JSON.

[code]
import json

class JsonPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # only keep items that have both a name and a pid
        if item['name'] and item['pid']:
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
        return item
[/code]

Here is another example for MongoDB (you need MongoDB installed); basically you just fill in the template. [code]
from scrapy.exceptions import DropItem
import pymongo
import json

class DoutulaPipeline(object):
    collection_name = 'pidc'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.ids_seen = set()
        self.names_seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if len(item['pid']) != 32:
            raise DropItem("Drop item found: %s" % item)
        else:
            if item['pid'] in self.ids_seen or item['name'] in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['pid'])
                self.names_seen.add(item['name'])
                self.db[self.collection_name].insert_one(dict(item))
                return item

[/code]
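After a crawl, a quick way to check what ended up in MongoDB (the database and collection names follow the settings above; count() is the older pymongo call, newer versions prefer count_documents()):

[code]
# Peek at the items stored by DoutulaPipeline.
import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['yydb']['pidc']
print coll.count()     # number of stored items
print coll.find_one()  # one sample document with pid and name fields
[/code]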

For the example above, you need both the MongoDB service and the Redis service running before you start Scrapy. If that feels like too much trouble, comment out the MongoDB- and Redis-related code and run without them.
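Roughly, the startup order looks like this; the MongoDB data directory is a placeholder, so adjust the paths to your own install:

[code]
redis-server                  # start Redis with its default config
mongod --dbpath C:\data\db    # start MongoDB (placeholder data directory)
cd doutula
scrapy crawl dtl              # or: python go.py
[/code]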