Monday, January 15, 2018, sunny

I. Installation: pip install Scrapy. If you don't know what pip is, please search for it yourself. My install went smoothly; the environment is 64-bit Win7 with 32-bit Python 2.7.
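A quick way to confirm the install worked is to print Scrapy's version from the Python prompt; a minimal check, assuming the same Python 2.7 setup:

[code]
# Confirms Scrapy is importable and shows which version was installed.
import scrapy
print scrapy.__version__
[/code]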

II. Local usage and an example. This post uses Scrapy to crawl the meme image IDs and tags from doutula.com as an example.

  1. Create the project. At the command line, run scrapy startproject doutula, where doutula is the name of the site. The command creates a new project skeleton, and we just fill it in.
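The spider itself (dtl.py below) started life as a framework-generated spider that was then renamed. If you want Scrapy to generate that skeleton for you, here is a sketch using the same scrapy.cmdline trick as go.py, run from inside the project directory; the plain shell equivalent is scrapy genspider dtl doutula.com:

[code]
# Generates a spider named "dtl" for doutula.com from the default template.
from scrapy.cmdline import execute

execute(['scrapy', 'genspider', 'dtl', 'doutula.com'])
[/code]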

My local layout looks like this:

go.py - my own script for running the crawl locally
scrapy.cfg - configuration used for deployment
doutula/
    spiders/
        __init__.py
        dtl.py - the spider generated by the framework and then renamed; don't give it the same name as the project
    __init__.py
    items.py - generated by the framework, somewhat like models.py
    middlewares.py - generated by the framework, Scrapy's middleware
    pipelines.py - generated by the framework, the part Scrapy uses to persist data
    settings.py - generated by the framework, Scrapy's settings
    ualist.py - added by me; it is just a list of user-agent strings

Below I go through what each file does.

(1) go.py: just change the third argument, i.e. the spider's name. [code]
#! /usr/bin/env python
#coding=utf-8
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'dtl'])
[/code]

(2) scrapy.cfg: just have a look; normally there is nothing to change.
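For reference, the generated scrapy.cfg is only a few lines and should look roughly like this (the commented deploy URL varies by Scrapy version):

[code]
[settings]
default = doutula.settings

[deploy]
#url = http://localhost:6800/
project = doutula
[/code]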

(3) items.py: somewhat like models.py; just copy the pattern and declare the fields you want to crawl. [code]
import scrapy

class DoutulaItem(scrapy.Item):
    name = scrapy.Field()
    pid = scrapy.Field()
[/code]

(4) middlewares.py: Scrapy's middleware. Look through it to see what can be hooked here. I tried rotating the user agent at random in this file, and it works. [code]
def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.

    # Must return only requests (not items).
    for r in start_requests:
        # yobin added 2018-01-08: rotate the user agent; this is only for
        # reference, feel free to comment it out
        r.headers["User-Agent"] = getRandomUA()
        #print r.headers["User-Agent"]
        yield r
[/code]
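getRandomUA() is not a Scrapy function; it comes from the ualist.py listed in the layout above, which this post does not show. A minimal sketch of what that file might contain: the function name is taken from the snippet above, and the list entries are placeholders of my own.

[code]
#coding=utf-8
# ualist.py: a plain list of user-agent strings and a helper to pick one at random.
import random

UA_LIST = [
    'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
]

def getRandomUA():
    # Return one randomly chosen user-agent string.
    return random.choice(UA_LIST)
[/code]

The middleware then needs an import along the lines of from doutula.ualist import getRandomUA.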

(5) settings.py: the crawler's configuration file, and a very important one. By the way, I installed scrapy-redis (search GitHub for it); scrapy-redis gives you request deduplication and other goodies.
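If you want to follow along with that part, scrapy-redis installs like any other package (and it needs a running Redis server):

[code]
pip install scrapy-redis
[/code]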

My settings file is as follows: [code]
BOT_NAME = 'doutula'

SPIDER_MODULES = ['doutula.spiders']
NEWSPIDER_MODULE = 'doutula.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# scrapy-redis begin
# Enables scheduling storing requests queue in redis.
#SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # if enabled, pages crawled in earlier runs are not crawled again

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True

#REDIS_URL = 'redis://127.0.0.1:6379'
# scrapy-redis end

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'doutula.middlewares.DoutulaSpiderMiddleware': 543,
}
# scrapy-redis ships a middleware as well; 543 is the priority, and the lower the number, the higher the priority

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'doutula.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'doutula.pipelines.DoutulaPipeline': 300,
    #'scrapy_redis.pipelines.RedisPipeline': 400,
}
# pipelines have priorities too

# Below is the MongoDB configuration; these settings can be read from your own code
MONGO_URI = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "yydb"
MONGODB_COLLECTION = "pidc"

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# To crawl faster and put less load on the site, it is best to uncomment the HTTPCACHE settings above.

MONGO_URI = 'localhost'
MONGO_DATABASE = ''
[/code]
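The MongoDB values above are ordinary module-level settings. Inside a pipeline they are read through crawler.settings (shown in pipelines.py below); outside a crawl, a small sketch like this works too, assuming it runs from the project directory:

[code]
# Read the project settings from a standalone script.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print settings.get('MONGODB_DB')       # 'yydb'
print settings.getint('MONGODB_PORT')  # 27017
[/code]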

(6) dtl.py: the spider's file name must never be the same as the project's name, otherwise the import may fail.

This is the spider I renamed. Its main job is to generate the pages to crawl and then process them, which happens through asynchronous callbacks. Pages can be parsed in several ways, e.g. re or BeautifulSoup; here I use XPath. I'm still more used to regular expressions and haven't found XPath to offer many advantages.

I won't walk through the XPath in the example; that is something to study on your own.
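If you do want to experiment with the XPath before putting it in the spider, Scrapy's Selector works on its own; the HTML snippet below is a made-up stand-in for the real doutula markup. You can also run scrapy shell "https://www.doutula.com/article/list/?page=1" and try the expressions against the live page.

[code]
#coding=utf-8
# Try the XPath expressions from dtl.py against a toy HTML fragment.
from scrapy.selector import Selector

html = ('<img class="lazy image_dtb img-responsive" '
        'data-original="http://img.example.com/abc123def.jpg" alt="example tag">')
sel = Selector(text=html)
print sel.xpath('//img[@class="lazy image_dtb img-responsive"]/@data-original').extract()
print sel.xpath('//img[@class="lazy image_dtb img-responsive"]/@alt').extract()
[/code]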

[code]

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from doutula.items import DoutulaItem


class doutulaSpider(scrapy.Spider):
    name = 'dtl'
    allowed_domains = ['doutula.com']  # domains the spider is allowed to crawl
    #start_urls = ['https://www.doutula.com/article/list/?page=1']
    base_url = 'https://www.doutula.com/article/list/?page=%d'

    def start_requests(self):
        # Override this method and yield the URLs to crawl
        for i in range(1, 541):
            url = self.base_url % i
            yield Request(url, self.parse)

    def parse(self, response):
        pids = response.xpath('//img[@class="lazy image_dtb img-responsive"]/@data-original').extract()
        alts = response.xpath('//img[@class="lazy image_dtb img-responsive"]/@alt').extract()
        for loop, pid in enumerate(pids):
            item = DoutulaItem()  # a fresh item for each image
            item['pid'] = pid[pid.rfind('/') + 1:pid.rfind('.')]
            item['name'] = alts[loop]
            yield item  # each yielded item is handed on to the pipelines

[/code]
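For comparison, here is a rough regex version of the same extraction, only a sketch: it assumes data-original comes before alt inside each img tag, which may not match the real page exactly.

[code]
#coding=utf-8
# Regex-based counterpart of the XPath extraction in parse(); returns (pid, name) pairs.
import re

IMG_RE = re.compile(
    r'<img[^>]+class="lazy image_dtb img-responsive"[^>]*'
    r'data-original="([^"]+)"[^>]*alt="([^"]*)"')

def extract_pairs(html):
    pairs = []
    for url, alt in IMG_RE.findall(html):
        pid = url[url.rfind('/') + 1:url.rfind('.')]  # same slicing as parse()
        pairs.append((pid, alt))
    return pairs
[/code]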

(7) pipelines.py: this file is mainly for persisting the crawled data; you can store it in whatever form you like, for example JSON.

[code]
import json

class JsonPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # only keep items that have both a name and a pid
        if item['name'] and item['pid']:
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
        return item
[/code]

Here is another example for MongoDB (you need MongoDB installed); basically you just fill in the template. [code]
from scrapy.exceptions import DropItem
import pymongo
import json

class DoutulaPipeline(object):
    collection_name = 'pidc'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.ids_seen = set()
        self.names_seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if len(item['pid']) != 32:
            raise DropItem("Drop item found: %s" % item)
        else:
            if item['pid'] in self.ids_seen or item['name'] in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['pid'])
                self.names_seen.add(item['name'])
                self.db[self.collection_name].insert_one(dict(item))
                return item

[/code]
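After a crawl, a quick way to check what ended up in MongoDB (the database and collection names follow the settings above; count() is the older pymongo call, newer versions prefer count_documents()):

[code]
# Peek at the items stored by DoutulaPipeline.
import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['yydb']['pidc']
print coll.count()     # number of stored items
print coll.find_one()  # one sample document with pid and name fields
[/code]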

For the example above, you need both the MongoDB service and the Redis service running before you start Scrapy. If that feels like too much trouble, comment out the MongoDB- and Redis-related code and run without them.
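Roughly, the startup order looks like this; the MongoDB data directory is a placeholder, so adjust the paths to your own install:

[code]
redis-server                  # start Redis with its default config
mongod --dbpath C:\data\db    # start MongoDB (placeholder data directory)
cd doutula
scrapy crawl dtl              # or: python go.py
[/code]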