Monday, January 15, 2018. Sunny.
1. Installation: run pip install Scrapy. If you are not familiar with pip, please look it up yourself. My installation went smoothly; the environment is Windows 7 64-bit with Python 2.7 32-bit.
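A quick way to confirm the install worked is to import Scrapy and print its version. This is just a small sanity check I am adding for reference, not part of the crawler itself:
[code]
# Sanity check: if this runs without an ImportError, Scrapy is installed.
import scrapy
print(scrapy.__version__)
[/code]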
2. Local usage and an example: this post crawls the sticker IDs and tags from doutula.com as an example of using Scrapy.
- Create the project: at the command line, run scrapy startproject doutula, where doutula is the name of the site. The command generates a new project skeleton, and we just fill it in.
My local layout is as follows:
go.py - a script I wrote to run the crawler locally
scrapy.cfg - configuration used when deploying
doutula/
    spiders/
        __init__.py
        dtl.py - the spider generated by the framework and then renamed; do not give it the same name as the project
    __init__.py
    items.py - generated by the framework, somewhat like models.py
    middlewares.py - generated by the framework, Scrapy middlewares
    pipelines.py - generated by the framework, the part Scrapy uses to persist data
    settings.py - generated by the framework, Scrapy settings
    ualist.py - added by me, just a list of user-agent strings
Below, each file is described in turn.
(1) go.py: just change the third argument, which is the spider name.
[code]
#! /usr/bin/env python
# coding=utf-8
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'dtl'])
[/code]
(2) scrapy.cfg: just have a look; it usually does not need to be changed.
(3) items.py: somewhat like models.py; just follow the template and declare the fields you want to scrape.
[code]
import scrapy

class DoutulaItem(scrapy.Item):
    name = scrapy.Field()
    pid = scrapy.Field()
[/code]
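For reference, a Scrapy Item is used much like a dict, which is how the spider and pipelines later in this post treat it. A minimal sketch (the field values here are just made-up examples):
[code]
# An Item is used like a dict: assign the declared fields, then read them back.
from doutula.items import DoutulaItem

item = DoutulaItem()
item['pid'] = '0123456789abcdef0123456789abcdef'  # example value only
item['name'] = u'some tag'
print(dict(item))  # the pipelines below convert items to plain dicts the same way
[/code]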
(4) middlewares.py: the middlewares; see what can be tweaked here. I tried switching the User-Agent at random in this file, and it works.
[code]
    # This method sits inside the DoutulaSpiderMiddleware class generated by the framework.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            r.headers["User-Agent"] = getRandomUA()  # added by yobin 2018-01-08; for reference only, comment it out if you don't need it
            # print r.headers["User-Agent"]
            yield r
[/code]
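The getRandomUA() call above comes from my ualist.py. The contents of that file are not shown in this post, so the sketch below is only an illustration under the assumption that it is a plain list of user-agent strings plus a random picker; the actual strings are examples, not the real list:
[code]
# ualist.py - a minimal sketch; the user-agent strings here are illustrative only.
# coding=utf-8
import random

UA_LIST = [
    'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0',
]

def getRandomUA():
    # Return one user-agent string at random; used by process_start_requests above.
    return random.choice(UA_LIST)
[/code]
With a layout like this, middlewares.py would also need something like from doutula.ualist import getRandomUA at the top.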
(5) settings.py: the crawler's configuration file, and an important one. By the way, I also installed scrapy-redis (search GitHub for it). scrapy-redis gives you request deduplication and other goodies; it has plenty of benefits.
My configuration file is as follows:
[code]
BOT_NAME = 'doutula'

SPIDER_MODULES = ['doutula.spiders']
NEWSPIDER_MODULE = 'doutula.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.1; Windows NT 5.1; SV1)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# scrapy-redis begin
# Enables scheduling storing requests queue in redis.
#SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # if this is enabled, URLs that were already crawled are skipped
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True
#REDIS_URL = 'redis://127.0.0.1:6397'
# scrapy-redis end

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'doutula.middlewares.DoutulaSpiderMiddleware': 543,
}
# scrapy-redis ships a middleware as well; values like 543 are priorities, and the lower the number, the higher the priority

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'doutula.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'doutula.pipelines.DoutulaPipeline': 300,
    #'scrapy_redis.pipelines.RedisPipeline': 400,
}
# pipelines have priorities as well

# MongoDB settings below; they can be imported and used elsewhere
MONGO_URI = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "yydb"
MONGODB_COLLECTION = "pidc"

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# To crawl faster and reduce the load on the site, it is best to uncomment the HTTPCACHE settings above.

MONGO_URI = 'localhost'
MONGO_DATABASE = ''
[/code]
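Since the comment above notes that the MongoDB settings can be imported and used elsewhere, here is one way to read them outside of a pipeline. This is only a sketch using Scrapy's get_project_settings helper, not code from the original project:
[code]
# A sketch of reading the MongoDB settings defined in settings.py from a standalone script.
import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
client = pymongo.MongoClient(settings.get('MONGO_URI'),
                             settings.get('MONGODB_PORT', 27017))
db = client[settings.get('MONGODB_DB', 'yydb')]
print(db[settings.get('MONGODB_COLLECTION', 'pidc')].count())
[/code]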
(6) dtl.py: the spider's file name must not be the same as the project name, otherwise the import may fail.
This is the spider I renamed. Its main job is to generate the pages to crawl and then process them, via asynchronous callbacks. The pages can be parsed in several ways, for example with re or BeautifulSoup; here I use XPath. I am still more used to regular expressions and have not found XPath to offer that many advantages (a regex alternative is sketched after the spider code below).
I will not explain the XPath used in the example; you will need to study it on your own.
[code]
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from doutula.items import DoutulaItem

class doutulaSpider(scrapy.Spider):
    name = 'dtl'
    allowed_domains = ['doutula.com']  # domains the spider is allowed to crawl
    #start_urls = ['https://www.doutula.com/article/list/?page=1']
    base_url = 'https://www.doutula.com/article/list/?page=%d'

    def start_requests(self):  # override this method and yield the URLs to crawl
        for i in range(1, 541):
            url = self.base_url % (i)
            yield Request(url, self.parse)

    def parse(self, response):
        pids = response.xpath('''//img[@class="lazy image_dtb img-responsive"]/@data-original''').extract()
        alts = response.xpath('''//img[@class="lazy image_dtb img-responsive"]/@alt''').extract()
        for loop, pid in enumerate(pids):
            item = DoutulaItem()  # create a fresh item per image
            item['pid'] = pid[pid.rfind('/')+1:pid.rfind('.')]
            item['name'] = alts[loop]
            yield item  # yield the item here; it is then passed on to the pipeline
[/code]
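Since I said I prefer regular expressions, here is a rough regex equivalent of the XPath extraction above. The pattern is my own approximation of the page markup (attribute order and spacing may differ on the real pages), so treat it as a sketch rather than a drop-in replacement:
[code]
# -*- coding: utf-8 -*-
# A regex-based sketch of the same extraction done with XPath above.
# It assumes the img tags look roughly like:
#   <img class="lazy image_dtb img-responsive" ... data-original="...url..." alt="...tag..." ...>
import re

IMG_RE = re.compile(
    r'<img[^>]*class="lazy image_dtb img-responsive"[^>]*'
    r'data-original="([^"]+)"[^>]*alt="([^"]*)"', re.S)

def parse_with_re(html):
    # Yield the same pid/name pairs the XPath version produces.
    for url, alt in IMG_RE.findall(html):
        pid = url[url.rfind('/')+1:url.rfind('.')]
        yield {'pid': pid, 'name': alt}
[/code]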
(7) pipelines.py: this file is mainly used to persist the scraped data; you can store it in whatever form you like, for example as JSON.
[code]
import json

class JsonPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        if item['name'] and item['pid']:
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
        return item
[/code]
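For this JSON pipeline to actually run, it also has to be enabled in ITEM_PIPELINES in settings.py. A minimal sketch (the priority value 500 is an arbitrary choice of mine, just later than the MongoDB pipeline's 300):
[code]
# settings.py - enable both pipelines; lower numbers run first
ITEM_PIPELINES = {
    'doutula.pipelines.DoutulaPipeline': 300,
    'doutula.pipelines.JsonPipeline': 500,  # 500 is arbitrary, chosen to run after 300
}
[/code]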
Here is another example that writes to MongoDB (MongoDB needs to be installed); again, you can mostly just follow the template and fill it in.
[code]
from scrapy.exceptions import DropItem
import pymongo
import json

class DoutulaPipeline(object):
    collection_name = 'pidc'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.ids_seen = set()
        self.names_seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if len(item['pid']) != 32:
            raise DropItem("Drop item found: %s" % item)
        else:
            if item['pid'] in self.ids_seen or item['name'] in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['pid'])
                self.names_seen.add(item['name'])
                self.db[self.collection_name].insert_one(dict(item))
                return item
[/code]
For the examples above, you need the MongoDB service and the Redis service running first, and then you run Scrapy. If that feels like too much trouble, you can comment out the MongoDB- and Redis-related code and run without them.
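Once a crawl has finished, one quick way to confirm that data actually landed in MongoDB is a few lines of pymongo. This is just a sketch against the database and collection names used in settings.py above:
[code]
# A quick check of the scraped data in MongoDB (names taken from settings.py above).
import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['yydb']['pidc']
print(coll.count())     # how many items were stored
print(coll.find_one())  # peek at one stored item
[/code]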
...