Purpose

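Scrapy is a powerful general-purpose crawling framework, but once the amount of data grows you will want several machines crawling in parallel, and Scrapy does not support distributed crawling out of the box. Scrapy-Redis was created to fill that gap. Take a simple case with 100 urls to crawl: one machine makes 100 requests, while two machines make 50 each, so in the ideal case the crawl time shrinks in proportion to the number of machines. All we need is an efficient way to hand out the 100 urls to several machines, and Redis takes care of that job.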
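The idea of handing out urls from one shared place can be sketched in plain Python. This is only a simulation under simplifying assumptions: a collections.deque stands in for the Redis list, the workers take strict turns, and the example.com urls and the function name crawl_with_workers are made up for illustration.

```python
from collections import deque


def crawl_with_workers(urls, n_workers):
    """Simulate n_workers taking turns popping urls from one shared queue.

    Returns a dict mapping worker index -> list of urls it handled.
    """
    queue = deque(urls)              # stands in for the shared Redis list
    handled = {w: [] for w in range(n_workers)}
    worker = 0
    while queue:
        url = queue.popleft()        # each url is handed out exactly once
        handled[worker].append(url)
        worker = (worker + 1) % n_workers
    return handled


# Hypothetical urls, only for the simulation
urls = ["http://example.com/page/%d" % i for i in range(100)]
one = crawl_with_workers(urls, 1)
two = crawl_with_workers(urls, 2)
```

With one worker, one[0] holds all 100 urls; with two workers, two[0] and two[1] hold 50 each and share no url. Real machines do not take strict turns, so the split is only roughly even in practice.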

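There are plenty of introductions and explanations of the underlying principles online, but detailed working examples are rare or often contain mistakes, so this blog post walks through building a simple distributed crawler from scratch.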

Installing Redis and testing a remote connection

Machine a ip: 10.2.0.10
Machine b
Both machines need Redis installed

sudo apt install redis-server

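Then, on machine a, set the address that Redis accepts outside connections on

sudo vim /etc/redis/redis.conf

Find this line -> #bind 127.0.0.1
Change it to your own ip -> bind 10.2.0.10
Restart Redis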

sudo /etc/init.d/redis-server restart

On machine b, test whether it can connect

redis-cli -h 10.2.0.10

If you see the following prompt, the connection succeeded (6379 is the default port)

10.2.0.10:6379>
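The same check can be done from Python without installing a client library. The sketch below is an assumption-laden minimal version: the helper name redis_ping is made up, it sends PING in Redis's inline command form over a plain socket, and it treats anything other than a +PONG reply (for example an error from a password-protected server) as a failure.

```python
import socket


def redis_ping(host, port=6379, timeout=3.0):
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False                    # refused, unreachable, or timed out
    try:
        sock.sendall(b"PING\r\n")       # inline command form
        reply = sock.recv(16)
        return reply.startswith(b"+PONG")
    except (socket.error, socket.timeout):
        return False
    finally:
        sock.close()
```

On machine b, redis_ping('10.2.0.10') should return True once the bind setting and restart on machine a have taken effect.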

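Hands-on (Apple Daily)

The url is http://www.appledaily.com.tw/realtimenews/section/new/1, with the page number running from 1 to 10
We want to grab the titles in the list

and also go into each article to grab the title on the content page
In total we crawl 10 pages of real-time news: the titles on each page, plus the title inside the article that each one links to

First install scrapy and scrapy-redis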

pip install scrapy

pip install scrapy-redis

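scrapy is required; scrapy-redis replaces the in-memory collections.deque queue with Redis, so that Redis hands out the urls.
scrapy can be tricky to install: different OSes and versions each have their own problems. There are plenty of resources online, but you will have to work through the fixes yourself. Once it is installed, run an example first, and only continue with the example below after you have confirmed that it works.

Create the scrapy project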

scrapy startproject apple

Go into the project you just created

cd apple

Take a look at the structure

tree

The changes below all follow the Scrapy-Redis example
The python version is 2.7 throughout
First edit the settings file (settings.py):

# -*- coding: utf-8 -*-

# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html

BOT_NAME = 'apple'
SPIDER_MODULES = ['apple.spiders']
NEWSPIDER_MODULE = 'apple.spiders'
USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# SCHEDULER_PERSIST controls whether the crawl resumes after an interruption.
# Set it to False while testing; otherwise, once a crawl has finished,
# running it again does nothing.
SCHEDULER_PERSIST = False

# How urls are taken out of Redis: PriorityQueue, Queue, or Stack
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Log level; can be changed to INFO, ERROR, WARNING...
LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay to make use of parallelism to speed up the
# crawl.
# Delay between downloads, in seconds
DOWNLOAD_DELAY = 1

# ip of the Redis server (machine a in this example)
REDIS_HOST = '10.2.0.10'
REDIS_PORT = 6379
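The three SCHEDULER_QUEUE_CLASS options differ only in the order requests come back out. The sketch below imitates the three orderings with the standard library instead of Redis; the request names, the priorities, and the function names are made up for illustration, and the priority version assumes Scrapy's convention that a larger priority value is served earlier.

```python
import heapq
from collections import deque

# (name, priority) pairs, hypothetical requests in arrival order
requests = [("page-a", 0), ("page-b", 5), ("page-c", 1)]


def fifo_order(items):
    """Queue: first in, first out."""
    q = deque(items)
    return [q.popleft()[0] for _ in range(len(q))]


def lifo_order(items):
    """Stack: last in, first out."""
    s = list(items)
    return [s.pop()[0] for _ in range(len(s))]


def priority_order(items):
    """Priority queue: larger priority value first, ties by arrival order."""
    heap = [(-prio, i, name) for i, (name, prio) in enumerate(items)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

For the sample list, fifo_order gives page-a, page-b, page-c; lifo_order gives page-c, page-b, page-a; priority_order gives page-b, page-c, page-a. A FIFO queue crawls roughly breadth-first, while a stack dives into the most recently discovered links first.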

Edit items (items.py):

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AppleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()

Create apple.py in the spiders folder; this is the main body of the crawler

# -*- coding: utf-8 -*-

import scrapy
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from ..items import AppleItem


class AppleSpider(RedisCrawlSpider):
    name = 'apple'
    # Redis key that the spider reads its start urls from
    redis_key = 'apple'

    # LinkExtractor sets the rule for which urls get crawled.
    # ([1-9]|10)$ matches pages 1 to 10; the character class [1-10]
    # would only match a single 0 or 1.
    rules = [
        Rule(LinkExtractor(allow=(r'/realtimenews/section/new/([1-9]|10)$',)),
             callback='parse_list', follow=True),
    ]

    def parse_list(self, response):
        domain = 'http://www.appledaily.com.tw'
        res = BeautifulSoup(response.text)
        for news in res.select('.rtddt'):
            list_title = news.h1.string.encode('utf-8')
            print list_title
            # Pass the url of each list entry to parse_detail,
            # which parses the title on the content page
            yield scrapy.Request(domain + news.select('a')[0]['href'], self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.text)
        title = res.find("h1", {"id": "h1"}).string
        print title.encode('utf-8')
        appleitem = AppleItem()
        appleitem['title'] = title
        return appleitem
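A quick way to check a LinkExtractor pattern before crawling is to try it with Python's re module. The sketch below compares the character class [1-10] with the alternation ([1-9]|10) against page urls built from the section url used in this post; the helper name matching_pages and the tested page range 0 to 11 are chosen only for illustration.

```python
import re

BASE = "http://www.appledaily.com.tw/realtimenews/section/new/"

# [1-10] is a character class: the range 1-1 plus the character 0.
char_class = re.compile(r"/realtimenews/section/new/[1-10]$")
# ([1-9]|10) is an alternation: one digit from 1 to 9, or the string 10.
alternation = re.compile(r"/realtimenews/section/new/([1-9]|10)$")


def matching_pages(pattern, pages=range(0, 12)):
    """Return the page numbers whose url matches the pattern."""
    return [n for n in pages if pattern.search(BASE + str(n))]
```

matching_pages(char_class) returns [0, 1], while matching_pages(alternation) returns the pages 1 through 10, which is what the rule is meant to follow.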

Copy the whole project from machine a to machine b

scp -r apple/ user@<machine b ip>:<your path>/apple/

Go into the project on both machines

cd apple

Run (enter this command on both machines)
scrapy crawl apple
Both crawlers now start and sit idle, because Redis has no url yet, so we need to push one in
On machine b, push the url into redis

redis-cli -h 10.2.0.10 lpush apple http://appledaily.com.tw/realtimenews/section/new/

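Then you can see both crawlers start crawling. Look closely at what gets printed: if the two machines are crawling different things, it worked!
You can also output json to verify
Just change the command to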

scrapy crawl apple -o apple.json

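The two machines will save different content. Alternatively, run one machine the first time and two machines the second time; merging the two files from the second run should give the same content as the file from the first run.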
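That comparison can be scripted. The sketch below assumes each output file is a JSON list of items with a title field, as defined in items.py; the function names load_titles and check_split are made up for illustration. Because the section is real-time news, titles can change between runs, so a small mismatch does not necessarily mean the setup is wrong.

```python
import json


def load_titles(json_text):
    """Return the set of titles in a JSON feed export (a list of items)."""
    return {item["title"] for item in json.loads(json_text)}


def check_split(single_run, machine_a, machine_b):
    """Compare a one-machine crawl with a two-machine crawl."""
    a = load_titles(machine_a)
    b = load_titles(machine_b)
    return {
        "overlap": a & b,  # titles crawled by both machines, ideally empty
        "same_coverage": (a | b) == load_titles(single_run),
    }
```

Read each apple.json with open(path).read() and pass the three strings in; an empty overlap together with same_coverage set to True means the urls were split between the machines without loss or duplication.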

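Closing notes

Notes:

RedisSpider does not need start_urls (a lot of tutorials claim to be distributed but never use RedisSpider, which is a bit baffling. Plain scrapy uses CrawlSpider; to enable Redis you need the Redis version of the spider, which is RedisCrawlSpider in apple.py)
You must set redis_key (in apple.py) so that the crawler reads the values stored under that key in Redis. Push the start urls to the same key with redis-cli -h 10.2.0.10 lpush key start_urls.
For the parsing I was a bit lazy and used BeautifulSoup. The official Scrapy documentation says plainly that BeautifulSoup is slow and recommends the built-in xpath and css selectors...
See also: scrapy_redis_example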

Scrapy and Redis distributed crawlers (Apple Daily as an example)

wutienyang
