Scrapy Basics

Installing Twisted

Windows

Download a prebuilt Twisted wheel from Unofficial Windows Binaries for Python Extension Packages and install it with pip, as sketched below.
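A minimal sketch, assuming Python 3.6 on 64-bit Windows and that the wheel has been downloaded into the current directory (the exact filename depends on your Python version and architecture):

$ pip3 install Twisted-18.7.0-cp36-cp36m-win_amd64.whl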

CentOS

$ cd /usr/local
$ wget https://twistedmatrix.com/Releases/Twisted/18.7/Twisted-18.7.0.tar.bz2
$ yum install -y bzip2
$ tar -jxvf Twisted-18.7.0.tar.bz2
$ cd Twisted-18.7.0/
$ python3 setup.py install

Installing Scrapy

CentOS

$ pip3 install scrapy
# or install from the Douban PyPI mirror
$ pip3 install -i https://pypi.douban.com/simple scrapy

MacOS

$ xcode-select --install
$ pip3 install scrapy

scrapy shell

$ scrapy shell url
# enable interactive debugging (drop into pdb on an unhandled exception)
$ scrapy shell url --pdb
>>> response.xpath('//h1/text()').extract()
>>> response.css('.ad-price').xpath('text()').re('[.0-9]+')

Creating a project

$ scrapy startproject projectName
# example
$ scrapy startproject scrapy1
$ cd scrapy1
$ tree
.
├── scrapy.cfg
└── scrapy1
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

4 directories, 7 files

The three most important files (a sketch of items.py follows below; pipelines.py and settings.py are covered in later sections):

  • items.py
  • pipelines.py
  • settings.py
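items.py declares the structure of the data a spider extracts. A minimal sketch, with an item name and fields made up for this example:

# items.py
import scrapy

class PropertyItem(scrapy.Item):
    # one Field per piece of data the spider scrapes
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()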

Creating a spider

List the available spider templates

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Create a spider from a template

$ scrapy genspider -t <templateName> <spiderName> <domain>
# example
$ scrapy genspider basic web
$ tree
.
├── scrapy.cfg
└── scrapy1
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-36.pyc
    │   └── settings.cpython-36.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   └── __init__.cpython-36.pyc
        └── basic.py

4 directories, 11 files

Debugging a spider

$ scrapy parse --spider=basic url
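scrapy parse fetches the given URL and runs it through the spider. Assuming the basic spider from above, -c selects the callback and -d the depth to which requests are followed (both values are illustrative):

$ scrapy parse --spider=basic -c parse -d 2 url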

Running Scrapy

# run a self-contained spider file (no project needed) and export the items
$ scrapy runspider quotes_spider.py -o quotes.json

Saving the output

# the export format is inferred from the file extension (JSON, JSON Lines, CSV)
$ scrapy crawl basic -o item.json
$ scrapy crawl basic -o item.jl
$ scrapy crawl basic -o item.csv

Other options

# stop the spider after 90 items have been scraped
-s CLOSESPIDER_ITEMCOUNT=90
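The -s flag overrides any setting from the command line and is combined with crawl; for example, with the basic spider from above (the output filename is illustrative):

$ scrapy crawl basic -o item.json -s CLOSESPIDER_ITEMCOUNT=90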

Creating contracts

Contracts are a bit like unit tests designed for spiders: they let you quickly find out where something fails at runtime.
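Contracts are written in a callback's docstring and executed with scrapy check <spiderName>. A minimal sketch based on the basic spider created above (the URL, the expected counts and the field name are placeholders):

# basic.py
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'

    def parse(self, response):
        """ Parse a sample page and verify the result.

        @url http://www.example.com/some/page
        @returns items 0 16
        @returns requests 0 0
        @scrapes title
        """
        yield {'title': response.css('title::text').extract_first()}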

Configuration files

scrapy.cfg

settings.py

Configuring the User-Agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'

Disabling robots.txt compliance

# default: True
ROBOTSTXT_OBEY = False

Configuring concurrency

# default: 16
CONCURRENT_REQUESTS = 32

Configuring the download delay

# default: 0
DOWNLOAD_DELAY = 0.5

Configuring per-domain concurrency

CONCURRENT_REQUESTS_PER_DOMAIN = 16

Configuring per-IP concurrency

CONCURRENT_REQUESTS_PER_IP = 16

Enabling/disabling cookies

# default: True
COOKIES_ENABLED = False

Enabling/disabling the Telnet console

# default: True
TELNETCONSOLE_ENABLED = False

Configuring default request headers

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Configuring spider middlewares

SPIDER_MIDDLEWARES = {
    'scrapysolution.middlewares.ScrapysolutionSpiderMiddleware': 543,
}

  • The smaller the number, the higher the priority
  • Values must not be identical

Configuring downloader middlewares

DOWNLOADER_MIDDLEWARES = {
    'scrapysolution.middlewares.ScrapysolutionDownloaderMiddleware': 543,
}
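A downloader middleware is just a class with hooks such as process_request. A minimal sketch of a User-Agent-rotating middleware (the class name and header values are made up for this example; it would be registered in DOWNLOADER_MIDDLEWARES as shown above):

# middlewares.py
import random

class RandomUserAgentMiddleware:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)',
    ]

    def process_request(self, request, spider):
        # set a User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # None lets Scrapy keep processing the request normally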

Configuring extensions

EXTENSIONS = {
    # a value of None disables the extension
    'scrapy.extensions.telnet.TelnetConsole': None,
}

Configuring item pipelines

ITEM_PIPELINES = {
    'scrapysolution.pipelines.ScrapysolutionPipeline': 300,
}
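Every item the spider yields passes through process_item of each enabled pipeline. A minimal sketch of pipelines.py (the validation rule and the field name are just an illustration):

# pipelines.py
from scrapy.exceptions import DropItem

class ScrapysolutionPipeline:
    def process_item(self, item, spider):
        # discard items that are missing a required field
        if not item.get('title'):
            raise DropItem('missing title')
        return item  # hand the item on to the next pipeline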

Configuring AutoThrottle (automatic rate limiting)

#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

Configuring the HTTP cache

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Scrapyd

Installation

Install the server

$ pip3 install scrapyd
$ scrapyd

Install the client

$ pip3 install scrapyd-client

Configuration

Configuration file path: /usr/local/lib/python3.6/site-packages/scrapyd/default_scrapyd.conf
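A few options that are commonly adjusted there (the values shown are believed to be the shipped defaults and may differ between versions; bind_address in particular must be set to 0.0.0.0 if scrapyd should accept connections from other machines):

[scrapyd]
bind_address = 127.0.0.1
http_port    = 6800
max_proc     = 0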

Management

Check the service status

$ curl http://localhost:6800/daemonstatus.json
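The response is JSON roughly of this shape (the values are illustrative):

{"node_name": "myhost", "status": "ok", "pending": 0, "running": 0, "finished": 0}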

Deploy a project

$ scrapyd-deploy localhost -p scrapysolution
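For scrapyd-deploy localhost to resolve, the project's scrapy.cfg needs a matching deploy target; a minimal sketch (the URL assumes scrapyd runs on the same machine):

[deploy:localhost]
url = http://localhost:6800/
project = scrapysolution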

Delete a project

$ curl http://localhost:6800/delproject.json -d project=scrapysolution

Schedule a job

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider

Cancel a job

$ curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444

List projects

$ curl http://localhost:6800/listprojects.json

List a project's versions

$ curl http://localhost:6800/listversions.json?project=myproject

List a project's spiders

$ curl http://localhost:6800/listspiders.json?project=myproject

List a project's jobs

$ curl http://localhost:6800/listjobs.json?project=myproject | python -m json.tool

Delete a project version

If no versions remain for the given project, the project itself is deleted as well.

$ curl http://localhost:6800/delversion.json -d project=myproject -d version=r99

scrapy-splash

Installing scrapy-splash

$ pip3 install scrapy-splash

Running a Splash instance

# pull the image
$ sudo docker pull scrapinghub/splash
# run the container (--restart and --memory are docker options and must come before the image name)
$ sudo docker run -p 8050:8050 --restart=always scrapinghub/splash --maxrss 4000
$ sudo docker run -d -p 8050:8050 --memory=4.5G --restart=always scrapinghub/splash --maxrss 4000
# Splash is now reachable on 0.0.0.0:8050 (http), 8051 (https) and 5023 (telnet).
# Disable private mode: with private mode enabled, WebKit localStorage does not work
# and no JavaScript shim can be provided for it.
$ sudo docker run -p 8050:8050 scrapinghub/splash --disable-private-mode
# variants with more memory or a pinned image version
$ sudo docker run -d -p 8050:8050 --memory=10G --restart=always scrapinghub/splash --maxrss 4000
$ sudo docker run -d -p 8050:8050 --memory=4.5G --restart=always scrapinghub/splash:3.1 --maxrss 4000

Configuring scrapy-splash

Key settings in settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapysolution.middlewares.UserAgentMiddleware': 726,
    'scrapysolution.middlewares.WanDouProxyMiddleware': 727,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
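The scrapy-splash README additionally recommends a Splash-aware duplicate filter, so that requests differing only in their Splash arguments are not collapsed into one:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'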

Using scrapy-splash

script = """
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(args.wait))
assert(splash:go(args.url))
assert(splash:wait(args.wait))
js = string.format("document.querySelector('#pager > a.pTag.last').click()")
splash:evaljs(js)
assert(splash:wait(args.wait))
return splash:html()
end
"""

historyListFullUrl = urljoin(base_url, historyListUrl)
yield scrapy.Request(historyListFullUrl, self.fetchCreateHistoryBySplash, dont_filter=True, meta={
'splash': {
'args': {
# set rendering arguments here
'html': 1,
'png': 1,
'lua_source': script,
'wait': 0.2,
'url': historyListFullUrl
},
'endpoint': 'execute'
}
})
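Equivalently, scrapy_splash provides a SplashRequest helper that builds the same splash meta for you. A sketch under the same assumptions (script and historyListFullUrl defined as above; SplashRequest adds the request URL to args automatically):

from scrapy_splash import SplashRequest

yield SplashRequest(
    historyListFullUrl,
    self.fetchCreateHistoryBySplash,
    endpoint='execute',
    dont_filter=True,
    args={'lua_source': script, 'wait': 0.2, 'html': 1, 'png': 1},
)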

Reclaiming memory

# ask the Splash instance to run garbage collection and free memory
$ curl -X POST http://localhost:8050/_gc
