- Install Scrapy
- Create a Scrapy project
- Write a spider
- Run Scrapy
- Define an Item to hold the scraped data
- Extract Items
Scrapy is a Python application framework for crawling websites and extracting structured data.
For details, see http://scrapy.org/
Documentation: http://doc.scrapy.org/en/latest/
Chinese documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/
Scrapy example project: https://github.com/lijiancheng0614/ScrapyExamples
Install Scrapy
With Python installed, simply enter the following command in the command line:
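```bash
# the standard way to install Scrapy and its dependencies is via pip
pip install Scrapy
```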
Create a Scrapy project
Create a new Scrapy project called `tutorial` by entering the following in the command line:
```bash
scrapy startproject tutorial
```
This creates a `tutorial` directory with the following layout:
```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
            ...
```
Write a spider
Create `tutorial/spiders/dmoz_spider.py` (or run `scrapy genspider dmoz dmoz.org` from the command line) with the following content:
```python
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each downloaded page to a local HTML file named after its URL.
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
```
Run Scrapy
From the `tutorial` directory, enter the following in the command line:
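```bash
# run the spider named "dmoz" defined above
scrapy crawl dmoz
```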
You should get output similar to this:
```
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)
```
You will also find the files `Resources.html` and `Books.html` in the `tutorial` directory.
Define an Item to hold the scraped data
Edit `tutorial/items.py` and enter the following:
```python
import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
```
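Items behave much like Python dictionaries. As a quick illustrative sketch (the values below are placeholders, not data from the crawl):

```python
item = DmozItem(title=['Example Book'])  # fields can be set when constructing the item
item['link'] = ['http://example.com/']   # or assigned afterwards, like dict keys
item.get('desc', [])                     # unset fields can be read back with a default
dict(item)                               # an Item converts cleanly to a plain dict
```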
Extract Items
There are many ways to extract data from web pages. Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors. For more information on selectors and other extraction mechanisms, see the Selectors documentation.
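Selectors are easiest to try out interactively. As a rough sketch (assuming the same dmoz Books URL used above), the Scrapy shell exposes the downloaded page as `response`:

```python
# started with: scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
response.xpath('//title/text()').extract()   # page title text, via an XPath expression
response.xpath('//ul/li/a/@href').extract()  # link hrefs from the list entries
response.css('ul li a::text').extract()      # the same idea, with a CSS expression
```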
Modify `tutorial/spiders/dmoz_spider.py`:
```python
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Each <li> under a <ul> is one directory entry; fill a DmozItem from it.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
```
Then, from the `tutorial` directory, enter the following in the command line:
```bash
scrapy crawl dmoz -o items.json
scrapy crawl dmoz -o items.csv
```
This produces the corresponding JSON and CSV files.
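Since `extract()` returns a list of strings, each exported entry is roughly of the following form (the values shown are placeholders, not real crawl results):

```
{"title": ["Example Book"], "link": ["http://example.com/"], "desc": ["An example description"]}
```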