# Distributed Crawler System

## Download & Installation

### Crawler

#### Install selenium

```bash
pip3 install selenium
```

#### Install MySQL and pymysql, then configure them
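
The README does not show the database configuration itself, so the following is only a minimal sketch of how the settings could be collected and checked once pymysql is installed (`pip3 install pymysql`). The `DCS_DB_*` environment variable names, the default user, and the `dcs` database name are placeholders chosen for illustration, not names taken from the project.

```python
import os


def load_db_config(env=os.environ):
    """Build pymysql connection settings from environment variables.

    The DCS_DB_* names and the defaults below are illustrative
    placeholders, not values used by the project.
    """
    return {
        "host": env.get("DCS_DB_HOST", "127.0.0.1"),
        "port": int(env.get("DCS_DB_PORT", "3306")),
        "user": env.get("DCS_DB_USER", "root"),
        "password": env.get("DCS_DB_PASSWORD", ""),
        "database": env.get("DCS_DB_NAME", "dcs"),
        "charset": "utf8mb4",
    }


def check_connection(config):
    """Return True if MySQL answers a trivial query with these settings."""
    import pymysql  # imported lazily so load_db_config works without it

    connection = pymysql.connect(**config)
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone() == (1,)
    finally:
        connection.close()
```

With a MySQL server running, `check_connection(load_db_config())` should return `True` when the credentials are valid, which is a quick way to confirm the setup before starting the crawler.
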
#### Download the Edge browser driver (WebDriver)

https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

[Image]

To find your browser version, go to Browser --> Settings --> About Microsoft Edge --> Version information. It must match the driver version on the download page above. The browser icon must match as well: it is the one with green in it (the Chromium-based Edge).

[Image]
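
The version match described above can also be checked in code. This is a small sketch, not part of the project: the helper names are hypothetical, and comparing only the major version by default is an assumption (pass `parts=4` for an exact match). The version strings you pass in are whatever the About page and the driver download page show.

```python
def version_parts(version):
    """Split a dotted version such as '100.0.1185.50' into integers."""
    return [int(part) for part in version.strip().split(".")]


def versions_match(browser_version, driver_version, parts=1):
    """Compare the first `parts` components of the two version strings.

    parts=1 checks only the major version; parts=4 requires an exact match.
    """
    browser = version_parts(browser_version)[:parts]
    driver = version_parts(driver_version)[:parts]
    return browser == driver
```
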
Place the downloaded driver executable in the `dcs/bin` directory.

You can test the setup with the following script:

```python
from time import sleep

from selenium import webdriver
from selenium.webdriver.edge.service import Service

driverfile_path = r'G:\Users\god\PycharmProjects\dcs\bin\msedgedriver.exe'
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Edge(service=Service(driverfile_path))

driver.get('https://www.baidu.com/')  # open a page to confirm the driver works

sleep(5)  # keep the window visible for a moment
driver.quit()  # close the browser and stop the driver process
```

Change the path above to match your own machine.
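
One way to avoid editing an absolute path on every machine is to derive the driver location from the project directory. The sketch below assumes the driver sits in `bin/` under the project root, as described earlier; `find_driver` is a hypothetical helper, not a function from the project.

```python
from pathlib import Path


def find_driver(project_root, name="msedgedriver.exe"):
    """Return the driver path under <project_root>/bin, or raise if missing."""
    driver = Path(project_root) / "bin" / name
    if not driver.is_file():
        raise FileNotFoundError(f"Edge driver not found at {driver}")
    return driver
```

From a script stored in the project root, `find_driver(Path(__file__).resolve().parent)` would return the driver path, and `str(...)` of that value can be passed to the Selenium `Service` object.
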
## Running

Run `main.py` with python3 to start five service threads (server, spider, user_process, requester, and communicate). The server side of the distributed crawler system then starts running and monitoring.
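
The contents of `main.py` are not shown here, so the following is only a sketch of how five named service threads could be started with the standard `threading` module. The thread names come from the description above; `run_service` is a placeholder loop and `start_services` is a hypothetical helper, neither taken from the project.

```python
import threading

SERVICE_NAMES = ["server", "spider", "user_process", "requester", "communicate"]


def run_service(name, started, stop_event):
    """Placeholder service loop: record the start, then idle until stopped."""
    started.append(name)
    stop_event.wait()


def start_services(stop_event):
    """Start one daemon thread per service; return the threads and start log."""
    started = []
    threads = []
    for name in SERVICE_NAMES:
        thread = threading.Thread(
            target=run_service,
            args=(name, started, stop_event),
            name=name,
            daemon=True,
        )
        thread.start()
        threads.append(thread)
    return threads, started
```
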
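Run `login.js` with node to start the web server. It accepts browser requests, communicates with the crawler server, and returns the results to the browser once they are available.

Then run `client.py` to start the client, which begins requesting crawl tasks. The server receives each task, distributes and executes it, combines the partial results, and finally returns the result to the client.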
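
The receive, distribute, execute, and combine flow described above can be illustrated with a small local sketch. It is an assumption-heavy stand-in: the real system distributes work between machines, while this example uses a local thread pool, and `split_task`, `crawl_chunk`, and `run_task` are hypothetical names with a fake crawl step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(urls, worker_count):
    """Deal the URLs round-robin into at most worker_count non-empty chunks."""
    chunks = [urls[i::worker_count] for i in range(worker_count)]
    return [chunk for chunk in chunks if chunk]


def crawl_chunk(chunk):
    """Stand-in for a real crawl: map each URL to a fake result string."""
    return {url: f"result for {url}" for url in chunk}


def run_task(urls, worker_count=3):
    """Split a task, run the chunks in parallel, and merge the partial results."""
    combined = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        for partial in pool.map(crawl_chunk, split_task(urls, worker_count)):
            combined.update(partial)
    return combined
```

Calling `run_task(["a", "b", "c"])` returns one dictionary with an entry per input URL, which mirrors the step where the server combines partial results before replying to the client.
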
## Screenshots

[Image: image-20220421204241089]

[Image: image-20220421204341598]

[Image: image-20220421204402347]

## Project Structure Diagram

[Image: image-20220421204402357]

## Server Run Log

> https://code.educoder.net/attachments/entries/get_file?download_url=https://code.educoder.net/api/p3t2ja9zs/dcs/raw?filepath=dcs/dcs.log&ref=master

## Changelog

### V1.0

The basic framework is in place, and the core P2P-like mechanism is implemented.