# Distributed Crawler System
## Download & Installation
### Crawler
#### Install selenium
```bash
pip3 install selenium
```
#### Install and configure MySQL and pymysql
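The README does not show the database configuration itself. As a minimal sketch, assuming pymysql has been installed with `pip3 install pymysql` and a MySQL server is reachable, the connection settings could be collected as below. The environment variable names (`DCS_DB_HOST` and so on), the defaults, and the database name `dcs` are hypothetical placeholders, not values taken from the project.

```python
import os


def mysql_settings(env=os.environ):
    """Build pymysql connection keyword arguments from environment variables.

    The variable names (DCS_DB_HOST, ...) and the defaults are hypothetical.
    """
    return {
        "host": env.get("DCS_DB_HOST", "127.0.0.1"),
        "port": int(env.get("DCS_DB_PORT", "3306")),
        "user": env.get("DCS_DB_USER", "root"),
        "password": env.get("DCS_DB_PASSWORD", ""),
        "database": env.get("DCS_DB_NAME", "dcs"),
        "charset": "utf8mb4",
    }


def connect():
    """Open a connection; needs pymysql installed and a running MySQL server."""
    import pymysql  # imported lazily so mysql_settings() works without it

    return pymysql.connect(**mysql_settings())
```

Calling `connect()` returns a pymysql connection object; keeping the password in an environment variable avoids committing it to the repository.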
#### Download the Edge browser driver (WebDriver)
https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
![img](https://img-blog.csdnimg.cn/20201014171452760.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3RrMTAyMw==,size_16,color_FFFFFF,t_70)
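In the browser, go to Settings --> About Microsoft Edge --> version information, and download the driver whose version matches it (the browser icon must match as well: it is the one with green in it).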
![img](https://img-blog.csdnimg.cn/20201014171642418.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3RrMTAyMw==,size_16,color_FFFFFF,t_70)
Place the downloaded driver executable in the dcs/bin directory.
You can test it with the following script:
```python
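from time import sleep

from selenium import webdriver
from selenium.webdriver.edge.service import Service

# Path to the downloaded msedgedriver binary; adjust it for your machine.
driverfile_path = r'G:\Users\god\PycharmProjects\dcs\bin\msedgedriver.exe'
# Selenium 4 takes the driver path through a Service object; the older
# executable_path keyword was deprecated and later removed in 4.x releases.
driver = webdriver.Edge(service=Service(driverfile_path))
driver.get(r'https://www.baidu.com/')
sleep(5)  # keep the window open briefly so the page can be seen
driver.quit()  # close the browser and stop the driver process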
```
Change the path above to match your own machine.
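One way to avoid hard-coding an absolute path is to resolve the driver relative to the project directory. The helper below is only a sketch: the function name `find_edge_driver` is hypothetical, and it assumes the driver sits in the `bin` folder of the project root, as described above.

```python
import sys
from pathlib import Path


def find_edge_driver(project_root):
    """Return the path of msedgedriver inside <project_root>/bin.

    Raises FileNotFoundError if the driver has not been placed there yet.
    """
    # The Windows download is msedgedriver.exe; other platforms have no suffix.
    name = "msedgedriver.exe" if sys.platform.startswith("win") else "msedgedriver"
    driver = Path(project_root) / "bin" / name
    if not driver.is_file():
        raise FileNotFoundError(f"Edge driver not found at {driver}")
    return driver
```

The result can be passed to Selenium as `Service(str(find_edge_driver(project_root)))`, where `project_root` is the path of the dcs checkout.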
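## Running
Run main.py with python3 to start the five service threads: server, spider, user_process, requester, and communicate. The server side of the distributed crawler system then begins running and monitoring.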
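The README does not show how main.py starts these threads. The following is a minimal sketch of one common pattern, using Python's `threading` module: only the five service names come from the text above, while `run_service`, `start_services`, and the idle loop bodies are hypothetical placeholders rather than the project's actual code.

```python
import threading
import time

# Service names come from the README; the bodies below are placeholders.
SERVICE_NAMES = ["server", "spider", "user_process", "requester", "communicate"]


def run_service(name, stop_event, started):
    """Placeholder service loop: record startup, then idle until asked to stop."""
    started.append(name)  # list.append is thread-safe in CPython
    while not stop_event.is_set():
        stop_event.wait(0.1)  # a real service would handle its work here


def start_services(stop_event, started):
    """Start one daemon thread per service and return the thread objects."""
    threads = []
    for name in SERVICE_NAMES:
        t = threading.Thread(target=run_service,
                             args=(name, stop_event, started),
                             name=name, daemon=True)
        t.start()
        threads.append(t)
    return threads


if __name__ == "__main__":
    stop = threading.Event()
    started = []
    workers = start_services(stop, started)
    time.sleep(0.5)  # let the services run briefly
    stop.set()       # signal every service to shut down
    for t in workers:
        t.join()
    print(sorted(started))
```

A shared `threading.Event` gives all services a single shutdown signal, so the main thread can stop them cleanly instead of killing the process.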
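Run login.js with node to start the web server. It accepts browser requests, communicates with the crawler server to obtain results, and returns them to the browser.
Then run client.py to start the client and begin requesting crawl tasks. The server receives, distributes, executes, and combines the tasks, and finally returns the result to the client.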
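The receive, distribute, execute, and combine flow can be illustrated with a small local sketch. It is not the project's protocol: worker threads stand in for remote crawler nodes, and `split_task`, `crawl_chunk`, and `run_task` are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(items, n_workers):
    """Split a list of work items into at most n_workers round-robin chunks."""
    chunks = [items[i::n_workers] for i in range(n_workers)]
    return [c for c in chunks if c]  # drop empty chunks


def crawl_chunk(chunk):
    """Placeholder for the work one crawler node would do on its chunk."""
    return [f"result for {item}" for item in chunk]


def run_task(items, n_workers=3):
    """Distribute the chunks, run them in parallel, and merge the partial results."""
    chunks = split_task(items, n_workers)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(crawl_chunk, chunks))  # map preserves chunk order
    return [r for part in partials for r in part]
```

In the real system each chunk would be sent to a separate node over the network; the merge step at the end corresponds to combining partial results before replying to the client.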
## Screenshots
![image-20220421204241089](https://code.educoder.net/repo/p3t2ja9zs/dcs/raw/branch/master/docs/pictures/server_start.png)
![image-20220421204341598](https://code.educoder.net/repo/p3t2ja9zs/dcs/raw/branch/master/docs/pictures/server_running.png)
![image-20220421204402347](https://code.educoder.net/repo/p3t2ja9zs/dcs/raw/branch/master/docs/pictures/client_result.png)
## Project Structure Diagram
![image-20220421204402357](https://code.educoder.net/repo/p3t2ja9zs/dcs/raw/branch/master/docs/pictures/CRAWL_SERVER.jpg)
## Server Run Log
> https://code.educoder.net/attachments/entries/get_file?download_url=https://code.educoder.net/api/p3t2ja9zs/dcs/raw?filepath=dcs/dcs.log&ref=master
## Changelog
### V1.0
The basic framework is in place, and the core P2P-like mechanism is implemented.