# milkSpider
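selenium + redis + distributed crawling + xpath + etree + visualization
Goal: crawl the milk products on sale on JD.com (京东), collecting each product's name, description, price, and related review data. The project also aims to show price trends and selected positive reviews using Python visualization, and to run the crawl automatically as a scheduled task.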
![image-20220410095017421](README [Image]/image-20220410095017421.png)
![image-20220410095022817](README [Image]/image-20220410095022817.png)
## TODO
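- [x] Initialize the selenium framework, write the crawl rules, and get small-scale crawling working
- [ ] Consider user-agent, IP pool, cookie, and token handling to support larger-scale crawling
- [ ] Crawl historical prices from a price-history page, compare them, and show the price trend
- [x] Add a Redis-based distributed design
- [ ] Data visualization
- [ ] Scheduled, automated crawling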
## project
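### Project directory
> Selesium
>
> > downloader.py: the downloader, which fetches page content
> >
> > middlewares.py: configures distributed crawling, threads, and Redis
> >
> > pipelines.py: processes the scraped data and stores it in the corresponding files
> >
> > milkSpider.py: the main file; configures crawl settings, automation, and so on
> >
> > items.py: to be decided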
### selenium
Configure the downloader, which uses selenium to simulate normal browsing behavior in a real browser.
```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    """Load url in headless Chrome and return the rendered HTML as a string."""
    init = Options()
    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument('disable-cache')
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3') # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
    init.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
    driver = webdriver.Chrome(options=init) # `options=` replaces the deprecated `chrome_options=`
    driver.implicitly_wait(10)
    try:
        driver.get(url)
        # Re-serialize the page through lxml to get normalized, pretty-printed HTML
        response = etree.HTML(driver.page_source)
        response = etree.tostring(response, encoding="utf-8", pretty_print=True, method="html")
        return response.decode('utf-8')
    finally:
        driver.quit() # quit() also ends the chromedriver process; close() only closes the window
```
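Once `getsource` returns the page HTML, the `xpath` and `etree` parts of the stack take over. The sketch below shows one way to pull names and prices out of a page with lxml. The markup in `SAMPLE_HTML`, the class names, and the `parse_items` helper are all made up for illustration; the real page structure on JD differs and has to be inspected in the browser first.

```python
from lxml import etree

# Hypothetical markup standing in for a product list page.
SAMPLE_HTML = """
<ul id="goods">
  <li class="item"><div class="name">Milk A 250ml</div><div class="price">59.90</div></li>
  <li class="item"><div class="name">Milk B 1L</div><div class="price">12.50</div></li>
</ul>
"""

def parse_items(page_source):
    """Return one dict per product node found in the HTML string."""
    tree = etree.HTML(page_source)
    items = []
    for node in tree.xpath('//li[@class="item"]'):
        # string(...) collapses the matched node to its text content
        name = node.xpath('string(./div[@class="name"])').strip()
        price = node.xpath('string(./div[@class="price"])').strip()
        items.append({"name": name, "price": float(price)})
    return items

if __name__ == "__main__":
    for item in parse_items(SAMPLE_HTML):
        print(item)
```

In the real crawler the argument would be the string returned by `getsource(url)`, and the XPath expressions would be replaced by ones matching the actual page.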
## Installation and initialization
### GIT
```powershell
# Install git
winget install --id Git.Git -e --source winget
## Or download it from the official site:
## https://git-scm.com/download/win
# To use git in PowerShell
vim $PROFILE
## Edit the relevant lines to: $GITPATH = "~/Git/cmd/git.exe"
## Set-Alias git $GITPATH
git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v # show remote repository information
touch .gitignore # create the file that controls which files are excluded from upload
git add *.py # stage the local files to be pushed
git commit -m "update" # commit the staged files
git push -u origin master # push; if it fails, -f forces it (note: this can cause irreversible loss)
```
### selenium
Install
```powershell
# Install selenium
pip3 install selenium
# Show the installed package information
pip3 show selenium
```
Imports and setup when using it
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
# chrome_options.add_argument('lang=zh_CN.UTF-8') # set the language to Chinese
chrome_options.add_argument('--headless') # run without a UI
chrome_options.add_argument('--no-sandbox') # works around the "DevToolsActivePort file doesn't exist" error
chrome_options.add_argument('--disable-gpu') # disable GPU hardware acceleration; if no software renderer is in place, the GPU process will not start
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--window-size=1920,1080') # set the window width and height
driver = webdriver.Chrome(options=chrome_options) # chromedriver is looked up on PATH; `options=` replaces the deprecated `chrome_options=`
#driver = webdriver.Chrome()
url = ""
driver.get(url)
print(driver.page_source)
driver.quit()
```
Some notes
```python
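import html
from lxml import etree

# --- Handling a string ---
text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text, etree.HTMLParser()) # equivalent to HTML()
# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))
# Method 2: use utf-8 encoding
res = etree.tostring(html1, encoding="utf-8") # this method does not work for Chinese text used in tag attributes
print(res.decode('utf-8'))

# --- Handling a file ---
# Method 1: read the document with open() and handle it as a string
with open('test.html') as f:
    html1 = etree.HTML(f.read())
# The rest is the same as the two string-handling methods above
# Method 2: specify the encoding when reading the document with parse()
html1 = etree.parse('test.html', etree.HTMLParser(encoding='utf-8'))
# Specify the correct encoding here (the one the document actually uses); otherwise the output will be garbled
# The rest is the same as the two string-handling methods above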
```
Request headers, cookies, etc.
```python
# Visiting https://httpbin.org/get?show_env=1 returns the request information of the current browser
options.add_argument('lang=zh_CN.UTF-8')
```
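The TODO list mentions a user-agent pool as one step toward larger-scale crawling. A minimal sketch of that idea follows, assuming the pool is just a list of strings from which one is picked at random per browser session. The entries in `USER_AGENTS` are placeholders and `build_chrome_arguments` is a hypothetical helper, not part of this project; `--user-agent` and `--lang` are Chrome command-line switches.

```python
import random

# Placeholder values; substitute real browser user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]

def build_chrome_arguments(user_agents, lang="zh_CN.UTF-8"):
    """Pick one user agent at random and return Chrome switches for it."""
    user_agent = random.choice(user_agents)
    return [f"--user-agent={user_agent}", f"--lang={lang}"]

if __name__ == "__main__":
    for argument in build_chrome_arguments(USER_AGENTS):
        print(argument)
```

Each returned string can be passed to `chrome_options.add_argument(...)` before the driver is created, and the httpbin URL above can then be used to check which headers the browser actually sends.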
ChromeDriver
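Download [ChromeDriver](https://chromedriver.chromium.org/home) and put it in the current directory (if you put it in the Python root directory instead, you do not need to specify the chromedriver path when instantiating selenium).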
### Redis
[Introduction and configuration](C:\Users\wkyuu\Desktop\my\SQL\Redis\Redis - NoSql高速缓存数据库.md)
```python
# Install the redis module
## pip install redis
import redis
# Create a client instance
redisconn = redis.Redis(host = '127.0.0.1', port = 6379, password = 'x', db = 0)
# Results read from redis are bytes by default; pass decode_responses=True to get strings instead
```
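One common way to use Redis for distributed crawling is a shared URL queue plus a "seen" set, so that several workers can pull tasks without crawling the same page twice. The sketch below illustrates that pattern only; the key names `milk:queue` and `milk:seen` and both helper functions are assumptions, not taken from `middlewares.py`. `conn` is expected to be a `redis.Redis` client such as `redisconn` above, ideally created with `decode_responses=True`.

```python
def enqueue_url(conn, url, queue_key="milk:queue", seen_key="milk:seen"):
    """Add url to the shared queue unless it has been queued before."""
    # SADD returns 1 only when the member was not already in the set.
    if conn.sadd(seen_key, url):
        conn.lpush(queue_key, url)
        return True
    return False

def next_url(conn, queue_key="milk:queue"):
    """Pop the oldest queued URL, or return None when the queue is empty."""
    # LPUSH on one end and RPOP on the other gives first-in, first-out order.
    return conn.rpop(queue_key)
```

Each worker would loop on `next_url(redisconn)` and stop, or sleep and retry, when it returns `None`.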
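## Notes
Without threads, a full run over the five categories (30 x 10 x 5 = 1500 records in total) took 365 s.
With 5 threads, a full run over the five categories (1500 records) took 130 s.
With 16 threads, a full run over the five categories (1500 records) took 80 s.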
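
The timings above correspond to speedups of roughly 2.8x (365 s / 130 s) with 5 threads and 4.6x (365 s / 80 s) with 16 threads. The sketch below shows one way to fan page fetches out over a fixed number of threads. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library, which may differ from the threading code this project actually uses, and `fetch_page` is a stand-in that only sleeps instead of driving a browser.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    """Stand-in for the real downloader; sleeps instead of loading a page."""
    time.sleep(0.01)
    return f"<html>{url}</html>"

def crawl(urls, workers=5):
    """Fetch all urls with a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_page, urls))

def speedup(baseline_seconds, seconds):
    """Ratio of the single-threaded time to the multi-threaded time."""
    return baseline_seconds / seconds

if __name__ == "__main__":
    pages = crawl([f"page-{i}" for i in range(20)], workers=5)
    print(len(pages), "pages fetched")
    print(round(speedup(365, 130), 2), round(speedup(365, 80), 2))
```

Threads help here because most of the time is spent waiting on the browser and the network rather than on computation, which is consistent with the less-than-linear gains measured above.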
## Reference links
1. [selenium+python automation 100: setting up selenium on CentOS and launching Chrome in headless mode](https://www.cnblogs.com/yoyoketang/p/11582012.html)
2. [Fix: "'chromedriver' executable needs to be in PATH"](https://www.cnblogs.com/Neeo/articles/13949854.html)
3. [Python selenium-chrome: disabling log output](https://blog.csdn.net/wm9028/article/details/107536929)
4. [Python: writing a list to a CSV file line by line](https://blog.csdn.net/weixin_41068770/article/details/103145660)
5. [Using .gitignore in Git to exclude files from being pushed](https://blog.csdn.net/lk142500/article/details/82869018)
6. [Python 3: defining and using global variables across modules](https://codeantenna.com/a/9YbdOKrrSJ)
7. [Python multithreading](https://www.runoob.com/python/python-multithreading.html)
8. [Introduction to using redis in Python](https://www.runoob.com/w3cnote/python-redis-intro.html)
9. [python + redis: implementing distributed queue tasks](https://cloud.tencent.com/developer/article/1697383)
10. [Understanding the join() function in Python threads](https://www.linuxidc.com/Linux/2019-03/157795.htm)
11. [How to understand Python decorators - Zhihu](https://www.zhihu.com/question/26930016/answer/360300235)
12. [[Automation] Setting request headers in selenium](https://www.jianshu.com/p/419eb4e00963)
13. [https://blog.csdn.net/fox64194167/article/details/80542717](https://blog.csdn.net/fox64194167/article/details/80542717)