
# milkSpider

selenium + redis + distributed crawling + xpath + etree + visualization

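Task: crawl JD.com (京东) for the milk products currently on sale, collecting each product's name, description, price information, and review information. Report the corresponding price trends and a selection of positive reviews, presented with Python visualizations. The crawl runs automatically as a scheduled task.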

![image-20220410095017421](README [Image]/image-20220410095017421.png)

![image-20220410095022817](README [Image]/image-20220410095022817.png)

## TODO

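-   [x] Initialize the selenium framework, write the crawling rules, and get small-scale crawling working
-   [ ] Handle user-agent, IP pool, cookies, and tokens to support larger-scale crawling
-   [ ] Crawl historical prices from a price-history site, compare them, and report price trends
-   [x] Add a Redis-based distributed design
-   [ ] Data visualization
-   [ ] Scheduled, automated crawling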

## project

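### Project directory

>   Selesium
>
>   >   downloader.py	The downloader, i.e. the code that crawls the content
>   >
>   >   middlewares.py	Configuration for distribution, threads, and redis
>   >
>   >   pipelines.py	Processes the collected data and stores it in the corresponding files
>   >
>   >   milkSpider.py	Main file; configures crawl settings, automation, etc.
>   >
>   >   items.py	To be decided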

### selenium

Configure the downloader, using selenium to simulate normal browsing behavior in a real browser.

```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    """Load url in headless Chrome and return the page source as an HTML string."""
    init = Options()

    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument('disable-cache')
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3')    # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
    init.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])

    # Selenium 4 takes `options=`; the older `chrome_options=` keyword is deprecated
    driver = webdriver.Chrome(options=init)
    try:
        driver.implicitly_wait(10)
        driver.get(url)

        # Round-trip through lxml to normalize and pretty-print the markup
        response = etree.HTML(driver.page_source)
        response = etree.tostring(response, encoding="utf-8", pretty_print=True, method="html")
        return response.decode('utf-8')
    finally:
        # quit() ends the whole browser session; close() only closes the current window
        driver.quit()
```
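
The string returned by `getsource` can be parsed again with `etree.HTML` and queried with XPath. The snippet below is a minimal sketch under assumptions: the sample markup, the class names `gl-item`, `p-name`, and `p-price`, and the helper `parse_items` are illustrative only and are not taken from the project's actual crawling rules.

```python
from lxml import etree

# Hypothetical sample markup; the real page structure may differ.
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="p-name"><em>Whole milk 250ml x 24</em></div>
    <div class="p-price"><i>59.90</i></div>
  </li>
  <li class="gl-item">
    <div class="p-name"><em>Skim milk 1L x 12</em></div>
    <div class="p-price"><i>89.00</i></div>
  </li>
</ul>
"""

def parse_items(page_source):
    """Extract (name, price) pairs from an HTML string using XPath."""
    tree = etree.HTML(page_source)
    items = []
    for node in tree.xpath('//li[@class="gl-item"]'):
        # string(...) returns the text content of the first matching node
        name = node.xpath('string(.//div[@class="p-name"]/em)').strip()
        price = node.xpath('string(.//div[@class="p-price"]/i)').strip()
        items.append((name, float(price)))
    return items

if __name__ == "__main__":
    for name, price in parse_items(SAMPLE_HTML):
        print(name, price)
```

Using relative paths (`.//`) inside the loop keeps each lookup scoped to one product node, so a missing field in one item does not shift values between items.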

## Installation and initialization

### GIT

```powershell
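# Install git
winget install --id Git.Git -e --source winget
## Or download it from the official site:
## https://git-scm.com/download/win
# To use git in PowerShell, edit your profile
vim $PROFILE
## and add the following lines (adjust the path to your installation):
## $GITPATH = "~/Git/cmd/git.exe"
## Set-Alias git $GITPATH

git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v	# show remote repository info
New-Item -ItemType File .gitignore	# create the file that controls which files are excluded from upload

git add *.py	# stage the local files you want to push
git commit -m "update"	# commit the staged changes
git push -u origin master	# push; if it fails, -f forces the push (warning: this can irreversibly overwrite remote history)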
```

### selenium

Installation

```powershell
# Install selenium
pip3 install selenium

# Show the installed package info
pip show selenium
```

Imports and setup used when calling selenium

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')  # run without a UI
chrome_options.add_argument('--no-sandbox')  # works around the "DevToolsActivePort file doesn't exist" error
chrome_options.add_argument('--disable-gpu')   # disable GPU hardware acceleration; if no software renderer is in place, the GPU process will not start
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--window-size=1920,1080')  # set the window width and height
# Selenium 4 takes `options=`; the positional driver path and `chrome_options=` are deprecated
driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.Chrome()
url = ""  # set this to the target URL before running
driver.get(url)
print(driver.page_source)
driver.quit()
```

Some notes

```python
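import html
from lxml import etree

text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text)  # parses as XML instead, so the input must be well-formed markup

# --- Handling strings ---
# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))

# Method 2: serialize with utf-8 encoding
res = etree.tostring(html1, encoding="utf-8")  # this approach does not work for Chinese text in tag attributes
print(res.decode('utf-8'))

# --- Handling files ---
# Method 1: read the document with open() and treat it as a string
with open('test.html', encoding='utf-8') as f:
    html1 = etree.HTML(f.read())
# then continue with either of the string-handling methods above

# Method 2: specify the encoding when reading the document with parse()
html1 = etree.parse('test.html', etree.HTMLParser(encoding='utf-8'))
# The encoding must match the document's actual encoding, otherwise the output will be garbled
# then continue with either of the string-handling methods above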

```

### ChromeDriver

Download [ChromeDriver](https://chromedriver.chromium.org/home) in the version that matches your installed Chrome, and place it in the Python root directory.

### Redis

[Introduction and configuration](C:\Users\wkyuu\Desktop\my\SQL\Redis\Redis - NoSql高速缓存数据库.md)

```python
# Install the redis module
## pip install redis
import redis

# Create a client instance
redisconn = redis.Redis(host='127.0.0.1', port=6379, password='x', db=0)

# Redis returns bytes by default; pass decode_responses=True to get strings instead
```
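
One common way to share crawl tasks through Redis is to use a list as a queue: producers `lpush` URLs and each worker `rpop`s the next one. The sketch below shows that pattern under assumptions and is not the project's `middlewares.py`: the key name `milkspider:urls`, the class `UrlQueue`, and the in-memory `FakeRedis` stand-in (included only so the example runs without a server) are all hypothetical.

```python
from collections import deque

class FakeRedis:
    """Hypothetical in-memory stand-in exposing only lpush/rpop/llen."""
    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        lst = self._lists.setdefault(name, deque())
        for value in values:
            lst.appendleft(value)
        return len(lst)

    def rpop(self, name):
        lst = self._lists.get(name)
        return lst.pop() if lst else None

    def llen(self, name):
        return len(self._lists.get(name, ()))

class UrlQueue:
    """FIFO task queue on top of a Redis list (push left, pop right)."""
    def __init__(self, client, key="milkspider:urls"):
        self.client = client
        self.key = key

    def put(self, *urls):
        return self.client.lpush(self.key, *urls)

    def get(self):
        # Returns None when the queue is empty
        return self.client.rpop(self.key)

    def __len__(self):
        return self.client.llen(self.key)

queue = UrlQueue(FakeRedis())
queue.put("https://example.com/page/1", "https://example.com/page/2")
first = queue.get()  # the URL that was pushed first
```

Against a real server, the same class should work with `UrlQueue(redis.Redis(host='127.0.0.1', port=6379, db=0, decode_responses=True))`; without `decode_responses=True`, `get()` returns bytes rather than strings.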

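## Notes

Before threads were introduced, a full run over five categories (30 x 10 x 5 = 1500 records) took 365 s.

With 5 threads, the same full run of 1500 records took 130 s.

With 16 threads, the same full run of 1500 records took 80 s.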
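
These timings correspond to speedups of roughly 2.8x with 5 threads (365 / 130) and 4.6x with 16 threads (365 / 80). The snippet below is a minimal sketch of fanning page fetches out over a fixed number of threads. It uses `concurrent.futures.ThreadPoolExecutor` as a stand-in, which may not be how the project itself manages threads, and `fetch_page` is a hypothetical placeholder that sleeps instead of driving a browser.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    """Hypothetical placeholder for one crawl step; sleeps to mimic I/O wait."""
    time.sleep(0.05)
    return f"page-{page}"

def crawl(pages, workers):
    """Run fetch_page over all pages with a fixed-size thread pool.

    Returns the results (in input order) and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch_page, pages))
    return results, time.perf_counter() - start

results, elapsed_serial = crawl(range(10), workers=1)
_, elapsed_pooled = crawl(range(10), workers=5)
print(f"1 worker: {elapsed_serial:.2f}s, 5 workers: {elapsed_pooled:.2f}s")
```

Threads help here because each task spends most of its time waiting rather than computing. The measured speedups above are well below the thread counts, so adding more threads gives diminishing returns.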

## Reference links

1. [selenium+python automation 100: setting up selenium on CentOS to launch Chrome in headless mode](https://www.cnblogs.com/yoyoketang/p/11582012.html)

2. [Fixing the "'chromedriver' executable needs to be in PATH" problem](https://www.cnblogs.com/Neeo/articles/13949854.html)

3. [Python selenium-chrome: disabling log output](https://blog.csdn.net/wm9028/article/details/107536929)

4. [Python: writing a list to a CSV file row by row](https://blog.csdn.net/weixin_41068770/article/details/103145660)

5. [Using .gitignore in Git to exclude files from being pushed](https://blog.csdn.net/lk142500/article/details/82869018)

6. [Python 3: defining and using global variables across modules](https://codeantenna.com/a/9YbdOKrrSJ)

7. [Python multithreading](https://www.runoob.com/python/python-multithreading.html)

8. [Introduction to using redis in Python](https://www.runoob.com/w3cnote/python-redis-intro.html)

9. [python + redis: implementing distributed queue tasks](https://cloud.tencent.com/developer/article/1697383)

10. [A deep dive into the join() function for Python threads](https://www.linuxidc.com/Linux/2019-03/157795.htm)
