# milkSpider

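selenium + redis + distributed crawling + xpath + etree + visualization

Goal: crawl the product name, description, and price information for the various milk products on sale on JD.com (京东), show the price fluctuation trend for each item using Python visualization, and run the crawl automatically as a scheduled task.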

## TODO

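-   [x] Initialize the selenium framework, write the crawling rules, and get a small-scale crawl working
-   [x] Crawl historical prices from the price-history page
-   [x] Add a Redis-based distributed design
-   [x] Data visualization
    -   [x] Two planned modes (terminal interaction): pick an item at random or by review count, then show its details, such as the price trend
        -   [x] Directory selection with a friendly interactive experience
        -   [x] Choice of the main reference metric (price or reviews)
-   [ ] Package as an exe with Python; is a GUI needed?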

## project

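### Project layout

>   Selesium
>
>   >   downloader.py	The downloader, which crawls the content
>   >
>   >   middlewares.py	Configures the distributed setup, threads, and Redis
>   >
>   >   pipelines.py	Processes the collected data and stores it in the corresponding files
>   >
>   >   milkSpider.py	Main file; configures crawl settings, automation, and so on
>   >
>   >   historyPrice.py	Crawls historical prices
>   >
>   >   view.py	Reads and parses the data, configures the visualization
>   >
>   >   settings.py	Main configuration file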

## Installation and setup

### GIT

```powershell
# Install git
winget install --id Git.Git -e --source winget
## Or download it from the official site
https://git-scm.com/download/win
# To use git inside PowerShell
vim $PROFILE
## Edit the relevant lines to: $GITPATH = "~/Git/cmd/git.exe"
## Set-Alias git $GITPATH

git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v	# show remote repository info
touch .gitignore	# create the file that controls which files are ignored

git add *.py	# stage the local files to be pushed
git commit -m "update"	# create a commit first
git push -u origin master	# push; if it fails, add -f (note: this can cause irreversible loss)
```

### selenium

Configure the downloader, using selenium to simulate normal browser behavior.

Installation

```powershell
# Install selenium
pip3 install selenium

# Show package information
pip show selenium
```

Usage

```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    init = Options()

    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument("disable-cache")
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3')    # INFO = 0 WARNING = 1 LOG_ERROR = 2 LOG_FATAL = 3 default is 0
    init.add_experimental_option("excludeSwitches",['enable-automation','enable-logging'])

    driver = webdriver.Chrome(options = init)    # Selenium 4 expects options=; chrome_options= is deprecated
    driver.implicitly_wait(10)
    driver.get(url)

    response = etree.HTML(driver.page_source)
    response = etree.tostring(response, encoding = "utf-8", pretty_print = True, method = "html")
    response = response.decode('utf-8')

    driver.quit()    # quit() closes the window and also ends the chromedriver process
    return response
```

Some notes

```python
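import html
from lxml import etree

text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text) # parses XML instead, so it needs well-formed markup

# Serializing a parsed string
# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))

# Method 2: use utf-8 encoding
res = etree.tostring(html1,encoding="utf-8") # this method does not work for Chinese used in tag attributes
print(res.decode('utf-8'))

# Parsing a file
# Method 1: read the document with open() and handle it as a string
with open('test.html') as f:
    html1 = etree.HTML(f.read())
# then continue with either of the two string methods above

# Method 2: specify the encoding when reading the document with parse()
html1 = etree.parse('test.html',etree.HTMLParser(encoding='utf-8'))
# Specify the correct encoding (the one the document actually uses), otherwise the output will be garbled
# then continue with either of the two string methods above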

```
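
The `html.unescape()` approach works because `etree.tostring()` serializes to ASCII by default, so non-ASCII characters come out as numeric character references. The sketch below shows that round trip with the standard library only; `to_char_refs` is a hypothetical helper that merely imitates the escaping and is not part of lxml.

```python
import html

def to_char_refs(text):
    """Hypothetical helper: replace non-ASCII characters with numeric references."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

original = "this is test content;这是测试内容。"
escaped = to_char_refs(original)      # pure ASCII, like the default tostring() output
restored = html.unescape(escaped)     # numeric references turned back into characters

print(escaped)
print(restored == original)
```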

Request headers, cookies, and so on

```python
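import json

# Visiting https://httpbin.org/get?show_env=1 returns the current browser's request info
# `options` here is a selenium Options() instance
options.add_argument('lang=zh_CN.UTF-8')

# Save and load cookies with the json module
def getCookies():
    with open('cookies.json', 'r', encoding='utf-8') as fd:
        listCookies = json.loads(fd.read())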
    for cookie in listCookies:
        cookies = {
            'domain': cookie['domain'],
            'httpOnly': cookie['httpOnly'],
            'name':cookie['name'],
            'path':'/',
            'secure': cookie['secure'],
            'value':cookie['value'],
        }
        print(cookies)

def saveCookies(driver):
    jsonCookies = json.dumps(driver.get_cookies())
    with open('cookies.json', 'w', encoding='utf-8') as fd:
        fd.write(jsonCookies)

```
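
The helpers above save cookies and print them back. A minimal sketch of the reload step follows, assuming the file holds the list of dicts that `driver.get_cookies()` returns. `load_cookies` and `ALLOWED_KEYS` are hypothetical names, and the sample cookie is made up.

```python
import json
import os
import tempfile

# Keys kept for each cookie; mirrors the fields used in getCookies() above.
ALLOWED_KEYS = ("domain", "httpOnly", "name", "path", "secure", "value")

def load_cookies(path):
    """Read cookies saved as JSON and keep only the allowed keys."""
    with open(path, "r", encoding="utf-8") as fd:
        raw = json.load(fd)
    return [{k: c[k] for k in ALLOWED_KEYS if k in c} for c in raw]

# Made-up sample standing in for driver.get_cookies() output.
sample = [{"domain": ".example.com", "httpOnly": False, "name": "sid",
           "path": "/", "secure": True, "value": "abc", "sameSite": "Lax"}]

path = os.path.join(tempfile.mkdtemp(), "cookies.json")
with open(path, "w", encoding="utf-8") as fd:
    json.dump(sample, fd)

cookies = load_cookies(path)
print(cookies)
```

With a live driver that has already opened a page on the matching domain, each dict could then be passed to `driver.add_cookie(cookie)`.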

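ChromeDriver

Download [ChromeDriver](https://chromedriver.chromium.org/home) and put it in the current directory (if it is placed in the Python root directory, you do not need to specify the chromedriver path when instantiating selenium).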

### Matplotlib

[Python data visualization, an open-source alternative to MATLAB](https://www.runoob.com/numpy/numpy-matplotlib.html)

Install with pip: `pip install matplotlib`

```python
# Usage
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(1,11)
y = 2 * x + 5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x,y)
plt.show()
```

Switching fonts

```python
from matplotlib import pyplot as plt
import matplotlib

def getFont():  # list the available fonts
    font = sorted([f.name for f in matplotlib.font_manager.fontManager.ttflist])
    for i in font:
        print(i)
# getFont()
plt.rcParams['font.family'] = ['Microsoft YaHei']
```

### Pandas

```python
import pandas as pd
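# filename is the path to the CSV file; on_bad_lines needs pandas >= 1.3 and replaces error_bad_lines = False
df = pd.read_csv(filename, encoding = 'utf-8', header = 0, on_bad_lines = 'skip')

df.columns	# names of all column headers
df.xx	# get the column named xx
df['xx']	# same as above
df.sort_values(by = 'xx', ascending = True)	# sort by a column
df.loc[index]	# all data in row index
df.loc[index][index2]	# a single value from that row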
```


### Requests

An old classic

```python
import requests

headers = { "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586"}
url = ""

session = requests.Session()
res = session.get(url, headers = headers)
# print(res.request.headers)
res.encoding = res.apparent_encoding	# 'utf-8'
print(res.text)
```

### Regular expressions

```python
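import re

# Match a floating-point number (every part is optional, so this also matches an empty string)
reg = re.compile(r'[-+]?[0-9]*\.?[0-9]*')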
```
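
Because every part of the pattern above is optional, it also accepts an empty string or a lone sign. A stricter sketch follows, assuming the goal is to validate whole price strings; `FLOAT_RE` and `is_float` are hypothetical names.

```python
import re

# Require at least one digit: "12", "12.", "12.5", ".5", each with an optional sign.
FLOAT_RE = re.compile(r'[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)')

def is_float(text):
    """Return True only if the whole string is a decimal number."""
    return FLOAT_RE.fullmatch(text) is not None

for candidate in ["59.90", "-3", ".5", "", "+", "1.2.3"]:
    print(repr(candidate), is_float(candidate))
```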

### Threads

Multithreading, manual version

```python
import threading
import time

threadlines = 16    # use 16 threads by default; do not exceed 20
flag = 1    # marks the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

threads = []
for thread in range(threadlines):
    name = "thread " + str(thread)
    athread = threading.Thread(target = printTime, args = (name,))
    athread.start()
    threads.append(athread)

for thread in threads:	# join() blocks, so the main thread stays alive until every child thread has finished
    thread.join()
```

Thread lock

```python
import threading
import time

threadLock = threading.Lock()
threadlines = 16    # use 16 threads by default; do not exceed 20
flag = 1    # marks the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    newtime = str(time.ctime())
    print(name, newtime)
    threadLock.acquire()	# take the lock on the txt file (exclusive access)
    write2txt(name + " " + newtime + "\n")
    threadLock.release()	# release the lock (give up exclusive access)

def write2txt(name):
    with open('test.txt', 'a+', encoding = 'utf-8') as fd:
        fd.write(name)

threads = []
for thread in range(4):
    name = "thread " + str(thread)
    athread = threading.Thread(target = printTime, args = (name,))
    athread.start()
    threads.append(athread)

for thread in threads:	# join() blocks, so the main thread stays alive until every child thread has finished
    thread.join()
```

Thread pool (recommended)

```python
from concurrent.futures import ThreadPoolExecutor
import time

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

with ThreadPoolExecutor(max_workers = 10) as thread:
    for count in range(10):
        name = "thread" + str(count)
        task = thread.submit(printTime, name)	# pass the function and the arguments it needs
        print(task.done())	# whether the task has finished, bool
        print(task.result())	# return value of printTime above (None here); blocks until the task finishes
```
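
Calling `task.result()` right after `submit()` waits for that task, so the loop above effectively runs one task at a time. The sketch below submits everything first and collects results as they finish. `fetch` is a hypothetical stand-in for a real crawl function, and the short sleep only simulates I/O.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(name):
    """Hypothetical task: pretend to do I/O, then return a label."""
    time.sleep(0.1)
    return name + " done"

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # Submit all tasks before waiting on any of them.
    futures = [pool.submit(fetch, "thread" + str(i)) for i in range(10)]
    # as_completed yields each future as soon as it finishes.
    results = [f.result() for f in as_completed(futures)]
elapsed = time.time() - start

print(sorted(results))
print("elapsed: %.2fs" % elapsed)
```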

### Redis

```python
# Install the redis module
## pip install redis
import redis

# Create the client instance
redisconn = redis.Redis(host = '127.0.0.1', port = 6379, password = 'x', db = 0)

# Redis returns bytes by default; pass decode_responses=True to get strings instead
```
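
One way such a connection might serve as a shared task queue is sketched below. This is an assumption about the design, not the project's actual `middlewares.py`: `enqueue_url`, `next_url`, `InMemoryConn`, and the key names are all hypothetical. The two functions rely only on the Redis commands `sadd`, `lpush`, and `rpop`, and `InMemoryConn` is a small stand-in so the sketch runs without a server.

```python
QUEUE_KEY = "milk:queue"   # hypothetical key holding pending URLs
SEEN_KEY = "milk:seen"     # hypothetical key holding every URL ever queued

def enqueue_url(conn, url):
    """Queue a URL once; return True if it was new."""
    # SADD returns 1 when the member was added, 0 when it already existed.
    if conn.sadd(SEEN_KEY, url) == 1:
        conn.lpush(QUEUE_KEY, url)
        return True
    return False

def next_url(conn):
    """Pop the oldest queued URL, or None when the queue is empty."""
    return conn.rpop(QUEUE_KEY)

class InMemoryConn:
    """Stand-in exposing the same three methods, for running without a server."""
    def __init__(self):
        self.sets, self.lists = {}, {}
    def sadd(self, key, value):
        members = self.sets.setdefault(key, set())
        added = value not in members
        members.add(value)
        return int(added)
    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])
    def rpop(self, key):
        items = self.lists.get(key, [])
        return items.pop() if items else None

conn = InMemoryConn()
first = enqueue_url(conn, "https://example.com/item/1")
duplicate = enqueue_url(conn, "https://example.com/item/1")
second = enqueue_url(conn, "https://example.com/item/2")
popped = [next_url(conn), next_url(conn), next_url(conn)]
print(first, duplicate, second, popped)
```

With a running server, `conn` would instead be something like `redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)`, and every worker pointing at the same server would then share one queue.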

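## Notes

-   Without history lookup

Before using threads, a full run over five categories (30 x 10 x 5 = 1500 records) took 365s

With 5 threads, a full run over five categories (1500 records) took 130s

With 16 threads, a full run over five categories (1500 records) took 80s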
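
The speedups implied by these timings can be worked out directly. A small sketch using only the three measurements above; the variable names are illustrative, and the unthreaded run is treated as one thread.

```python
# Elapsed seconds for 1500 records without history lookup, keyed by thread count.
# The run without threads is recorded here as 1 thread.
timings = {1: 365, 5: 130, 16: 80}
baseline = timings[1]
records = 30 * 10 * 5

for threads, seconds in timings.items():
    speedup = baseline / seconds       # how many times faster than the baseline
    per_record = seconds / records     # average seconds spent per record
    print(f"{threads:>2} threads: {seconds}s, "
          f"speedup {speedup:.2f}x, {per_record:.3f}s per record")
```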


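-   With history lookup

Without a thread pool, a full run over 1500 records took a very long time

With a thread pool, a full run over 1500 records took 544s


-   Known issues
    -   On non-Windows systems, the visualization window fails to find the font. The fix is to change the font in settings.py to one installed on your operating system. The getFont method in view.py lists all fonts available on the current system.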
