# milkSpider

selenium + redis + distributed + xpath + etree + visualization

Task: scrape the product name, description, price information, and review information for every kind of milk product on sale on JD.com (京东). Present the price trends and selected positive reviews using Python visualization. Run the scrape automatically as a scheduled task.

## TODO

- [x] Initialize the selenium framework, write the scraping rules, and get small-scale scraping working
- [x] Scrape historical prices from a price-history web page
- [x] Add the Redis distributed design
- [x] Data visualization
- [ ] Two planned modes (terminal interaction): pick items at random or by review count, then show the details of the selected item, such as its price trend
- [ ] Directory selection with a friendly interactive experience
- [ ] Item selection mode (hot reviews lists the top five; random picks one at random)
- [ ] Package as an exe with Python; is a graphical interface needed?

## project

### Project directory

> Selesium
>
> > downloader.py: the downloader, i.e. fetches the content
> >
> > middlewares.py: configures distribution, threads, and redis
> >
> > pipelines.py: processes the scraped data and stores it in the corresponding files
> >
> > milkSpider.py: main file; configures scrape settings, automation, etc.
> >
> > historyPrice.py: scrapes historical prices
> >
> > view.py: reads and parses the data; configures the visualization
> >
> > settings.py: main configuration file

## Installation and initialization

### GIT

```powershell
# Install git
winget install --id Git.Git -e --source winget
## Or download it from the official site https://git-scm.com/download/win

# To use it in powershell
vim $PROFILE
## Edit the corresponding entries to: $GITPATH = ~/Git/cmd/git.exe
## Set-Alias git $GITPATH

git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v # show remote repository information
touch .gitignore # create the file that controls what is excluded from upload
git add *.py # stage the local content to be pushed
git commit -m "update" # create a commit first
git push -u origin master # push; on error use -f (note: this causes irreversible loss)
```

### selenium

Configure the downloader, using selenium to simulate normal browser behavior.

Install

```powershell
# Install selenium
pip3 install selenium
# Show package information
pip show selenium
```

Usage

```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    init = Options()
    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument("disable-cache")
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3') # INFO = 0 WARNING = 1 LOG_ERROR = 2 LOG_FATAL = 3 default is 0
    init.add_experimental_option("excludeSwitches",['enable-automation','enable-logging'])
    driver = webdriver.Chrome(options = init) # the older chrome_options keyword is deprecated
    driver.implicitly_wait(10)
    driver.get(url)
    response = etree.HTML(driver.page_source)
    response = etree.tostring(response, encoding = "utf-8", pretty_print = True, method = "html")
    response = response.decode('utf-8')
    driver.quit() # quit() closes the browser and ends the chromedriver process
    return response
```

Some notes

```python
import html
from lxml import etree

# --- Handling a string ---
text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text) # similar to HTML()

# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))

# Method 2: use utf-8 encoding
res = etree.tostring(html1,encoding="utf-8") # this method does not work for Chinese attribute values on tags
print(res.decode('utf-8'))

# --- Handling a document ---
# Method 1: read the document with open() and handle it as a string
with open('test.html') as f:
    html1 = etree.HTML(f.read())
# The rest is the same as the two string-handling methods above

# Method 2: specify the encoding when parse() reads the document
html1 = etree.parse('test.html',etree.HTMLParser(encoding='utf-8'))
# The encoding given here must match the document's actual encoding, otherwise the output will be garbled
# The rest is the same as the two string-handling methods above
```

Request headers, cookies, etc.

```python
import json

# Visiting https://httpbin.org/get?show_env=1 returns the request information of the current browser
# options is the Options() instance passed to the driver
options.add_argument('lang=zh_CN.UTF-8')

# Saving and loading cookies with the json module
def getCookies():
    with open('cookies.json', 'r', encoding='utf-8') as fd:
        listCookies = json.loads(fd.read())
    for cookie in listCookies:
        cookies = {
            'domain': cookie['domain'],
            'httpOnly': cookie['httpOnly'],
            'name':cookie['name'],
            'path':'/',
            'secure': cookie['secure'],
            'value':cookie['value'],
        }
        print(cookies)

def saveCookies(driver):
    jsonCookies = json.dumps(driver.get_cookies())
    with open('cookies.json', 'w', encoding='utf-8') as fd:
        fd.write(jsonCookies)
```

ChromeDriver

Download [ChromeDriver](https://chromedriver.chromium.org/home) and place it in the current directory (if it is placed in the Python root directory, you do not need to specify the chromedriver path when instantiating selenium).

### Matplotlib

[Python data visualization, an open-source alternative to MATLAB](https://www.runoob.com/numpy/numpy-matplotlib.html)

Install with pip: `pip install matplotlib`

```python
# Usage
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(1,11)
y = 2 * x + 5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x,y)
plt.show()
```

Switching fonts

```python
from matplotlib import pyplot as plt
import matplotlib

def getFont(): # list the available fonts
    font = sorted([f.name for f in matplotlib.font_manager.fontManager.ttflist])
    for i in font:
        print(i)

# getFont()
plt.rcParams['font.family'] = ['Microsoft YaHei']
```

### Requests

The old classic.

```python
import requests

headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586"}
url = ""
session = requests.Session()
res = session.get(url, headers = headers)
# print(res.request.headers)
res.encoding = res.apparent_encoding # 'utf-8'
print(res.text)
```

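### Regular expressions

```python
import re

# Match a floating-point number: optional sign, optional integer part, optional dot, at least one digit
reg = re.compile(r'[-+]?[0-9]*\.?[0-9]+')
```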
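To see how a float pattern that requires at least one digit behaves on scraped price text, here is a minimal sketch. The sample strings and the helper name `parse_price` are made up for illustration and are not taken from the project.

```python
import re

# Float pattern: optional sign, optional integer part, optional dot, at least one digit.
FLOAT_RE = re.compile(r'[-+]?[0-9]*\.?[0-9]+')

def parse_price(text):
    """Return the first float found in text, or None if there is none.

    Hypothetical helper, for illustration only.
    """
    match = FLOAT_RE.search(text)
    return float(match.group()) if match else None

print(parse_price("¥59.90"))      # number embedded after a currency sign
print(parse_price("price: 128"))  # integer without a decimal part
print(parse_price("sold out"))    # no number at all

# fullmatch() only succeeds when the whole string is a number
print(bool(FLOAT_RE.fullmatch("-3.14")))
print(bool(FLOAT_RE.fullmatch("3.14kg")))
```

`search()` suits extraction from surrounding text, while `fullmatch()` suits validating that a field contains nothing but a number.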

### Threads

Multithreading, manual version

```python
import threading
import time

threadlines = 16 # use 16 threads by default; do not exceed 20
flag = 1 # marks the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

threads = []
for thread in range(threadlines):
    name = "thread " + str(thread)
    athread = threading.Thread(target = printTime, args = (name,))
    athread.start()
    threads.append(athread)

for thread in threads: # join() blocks, so the main thread keeps running until every child thread has finished
    thread.join()
```

Thread lock

```python
import threading
import time

threadLock = threading.Lock()
threadlines = 16 # use 16 threads by default; do not exceed 20
flag = 1 # marks the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    newtime = str(time.ctime())
    print(name, newtime)
    threadLock.acquire() # acquire the lock on the txt file (exclusive access)
    write2txt(name + " " + newtime + "\n")
    threadLock.release() # release the lock (give up exclusive access)

def write2txt(name):
    with open('test.txt', 'a+', encoding = 'utf-8') as fd:
        fd.write(name)

threads = []
for thread in range(4):
    name = "thread " + str(thread)
    athread = threading.Thread(target = printTime, args = (name,))
    athread.start()
    threads.append(athread)

for thread in threads: # join() blocks, so the main thread keeps running until every child thread has finished
    thread.join()
```

Thread pool (recommended)

```python
from concurrent.futures import ThreadPoolExecutor
import time

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

with ThreadPoolExecutor(max_workers = 10) as thread:
    for count in range(10):
        name = "thread" + str(count)
        task = thread.submit(printTime, name) # pass the function and the arguments it needs
        print(task.done()) # whether this task has finished, bool
        print(task.result()) # blocks until the task finishes, then returns printTime's return value (None here)
```
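Calling `result()` right after `submit()` waits for that one task before the next is submitted, so the tasks end up running one at a time. A minimal sketch of one way to let them overlap is to submit everything first and collect results with `as_completed`. The function `fetch_item` and its return value are placeholders standing in for a real scraping task, not project code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch_item(item_id):
    """Placeholder for a scraping task: sleep briefly, then return a result."""
    time.sleep(0.1)
    return item_id, item_id * 2

def run_pool(item_ids, max_workers=5):
    """Run fetch_item for every id in a thread pool and gather the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all tasks first so the workers can run them concurrently
        futures = [pool.submit(fetch_item, i) for i in item_ids]
        # as_completed yields each future as soon as it finishes
        for future in as_completed(futures):
            item_id, value = future.result()
            results[item_id] = value
    return results

start = time.perf_counter()
results = run_pool(range(10))
elapsed = time.perf_counter() - start
print(sorted(results.items()))
print(f"elapsed: {elapsed:.2f}s")
```

Results arrive in completion order rather than submission order, which is why the sketch stores them in a dict keyed by item id.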

### Redis

```python
# Install the redis module
## pip install redis
import redis

# Create the connection object
redisconn = redis.Redis(host = '', port = 6379, password = 'x', db = 0)
# Values read from redis are bytes by default; pass decode_responses=True to get strings instead
```
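One common way to share work between several scraper processes through Redis is a URL queue plus a "seen" set for deduplication. The sketch below is an assumption about how that could look, not the project's actual middlewares.py. The key names `milk:urls` and `milk:seen` and both helper functions are hypothetical; only `sadd`, `lpush`, and `rpop` are real redis-py client methods.

```python
URL_QUEUE = "milk:urls"  # hypothetical key for the shared task queue
SEEN_SET = "milk:seen"   # hypothetical key for the set of already queued URLs

def enqueue_url(conn, url):
    """Queue url once; return True if it was newly added."""
    # SADD returns the number of members actually added (0 if already present)
    if conn.sadd(SEEN_SET, url):
        conn.lpush(URL_QUEUE, url)
        return True
    return False

def next_url(conn):
    """Pop the oldest queued URL, or return None when the queue is empty."""
    url = conn.rpop(URL_QUEUE)
    # Without decode_responses=True the client returns bytes
    if isinstance(url, bytes):
        url = url.decode("utf-8")
    return url

# Usage against a running server (host is a placeholder):
# import redis
# conn = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
# enqueue_url(conn, "https://example.com/item/1")
# print(next_url(conn))
```

Pushing on the left and popping on the right gives first-in, first-out order, and because the set and the list live on the Redis server, every worker sees the same queue.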

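## Notes

- Without historical price lookup

Before threads were used, a full run over five categories (30 x 10 x 5 = 1500 items in total) took 365 s.

With 5 threads, a full run over five categories, 1500 items in total, took 130 s.

With 16 threads, a full run over five categories, 1500 items in total, took 80 s.

- With historical price lookup

Without the thread pool, a full run over 1500 items took a very long time.

With the thread pool, a full run over 1500 items took 544 s.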
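The timings without historical price lookup can be turned into throughput and speedup figures with a few lines of arithmetic. The numbers are the ones listed in these notes; the variable names are only illustrative.

```python
TOTAL_ITEMS = 30 * 10 * 5  # 1500 items, as stated in the notes

# Wall-clock seconds per configuration (no historical price lookup)
timings = {"no threads": 365, "5 threads": 130, "16 threads": 80}

baseline = timings["no threads"]
summary = {}
for label, seconds in timings.items():
    throughput = TOTAL_ITEMS / seconds  # items per second
    speedup = baseline / seconds        # relative to the run without threads
    summary[label] = (round(throughput, 2), round(speedup, 2))
    print(f"{label:>10}: {throughput:5.2f} items/s, {speedup:.2f}x vs. no threads")
```

By this measure 5 threads give roughly a 2.8x speedup and 16 threads roughly 4.6x, so the gain grows more slowly than the thread count.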