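# milkSpider

selenium + redis + distributed + xpath + etree + visualization

Task: scrape the product name, description, and price information for every category of milk on sale on JD.com (京东). Derive each product's price trend and display it with Python visualization. The crawl runs automatically as a scheduled task.
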
## TODO
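
- [x] Initialize the selenium framework, write the scraping rules, and get a first small-scale crawl working
- [x] Scrape historical prices from a price-history web page
- [x] Add a Redis-based distributed design
- [x] Data visualization
- [x] Two planned modes (terminal interaction): pick an item at random or by review count, then show its details, such as the price trend
- [x] Category selection with a friendly interactive experience
- [x] Choose the main sort key (price or reviews)
- [ ] Package the Python code as an exe; does it need a GUI?
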
## Project

### Project directory

> Selesium
>
> > downloader.py: the downloader, i.e. the code that scrapes the content
> >
> > middlewares.py: distributed setup, threads, and redis-related code
> >
> > pipelines.py: processes the scraped data and stores it in the corresponding files
> >
> > milkSpider.py: main file; configures crawl settings, automation, etc.
> >
> > historyPrice.py: scrapes historical prices
> >
> > view.py: reads and parses the data and configures the visualization
> >
> > settings.py: main configuration file

## Installation and setup

### Git

```powershell
# Install Git
winget install --id Git.Git -e --source winget
# Or download the installer from https://git-scm.com/download/win

# To use git from PowerShell, edit your profile
vim $PROFILE
# In the profile, point GITPATH at git.exe and alias it:
#   $GITPATH = "~/Git/cmd/git.exe"
#   Set-Alias git $GITPATH

git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v # show the remote repository info
touch .gitignore # create the ignore file (if touch is unavailable in PowerShell: New-Item .gitignore)

git add *.py # stage the local files you want to push
git commit -m "update" # create a commit first
git push -u origin master # push; if it fails, -f forces it (warning: this irreversibly overwrites remote history)
```

### selenium

Configure the downloader, which uses selenium to simulate normal browsing behavior in a real browser.

Install

```powershell
# Install selenium
pip3 install selenium

# Show the installed package information
pip show selenium
```

Usage

```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    """Load url in headless Chrome and return the rendered page source as a string."""
    init = Options()

    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument("disable-cache")
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3') # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
    init.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])

    driver = webdriver.Chrome(options=init) # options= replaces the deprecated chrome_options= keyword
    driver.implicitly_wait(10)
    try:
        driver.get(url)

        response = etree.HTML(driver.page_source)
        response = etree.tostring(response, encoding="utf-8", pretty_print=True, method="html")
        response = response.decode('utf-8')
    finally:
        driver.quit() # quit() ends the whole browser session; close() only closes the current window
    return response
```

A few notes

```python
import html
from lxml import etree

text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text, etree.HTMLParser()) # equivalent to etree.HTML(text)

# --- Handling a string ---
# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))

# Method 2: serialize with utf-8 encoding
res = etree.tostring(html1, encoding="utf-8") # this does not work for Chinese text used in tag attributes
print(res.decode('utf-8'))

# --- Handling a file ---
# Method 1: read the document with open() and handle it as a string
with open('test.html', encoding='utf-8') as f:
    html1 = etree.HTML(f.read())
# then continue with either of the two string-handling methods above

# Method 2: specify the encoding when reading the document with parse()
html1 = etree.parse('test.html', etree.HTMLParser(encoding='utf-8'))
# The encoding given here must match the document's actual encoding, otherwise the output is garbled
# then continue with either of the two string-handling methods above
```
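
To see why method 1 works without needing lxml installed, the sketch below reproduces the effect with the standard library only. It assumes (as lxml documents for `etree.tostring()` without an `encoding` argument) that the serializer emits ASCII and writes every non-ASCII character as a numeric character reference; the `xmlcharrefreplace` error handler is used here as a stand-in for that behavior.

```python
import html

text = "这是测试"

# Stand-in for an ASCII-only serializer: each non-ASCII character
# becomes a numeric character reference such as &#36825;
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# html.unescape() turns the references back into the original characters
restored = html.unescape(escaped)
print(restored)
print(restored == text)
```

Method 2 avoids the round trip entirely by asking the serializer for UTF-8 output in the first place.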

Request headers, cookies, etc.

```python
import json
from selenium.webdriver.chrome.options import Options

options = Options()
# Visit https://httpbin.org/get?show_env=1 to see the request info the browser is currently sending
options.add_argument('lang=zh_CN.UTF-8')

# Saving and loading cookies with the json module
def getCookies():
    with open('cookies.json', 'r', encoding='utf-8') as fd:
        listCookies = json.loads(fd.read())
    for cookie in listCookies:
        cookies = {
            'domain': cookie['domain'],
            'httpOnly': cookie['httpOnly'],
            'name': cookie['name'],
            'path': '/',
            'secure': cookie['secure'],
            'value': cookie['value'],
        }
        print(cookies)

def saveCookies(driver):
    jsonCookies = json.dumps(driver.get_cookies())
    with open('cookies.json', 'w', encoding='utf-8') as fd:
        fd.write(jsonCookies)
```
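
`getCookies` above only prints the cookies. A possible next step, sketched here under assumptions, is to return them so they can be passed to selenium's `driver.add_cookie()`. The helper name `load_cookies` is hypothetical, and defaulting `httpOnly` and `secure` to `False` when they are missing from the file is an assumption of this sketch, not something the project specifies.

```python
import json

def load_cookies(path):
    """Read cookies saved as JSON and return them as a list of dicts,
    keeping only the fields used in getCookies above."""
    with open(path, 'r', encoding='utf-8') as fd:
        saved = json.load(fd)
    cookies = []
    for cookie in saved:
        cookies.append({
            'domain': cookie['domain'],
            'httpOnly': cookie.get('httpOnly', False),  # assumed default
            'name': cookie['name'],
            'path': '/',
            'secure': cookie.get('secure', False),      # assumed default
            'value': cookie['value'],
        })
    return cookies

# Usage sketch (needs a live driver that has already opened a page on the cookie's domain):
# for cookie in load_cookies('cookies.json'):
#     driver.add_cookie(cookie)
```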

ChromeDriver

Download [ChromeDriver](https://chromedriver.chromium.org/home) and put it in the current directory. If you put it in the Python root directory instead, you do not need to pass the chromedriver path when instantiating selenium.

### Matplotlib

[Python data visualization, an open-source alternative to MATLAB](https://www.runoob.com/numpy/numpy-matplotlib.html)

Install it with pip: `pip install matplotlib`

```python
# Usage
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(1, 11)
y = 2 * x + 5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x, y)
plt.show()
```

Switching fonts

```python
from matplotlib import pyplot as plt
import matplotlib
import matplotlib.font_manager # import the submodule explicitly before using it

def getFont(): # list the available fonts
    font = sorted([f.name for f in matplotlib.font_manager.fontManager.ttflist])
    for i in font:
        print(i)
# getFont()
plt.rcParams['font.family'] = ['Microsoft YaHei']
```

### Pandas

```python
import pandas as pd

filename = 'data.csv' # placeholder: path to the scraped CSV file
# on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
df = pd.read_csv(filename, encoding='utf-8', header=0, on_bad_lines='skip')

df.columns # names of all column headers
df.xx # the data in column xx
df['xx'] # same as above
df.sort_values(by='xx', ascending=True) # sort by a column
df.loc[index] # all data in row index
df.loc[index, 'xx'] # a single value from that row
```

### Requests

The classic standby.

```python
import requests

headers = { "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586"}
url = "" # set this to the page you want to fetch

session = requests.Session()
res = session.get(url, headers=headers)
# print(res.request.headers)
res.encoding = res.apparent_encoding # or set 'utf-8' explicitly
print(res.text)
```
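
### Regular expressions

```python
import re

# Match a complete floating-point number.
# At least one digit is required, so '' and '.' are rejected.
reg = re.compile(r'[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)')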
```
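
As an illustration of how such a pattern might be used on scraped text, the sketch below checks whether a string is a complete float and pulls the first number out of a price label. The helper names `is_float` and `first_price` and the sample strings are hypothetical, and the pattern requires at least one digit so that empty strings do not match.

```python
import re

# Optional sign, then either digits with an optional fraction, or a bare fraction
FLOAT = re.compile(r'[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)')

def is_float(text):
    """True only if the whole string is a floating-point number."""
    return FLOAT.fullmatch(text) is not None

def first_price(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = FLOAT.search(text)
    return float(match.group()) if match else None

print(is_float("59.90"))
print(is_float("1.2.3"))
print(first_price("¥59.90 起"))
```

`fullmatch` is what makes the check "complete": `search` or `match` alone would accept a string that merely starts with or contains a number.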

### Threads

Multithreading, manual version

```python
import threading
import time

threadlines = 16 # use 16 threads by default; do not go above 20
flag = 1 # flag for identifying the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

threads = []
for thread in range(threadlines):
    name = "thread " + str(thread)
    athread = threading.Thread(target=printTime, args=(name,))
    athread.start()
    threads.append(athread)

for thread in threads: # block here so the main thread keeps running until every child thread has finished
    thread.join()
```

Thread lock

```python
import threading
import time

threadLock = threading.Lock()
threadlines = 16 # use 16 threads by default; do not go above 20
flag = 1 # flag for identifying the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    newtime = str(time.ctime())
    print(name, newtime)
    threadLock.acquire() # take the lock on the txt file (exclusive access)
    try:
        write2txt(name + " " + newtime + "\n")
    finally:
        threadLock.release() # release the lock (give up exclusive access)

def write2txt(text):
    with open('test.txt', 'a+', encoding='utf-8') as fd:
        fd.write(text)

threads = []
for thread in range(4):
    name = "thread " + str(thread)
    athread = threading.Thread(target=printTime, args=(name,))
    athread.start()
    threads.append(athread)

for thread in threads: # block here so the main thread keeps running until every child thread has finished
    thread.join()
```

Thread pool (recommended)

```python
from concurrent.futures import ThreadPoolExecutor
import time

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

with ThreadPoolExecutor(max_workers=10) as thread:
    tasks = []
    for count in range(10):
        name = "thread" + str(count)
        tasks.append(thread.submit(printTime, name)) # pass the function and the arguments it needs
    for task in tasks:
        print(task.done()) # whether this task has finished (bool)
        print(task.result()) # blocks until the task is done, then returns printTime's return value (None here)
```
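
When the worker function returns data, as a page scraper would, the results can be collected as each task finishes. The sketch below shows one way to do that with `as_completed`; `fetch` is a hypothetical stand-in that sleeps instead of hitting the network, and the page numbers and worker count are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(page):
    """Hypothetical stand-in for scraping one page: sleeps briefly, then returns fake data."""
    time.sleep(0.1)
    return page, "items from page %d" % page

def crawl(pages, workers=5):
    """Run fetch over all pages in a thread pool and gather results keyed by page."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, page) for page in pages]
        for future in as_completed(futures):  # yields each future as soon as it finishes
            page, items = future.result()
            results[page] = items
    return results

results = crawl(range(1, 11))
print(len(results), "pages collected")
```

Because only the main thread writes to `results`, no lock is needed here; a lock is only required when the workers themselves write to a shared resource such as a file.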

### Redis

```python
# Install the redis module
## pip install redis
import redis

# Create the client instance
redisconn = redis.Redis(host='127.0.0.1', port=6379, password='x', db=0)

# redis returns bytes by default; pass decode_responses=True to get strings instead
```
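
One common way to share work between several crawler processes is a Redis list used as a task queue plus a Redis set for de-duplication. The sketch below is only an illustration of that idea, not the project's actual `middlewares.py`. The class names, key names, and example URLs are hypothetical. It relies on three redis-py commands (`sadd`, `lpush`, `rpop`), and `InMemoryConn` is a stand-in that mimics them so the example runs without a Redis server.

```python
class TaskQueue:
    """URL queue with de-duplication on top of a Redis-like connection."""

    def __init__(self, conn, name="milkSpider"):
        self.conn = conn
        self.pending = name + ":pending"  # list of URLs waiting to be crawled
        self.seen = name + ":seen"        # set of URLs that were queued before

    def push(self, url):
        # sadd returns 1 only when the member was not already in the set
        if self.conn.sadd(self.seen, url):
            self.conn.lpush(self.pending, url)
            return True
        return False

    def pop(self):
        url = self.conn.rpop(self.pending)  # None when the queue is empty
        if isinstance(url, bytes):          # real Redis returns bytes by default
            url = url.decode("utf-8")
        return url


class InMemoryConn:
    """Stand-in that mimics the three Redis commands used above, for a dry run."""

    def __init__(self):
        self.sets, self.lists = {}, {}

    def sadd(self, key, value):
        members = self.sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])

    def rpop(self, key):
        items = self.lists.get(key)
        return items.pop() if items else None


queue = TaskQueue(InMemoryConn())
queue.push("https://example.com/page/1")
queue.push("https://example.com/page/2")
queue.push("https://example.com/page/1")  # duplicate, ignored
first = queue.pop()
print(first)

# With a real server, pass a redis-py client instead:
# queue = TaskQueue(redis.Redis(host='127.0.0.1', port=6379, db=0))
```

Pushing on the left and popping on the right gives first-in, first-out order, so every worker process that shares the same Redis instance takes the oldest pending URL.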
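
## Notes

- Without history-price lookup

Without threads, a full run over five categories (30 x 10 x 5 = 1500 records) took 365 s.

With 5 threads, a full run over the same 1500 records took 130 s.

With 16 threads, a full run over the same 1500 records took 80 s.

- With history-price lookup

Without the thread pool, a full run over 1500 records took a very long time.

With the thread pool, a full run over 1500 records took 544 s.

- Known issues

  - On non-Windows systems, the visualization window cannot find the font when it opens. The fix is to change the font in settings.py to one that exists on your operating system. The getFont method in view.py lists every font available on the current system.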
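
The timings without history-price lookup in the notes above can be turned into speedup figures with a few lines of arithmetic. The sketch below uses only the numbers reported there; "efficiency" here simply means speedup divided by thread count, a convention chosen for this example.

```python
baseline = 365.0                   # seconds for 1500 records without threads
timings = {5: 130.0, 16: 80.0}     # thread count -> seconds for the same 1500 records

# Speedup relative to the single-threaded run
speedups = {n: baseline / t for n, t in timings.items()}
# Speedup per thread: how much of the ideal n-fold gain was achieved
efficiency = {n: speedups[n] / n for n in timings}

for n in sorted(timings):
    print("%2d threads: %.2fx faster, %.0f%% of ideal" % (n, speedups[n], efficiency[n] * 100))
```

This works out to roughly 2.8x with 5 threads and 4.6x with 16, so the gain per thread drops as threads are added.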

## Reference links

1. [selenium+python automation 100: setting up selenium on CentOS to launch Chrome in headless mode](https://www.cnblogs.com/yoyoketang/p/11582012.html)
2. [Fixing the "'chromedriver' executable needs to be in PATH" problem](https://www.cnblogs.com/Neeo/articles/13949854.html)
3. [Python selenium-chrome: disabling log output](https://blog.csdn.net/wm9028/article/details/107536929)
4. [Writing a Python list to a CSV file row by row](https://blog.csdn.net/weixin_41068770/article/details/103145660)
5. [Using .gitignore in Git to exclude files from pushes](https://blog.csdn.net/lk142500/article/details/82869018)
6. [Defining and using cross-module global variables in Python 3](https://codeantenna.com/a/9YbdOKrrSJ)
7. [Python multithreading](https://www.runoob.com/python/python-multithreading.html)
8. [Introduction to using redis with Python](https://www.runoob.com/w3cnote/python-redis-intro.html)
9. [python + redis: implementing a distributed task queue](https://cloud.tencent.com/developer/article/1697383)
10. [Understanding the join() function of Python threads in depth](https://www.linuxidc.com/Linux/2019-03/157795.htm)
11. [How to understand Python decorators? - Zhihu](https://www.zhihu.com/question/26930016/answer/360300235)
12. [Automation: setting request headers in selenium](https://www.jianshu.com/p/419eb4e00963)
13. [python selenium: saving and loading cookies](https://blog.csdn.net/fox64194167/article/details/80542717)
14. [Selenium: how to add cookies](https://cloud.tencent.com/developer/article/1616175)
15. [Summary notes on using the requests library](https://wenku.baidu.com/view/fa71322401020740be1e650e52ea551810a6c928.html)
16. [Crawlers: common HTTP error codes and their causes](https://blog.csdn.net/Smart_look/article/details/109967222)
17. [Python string operations: splitting and joining strings](https://blog.csdn.net/seetheworld518/article/details/47346527)
18. [Python thread pools](https://www.cnblogs.com/liyuanhong/p/15767817.html)
19. [Setting axes in python matplotlib](https://www.csdn.net/tags/NtzaUgxsOTQ2NjgtYmxvZwO0O0OO0O0O.html)
20. [The most complete guide to reading CSV with Pandas](https://cloud.tencent.com/developer/article/1856554)
21. [Common pandas data-processing operations](https://zhuanlan.zhihu.com/p/29535766)
22. [★★ pandas display settings for data output](https://www.jianshu.com/p/5c0aa1fa19af)
23. [Fixing pandas: ValueError: Cannot convert non-finite values (NA or inf) to integer](https://blog.csdn.net/zhongkeyuanchongqing/article/details/123599260)
24. [Selecting specific rows/columns of a pandas DataFrame](https://www.cnblogs.com/nxf-rabbit75/p/10105271.html)
25. [Pandas: getting the row and column indexes of a DataFrame](https://blog.csdn.net/YENTERTAINR/article/details/109254583)