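# milkSpider
selenium + redis + distributed + xpath + etree + visualization
The task is to crawl the names, descriptions, and prices of the various milk products on sale on JD.com, and to show the corresponding price trends with Python visualizations. Crawling runs automatically as a scheduled task.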
## TODO
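- [x] Initialize the selenium framework, write the crawl rules, and get a first small-scale crawl working
- [x] Crawl historical prices from a price-history web page
- [x] Add the Redis-based distributed design
- [x] Data visualization
- [x] Two planned modes of terminal interaction: pick an item at random, or index by number of reviews, then show the details of the selected item, such as its price trend
- [x] Category selection with a friendly interactive experience
- [x] Choose the main reference metric (price or reviews)
- [ ] Package the Python project as an exe (needs a graphical interface)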
## project
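### Project directory
> Selesium
>
> > downloader.py  The downloader, which fetches the page content
> >
> > middlewares.py  Configures the distributed, threading, and Redis-related parts
> >
> > pipelines.py  Processes the scraped data and stores it in the corresponding files
> >
> > milkSpider.py  Main file; configures crawl settings, automation, etc.
> >
> > historyPrice.py  Crawls historical prices
> >
> > view.py  Reads and parses the data and configures the visualization
> >
> > settings.py  Main configuration file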
## Installation and initialization
### GIT
```powershell
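# Install git
winget install --id Git.Git -e --source winget
## Or download it from the official site:
## https://git-scm.com/download/win
# To use git from PowerShell, edit your profile
vim $PROFILE
## and add the following, adjusting the path to your install:
## $GITPATH = "~/Git/cmd/git.exe"
## Set-Alias git $GITPATH
git init
git remote add origin https://bdgit.educoder.net/mf942lkca/milkSpider.git
git pull https://bdgit.educoder.net/mf942lkca/milkSpider.git
git remote -v # show remote repository info
New-Item .gitignore -ItemType File # create the ignore file (touch .gitignore in Git Bash)
git add *.py # stage the local files to be pushed
git commit -m "update" # create a commit first
git push -u origin master # push; if it fails, -f forces it (note: this can cause irreversible loss)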
```
### selenium
The downloader uses selenium to simulate normal browsing behavior in a browser.
Installation
```powershell
# Install selenium
pip3 install selenium
# Show the package information
pip show selenium
```
Usage
```python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree

def getsource(url):
    """Load url in headless Chrome and return the page source as an HTML string."""
    init = Options()
    init.add_argument('--no-sandbox')
    init.add_argument('--headless')
    init.add_argument('--disable-gpu')
    init.add_argument("disable-cache")
    init.add_argument('disable-infobars')
    init.add_argument('log-level=3')  # INFO = 0, WARNING = 1, LOG_ERROR = 2, LOG_FATAL = 3; default is 0
    init.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
    driver = webdriver.Chrome(options=init)  # chrome_options= is the older, deprecated keyword
    driver.implicitly_wait(10)
    try:
        driver.get(url)
        response = etree.HTML(driver.page_source)
    finally:
        driver.quit()  # end the browser session even if loading fails
    response = etree.tostring(response, encoding="utf-8", pretty_print=True, method="html")
    return response.decode('utf-8')
```
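Once the page source is available as a string, the fields of interest can be pulled out with XPath. The following is a minimal sketch, not the project's actual crawl rules: the sample markup, the class names `item`, `name`, and `price`, and the helper `parse_items` are all hypothetical and stand in for whatever the real page uses.

```python
from lxml import etree

# Hypothetical markup standing in for a product list page.
SAMPLE = """
<ul>
  <li class="item"><a class="name">Milk A 250ml x 12</a><span class="price">49.90</span></li>
  <li class="item"><a class="name">Milk B 1L</a><span class="price">12.50</span></li>
</ul>
"""

def parse_items(source):
    """Return a list of {"name", "price"} dicts extracted from an HTML string."""
    tree = etree.HTML(source)
    items = []
    for li in tree.xpath('//li[@class="item"]'):
        # string(...) returns the text content of the first matching node
        name = li.xpath('string(.//a[@class="name"])').strip()
        price = li.xpath('string(.//span[@class="price"])').strip()
        items.append({"name": name, "price": float(price)})
    return items

if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item["name"], item["price"])
```

In practice the string returned by `getsource(url)` would be passed in place of `SAMPLE`, with the XPath expressions adapted to the real page structure.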
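Some notes
```python
import html
from lxml import etree

# --- Handling a string ---
text = """this is test content;这是测试内容。"""
html1 = etree.HTML(text)
# html1 = etree.fromstring(text, etree.HTMLParser())  # equivalent to HTML()
# Method 1: use html.unescape()
res = etree.tostring(html1)
print(html.unescape(res.decode('utf-8')))
# Method 2: use utf-8 encoding
res = etree.tostring(html1, encoding="utf-8")  # this approach does not work for Chinese attributes on tags
print(res.decode('utf-8'))

# --- Handling a file ---
# Method 1: read the document with open() and process it as a string
with open('test.html', encoding='utf-8') as f:  # assumes the file is UTF-8
    html1 = etree.HTML(f.read())
# then continue with either of the two string-handling methods above
# Method 2: specify the encoding when parse() reads the document
html1 = etree.parse('test.html', etree.HTMLParser(encoding='utf-8'))
# specify the correct encoding here (the one the document actually uses), otherwise the output will be garbled
# then continue with either of the two string-handling methods above
```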
Request headers, cookies, etc.
```python
import json
from selenium.webdriver.chrome.options import Options

# Visiting https://httpbin.org/get?show_env=1 returns the request information of the current browser
options = Options()
options.add_argument('lang=zh_CN.UTF-8')

# Saving and reading cookies with the json module
def getCookies():
    with open('cookies.json', 'r', encoding='utf-8') as fd:
        listCookies = json.loads(fd.read())
    for cookie in listCookies:
        cookies = {
            'domain': cookie['domain'],
            'httpOnly': cookie['httpOnly'],
            'name': cookie['name'],
            'path': '/',
            'secure': cookie['secure'],
            'value': cookie['value'],
        }
        print(cookies)

def saveCookies(driver):
    jsonCookies = json.dumps(driver.get_cookies())
    with open('cookies.json', 'w', encoding='utf-8') as fd:
        fd.write(jsonCookies)
```
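The notes above save cookies and print them back, but do not show them being restored into a browser session. A possible way to do that is sketched below, assuming `driver` is a selenium WebDriver that has already opened a page on the cookies' domain (selenium only accepts cookies for the current domain). The helper name `load_cookies` is hypothetical.

```python
import json

def load_cookies(driver, path="cookies.json"):
    """Read cookies saved as JSON and add them to the driver; return how many were added."""
    with open(path, "r", encoding="utf-8") as fd:
        saved = json.load(fd)
    count = 0
    for cookie in saved:
        # Keep only the commonly needed fields; missing ones get neutral defaults.
        entry = {
            "name": cookie["name"],
            "value": cookie["value"],
            "path": cookie.get("path", "/"),
            "secure": cookie.get("secure", False),
            "httpOnly": cookie.get("httpOnly", False),
        }
        if "domain" in cookie:
            entry["domain"] = cookie["domain"]
        driver.add_cookie(entry)  # WebDriver.add_cookie takes a dict
        count += 1
    return count
```

A typical sequence would be `driver.get(url)`, then `load_cookies(driver)`, then `driver.refresh()` so the page is reloaded with the restored cookies.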
ChromeDriver
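Download [ChromeDriver](https://chromedriver.chromium.org/home) and put it in the current directory. (If it is placed in the Python root directory, you do not need to specify the chromedriver path when instantiating selenium.)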
### Matplotlib
[Python data visualization: an open-source alternative to MATLAB](https://www.runoob.com/numpy/numpy-matplotlib.html)
Install with pip: `pip install matplotlib`
```python
# Usage
import numpy as np
from matplotlib import pyplot as plt
x = np.arange(1,11)
y = 2 * x + 5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x,y)
plt.show()
```
Switching fonts
```python
from matplotlib import pyplot as plt
from matplotlib import font_manager

def getFont():  # list the available fonts
    font = sorted([f.name for f in font_manager.fontManager.ttflist])
    for i in font:
        print(i)

# getFont()
plt.rcParams['font.family'] = ['Microsoft YaHei']
```
### Pandas
```python
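import pandas as pd
# filename is the path of the CSV file to read
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(filename, encoding='utf-8', header=0, on_bad_lines='skip')
df.columns # names of all column headers
df.xx # the data in column xx
df['xx'] # same as above
df.sort_values(by='xx', ascending=True) # sort by a column
df.loc[index] # all data in the row labelled index
df.loc[index, index2] # a single value from that row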
```
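The operations listed above can be tried on a small in-memory table. This is only an illustrative sketch: the column names `name`, `price`, and `comments` and the three rows are made-up sample data, not the project's actual output format.

```python
import io
import pandas as pd

# Made-up sample data standing in for a crawled CSV file.
CSV_TEXT = """name,price,comments
Milk A,49.9,2000
Milk B,12.5,150
Milk C,35.0,980
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), header=0)

# Sort by price and renumber the rows so that label 0 is the cheapest item.
cheapest_first = df.sort_values(by="price", ascending=True).reset_index(drop=True)
top = cheapest_first.loc[0]               # the whole first row, as a Series
top_name = cheapest_first.loc[0, "name"]  # a single value from that row

# Label of the row with the most reviews, then its name.
most_reviewed = df.loc[df["comments"].idxmax(), "name"]

print(list(df.columns))
print(top_name, most_reviewed)
```

Using `df.loc[row, column]` in one step, rather than chaining `df.loc[row][column]`, avoids creating an intermediate Series.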
### Requests
The old classic.
```python
import requests
headers = { "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586"}
url = ""
session = requests.Session()
res = session.get(url, headers = headers)
# print(res.request.headers)
res.encoding = res.apparent_encoding # 'utf-8'
print(res.text)
```
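### Regular expressions
```python
import re
# Match a floating-point number; at least one digit is required, so an empty string does not match
reg = re.compile(r'[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)')
# use reg.fullmatch(s) to require that the whole string is a number
```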
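As a quick check of how such a pattern behaves, the sketch below wraps a float pattern in two hypothetical helpers, `is_float` and `find_floats`. The pattern here requires at least one digit, which is an assumption about the intended behavior: a pattern made only of optional parts would also match the empty string.

```python
import re

# Optional sign, then either digits with an optional fraction, or a bare fraction.
FLOAT_RE = re.compile(r"[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")

def is_float(text):
    """True if the whole string is a floating-point literal."""
    return FLOAT_RE.fullmatch(text) is not None

def find_floats(text):
    """Return every number found in the text, converted to float."""
    return [float(m) for m in FLOAT_RE.findall(text)]

if __name__ == "__main__":
    print(is_float("49.90"), is_float("1.2.3"), is_float(""))
    print(find_floats("was 59.9 now 49.90"))
```

`fullmatch` is the right call when validating a single field such as a price, while `findall` suits pulling numbers out of longer text.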
### Threads
Multithreading, the manual way
```python
import threading
import time

threadlines = 16  # use 16 threads by default; do not exceed 20
flag = 1  # marks the main thread

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

threads = []
for thread in range(threadlines):
    name = "thread " + str(thread)
    athread = threading.Thread(target=printTime, args=(name,))
    athread.start()
    threads.append(athread)
for thread in threads:  # block so the main thread keeps running until every child thread has finished
    thread.join()
```
Thread lock
```python
import threading
import time

threadLock = threading.Lock()
threadlines = 16  # use 16 threads by default; do not exceed 20
flag = 1  # marks the main thread

def write2txt(name):
    with open('test.txt', 'a+', encoding='utf-8') as fd:
        fd.write(name + "\n")

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    newname = name + " " + str(time.ctime())
    print(newname)
    threadLock.acquire()  # take the lock for exclusive access to the txt file
    try:
        write2txt(newname)
    finally:
        threadLock.release()  # release the lock (give up the exclusive access)

threads = []
for thread in range(4):
    name = "thread " + str(thread)
    athread = threading.Thread(target=printTime, args=(name,))
    athread.start()
    threads.append(athread)
for thread in threads:  # block so the main thread keeps running until every child thread has finished
    thread.join()
```
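To see what the lock guarantees, the sketch below has several threads append lines to one file and then checks that every line arrived whole. The function name `run_workers` and its parameters are hypothetical, and the file path is supplied by the caller.

```python
import threading

def run_workers(path, workers=4, lines_per_worker=50):
    """Have several threads append lines to one file under a lock; return the lines written."""
    lock = threading.Lock()

    def worker(name):
        for i in range(lines_per_worker):
            line = f"{name},{i}\n"
            with lock:  # only one thread at a time may open and write the file
                with open(path, "a", encoding="utf-8") as fd:
                    fd.write(line)

    threads = [threading.Thread(target=worker, args=(f"thread {n}",)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:  # wait for every worker before reading the file back
        t.join()
    with open(path, encoding="utf-8") as fd:
        return fd.read().splitlines()
```

Using `with lock:` is equivalent to calling `acquire()` and `release()` by hand, with the release guaranteed even if the write raises an exception.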
Thread pool (recommended)
```python
from concurrent.futures import ThreadPoolExecutor
import time

def printTime(name):
    print(name, time.ctime())
    time.sleep(4)
    print(name, time.ctime())

with ThreadPoolExecutor(max_workers=10) as pool:
    tasks = []
    for count in range(10):
        name = "thread" + str(count)
        tasks.append(pool.submit(printTime, name))  # pass the function and the arguments it needs
    for task in tasks:
        print(task.done())    # whether this task has finished (bool)
        print(task.result())  # return value of printTime (None here); blocks until the task is done
```
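When each task returns data, as a page crawl would, the results can be gathered as the tasks finish. The following is a sketch under stated assumptions: `fetch` is a hypothetical stand-in that sleeps instead of loading a real page, and `crawl` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(page):
    """Stand-in for crawling one page: sleeps briefly, then returns fake items."""
    time.sleep(0.05)
    return page, [f"item-{page}-{i}" for i in range(3)]

def crawl(pages, max_workers=5):
    """Run fetch over all pages in a thread pool and collect the results by page number."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, page) for page in pages]
        for future in as_completed(futures):  # yields each future as soon as it finishes
            page, items = future.result()
            results[page] = items
    return results

if __name__ == "__main__":
    data = crawl(range(1, 11))
    print(len(data), data[1])
```

Submitting every task before calling `result()` keeps the workers busy in parallel; calling `result()` right after each `submit()` would make the tasks run one at a time.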
### Redis
```python
# Install the redis module
## pip install redis
import redis
# Create the client instance
redisconn = redis.Redis(host='127.0.0.1', port=6379, password='x', db=0)
# Results fetched from redis are bytes by default; pass decode_responses=True to get strings instead
```
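One common way to share crawl tasks between workers through Redis is a list used as a queue plus a set for de-duplication. The sketch below shows that pattern only; it is not taken from middlewares.py. The key names and the helpers `push_url` and `pop_url` are hypothetical, and `conn` is assumed to be a `redis.Redis` instance like `redisconn` above.

```python
QUEUE_KEY = "milk:queue"  # hypothetical key for the list of pending urls
SEEN_KEY = "milk:seen"    # hypothetical key for the set of urls already queued

def push_url(conn, url):
    """Queue url unless it was queued before; return True if it was added."""
    if conn.sadd(SEEN_KEY, url) == 1:  # SADD returns the number of newly added members
        conn.lpush(QUEUE_KEY, url)     # push on the left ...
        return True
    return False

def pop_url(conn):
    """Take the oldest queued url, or None when the queue is empty."""
    raw = conn.rpop(QUEUE_KEY)         # ... pop on the right, giving first-in, first-out order
    if raw is None:
        return None
    # Without decode_responses=True the client returns bytes.
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw
```

Because `SADD`, `LPUSH`, and `RPOP` are each atomic on the server, several worker processes can call `pop_url` concurrently without receiving the same url twice.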
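## Notes
- Without the price-history lookup
  - Before threads were used, a full run over five categories (30 x 10 x 5 = 1500 records in total) took 365 s
  - With 5 threads, a full run over the five categories (1500 records) took 130 s
  - With 16 threads, a full run over the five categories (1500 records) took 80 s
- With the price-history lookup
  - Without the thread pool, a full run over the 1500 records took a very long time
  - With the thread pool, a full run over the 1500 records took 544 s
- Known issues
  - On non-Windows systems the font cannot be found when the visualization window opens. The fix is to change the font in settings.py to one that is installed on your operating system. The getFont method in view.py lists all fonts available on the current system.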
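The font issue above could also be handled by choosing the font per operating system instead of editing the setting by hand. This is a sketch only: the helper `pick_font` and the font names in the mapping are assumptions, and should be replaced with fonts actually installed on the machine (for example, ones listed by getFont).

```python
import platform

FALLBACK_FONT = "DejaVu Sans"  # the font matplotlib ships with; it has no CJK glyphs

# Assumed per-system fonts with Chinese glyphs; adjust to what is installed.
CJK_FONTS = {
    "Windows": "Microsoft YaHei",
    "Darwin": "Arial Unicode MS",
    "Linux": "Noto Sans CJK SC",
}

def pick_font(system=None, available=None):
    """Return a font name for the given system, falling back if it is not installed.

    system: value of platform.system(); detected automatically when omitted.
    available: optional collection of installed font names to check against.
    """
    system = system or platform.system()
    preferred = CJK_FONTS.get(system, FALLBACK_FONT)
    if available is not None and preferred not in available:
        return FALLBACK_FONT
    return preferred

if __name__ == "__main__":
    print(pick_font())
```

The result could then be assigned as `plt.rcParams['font.family'] = [pick_font()]`, passing the set of names from matplotlib's font list as `available` when a check is wanted.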
## Reference links
1. [selenium+python automation 100: setting up selenium on CentOS to launch Chrome in headless mode](https://www.cnblogs.com/yoyoketang/p/11582012.html)
2. [Fixing the "'chromedriver' executable needs to be in PATH" problem](https://www.cnblogs.com/Neeo/articles/13949854.html)
3. [Python selenium-chrome: disabling log output](https://blog.csdn.net/wm9028/article/details/107536929)
4. [Python: writing a list to a CSV file line by line](https://blog.csdn.net/weixin_41068770/article/details/103145660)
5. [Using .gitignore in Git to exclude files from being pushed](https://blog.csdn.net/lk142500/article/details/82869018)
6. [Python 3: defining and using global variables across modules](https://codeantenna.com/a/9YbdOKrrSJ)
7. [Python multithreading](https://www.runoob.com/python/python-multithreading.html)
8. [Introduction to using redis in Python](https://www.runoob.com/w3cnote/python-redis-intro.html)
9. [python + redis: implementing a distributed task queue](https://cloud.tencent.com/developer/article/1697383)
10. [Understanding the join() function of Python threads in depth](https://www.linuxidc.com/Linux/2019-03/157795.htm)
11. [How to understand Python decorators? - Zhihu](https://www.zhihu.com/question/26930016/answer/360300235)
12. [[Automation] Setting request headers in selenium](https://www.jianshu.com/p/419eb4e00963)
13. [python selenium: saving and reading cookies](https://blog.csdn.net/fox64194167/article/details/80542717)
14. [How to add cookies in Selenium](https://cloud.tencent.com/developer/article/1616175)
15. [Summary notes on using the requests library](https://wenku.baidu.com/view/fa71322401020740be1e650e52ea551810a6c928.html)
16. [Common HTTP error codes in crawlers and their causes](https://blog.csdn.net/Smart_look/article/details/109967222)
17. [Python string operations: splitting and joining strings](https://blog.csdn.net/seetheworld518/article/details/47346527)
18. [Python thread pools](https://www.cnblogs.com/liyuanhong/p/15767817.html)
19. [How to configure axes in python matplotlib](https://www.csdn.net/tags/NtzaUgxsOTQ2NjgtYmxvZwO0O0OO0O0O.html)
20. [The most complete guide to reading CSV with Pandas](https://cloud.tencent.com/developer/article/1856554)
21. [Common operations for data processing in pandas](https://zhuanlan.zhihu.com/p/29535766)
22. [★★ pandas output display settings](https://www.jianshu.com/p/5c0aa1fa19af)
23. [Fixing pandas "ValueError: Cannot convert non-finite values (NA or inf) to integer"](https://blog.csdn.net/zhongkeyuanchongqing/article/details/123599260)
24. [pandas: selecting specific rows/columns of a DataFrame](https://www.cnblogs.com/nxf-rabbit75/p/10105271.html)
25. [Pandas: getting the row index and column index of a DataFrame](https://blog.csdn.net/YENTERTAINR/article/details/109254583)