<style>
.image-gallery {
display: flex;
justify-content: center;
gap: 20px;
flex-wrap: wrap;
margin: 10px 0;
}
.image-gallery p {
text-align: center;
margin: 10px 0 0 0;
font-size: 14px;
color: #666;
}
.image-gallery img {
width: 300px;
height: auto;
border: 1px solid #063827ff;
border-radius: 8px;
}
</style>
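# Mining and Analyzing Danmaku from Videos on Large Language Model Applications
## Project Design and Tech Stack
**Tech stack:**
- Programming language: Python
- Crawler: requests
- Data processing: csv
- Word cloud visualization: the Python stylecloud library
- Other tools: pytest, bs4, pandas, csv
**Core classes and function relationships:**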
```mermaid
graph TD
A[Main program] --> B[Crawler module]
A --> C[Data processing module]
A --> D[Visualization module]
B --> E[Network requests]
B --> F[Data parsing]
C --> G[Data cleaning]
C --> H[Statistical analysis]
```
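**Key algorithms:**
- Danmaku deduplication: the `Counter` class merges duplicate danmaku and counts their frequency in one pass.
- Keyword extraction: a keywords file prepared in advance is used to pick out matching entries from the crawled data.
- Ranking logic: entries are sorted by frequency.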
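
The deduplication and ranking steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `rank_danmaku`, the `top_n` parameter, and the sample data are assumptions made for the example.

```python
from collections import Counter


def rank_danmaku(danmaku, top_n=8):
    """Count identical danmaku and return the top_n most frequent (text, count) pairs."""
    # Strip surrounding whitespace and drop empty entries before counting
    cleaned = [d.strip() for d in danmaku if d and d.strip()]
    # Counter merges duplicates: each distinct text is stored once with its count
    counts = Counter(cleaned)
    # most_common returns entries sorted by count, highest first
    return counts.most_common(top_n)


# Hypothetical sample data
sample = ["LLM is great", "GPT", "LLM is great", " GPT ", "big model", "LLM is great"]
ranking = rank_danmaku(sample, top_n=2)
print(ranking)
```

Because `Counter` keys are unique, deduplication and frequency counting happen in the same step, and no separate set is needed.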
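**Business logic design:**
1. Analyze the structure of the target site (Bilibili video pages) and design the crawler
* Obtain the base URL (url_base)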
```python
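# Excerpt from the crawler class: URLs and request headers
# Search results page for the keyword "LLM"; also sent as the Referer header
self.url_ref = 'https://search.bilibili.com/all?vt=83547368&keyword=LLM'
self.headers = {
    "Referer": self.url_ref,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
}
# Danmaku API endpoint, filled in with a video's cid
self.url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
# Paginated search URL, filled in with the page number and offset
self.url_page_base = "https://search.bilibili.com/all?vt=85151086&keyword=LLM&page={page}&o={offset}"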
```
* Get the links on the LLM video search page
<div class ='image-gallery'>
![alt text](image-1.png)
**'https://search.bilibili.com/all?vt=83547368&keyword=LLM'**
</div>
* Use the browser developer tools (F12) to get the CSS selector for the video links
<div class ='image-gallery'>
![alt text](image-3.png)
**"#i_cecream > div > div:nth-child(2) > div.search-content--gray.search-content > div > div > div > div.video.i_wrapper.search-all-list > div > div:nth-child(1) > div > div.bili-video-card__wrap > a"**
</div>
* Get each video's cid from its link, then use the cid to fetch the danmaku through Bilibili's danmaku API endpoint
`url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"`
<div class ='image-gallery'>
![alt text](image-4.png)
</div>
```txt
Extraction result:
Link URL: https://www.bilibili.com/video/BV12N411x7FL/
Link text: 19.2万21208:10
CID obtained: 1308288574
```
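
As a rough sketch of the URL handling in this step, the snippet below pulls the BV identifier out of a video link and fills a cid into the danmaku endpoint. The helper names `extract_bvid` and `danmaku_url` are hypothetical, and the lookup that maps a video to its cid is left out because the source does not show how it is done.

```python
import re

# Endpoint template taken from the crawler configuration above
URL_INTERFACE_CID = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"


def extract_bvid(video_url):
    """Return the BV identifier embedded in a video link, or None if there is none."""
    match = re.search(r"/video/(BV[0-9A-Za-z]+)", video_url)
    return match.group(1) if match else None


def danmaku_url(cid):
    """Fill the cid into the danmaku endpoint template."""
    return URL_INTERFACE_CID.format(cid=cid)


bvid = extract_bvid("https://www.bilibili.com/video/BV12N411x7FL/")
url = danmaku_url(1308288574)
print(bvid, url)
```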
<div class ='image-gallery'>
![alt text](image-5.png)
**The danmaku content can be accessed through the API URL plus the cid**
</div>
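
Once the response body has been downloaded, the danmaku text still has to be pulled out of it. The sketch below assumes the endpoint returns XML in which each danmaku is a `<d>` element whose text is the comment. The function name `parse_danmaku_xml` and the sample document are hypothetical, and the network request itself is omitted.

```python
import xml.etree.ElementTree as ET


def parse_danmaku_xml(xml_text):
    """Return the text of every <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_text)
    # Assumption: each <d> element holds one danmaku; its "p" attribute
    # (metadata such as the timestamp) is ignored here
    return [d.text for d in root.iter("d") if d.text]


# Hypothetical response body, shaped like the XML the endpoint is assumed to return
sample_xml = """<i>
  <d p="12.3,1,25,16777215,0,0,0,0">LLM is amazing</d>
  <d p="45.6,1,25,16777215,0,0,0,0">what about RAG?</d>
  <d p="50.0,1,25,16777215,0,0,0,0"></d>
</i>"""

texts = parse_danmaku_xml(sample_xml)
print(texts)
```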
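2. Clean the crawled data and use keyword matching to find LLM-related danmaku
* Using the word lists provided by `tool/keywords.py`, build a word filter with `re` that matches keywords and removes colloquial danmaku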
```python
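import re

import tool.keywords as kw  # project module that defines the pattern lists


class DanmakuFilter:
    """Danmaku filter: drops colloquial content and keeps technical discussion."""

    def __init__(self):
        # Compile the filler/colloquial patterns and the LLM keywords into one regex each
        self.colloquial_regex = re.compile('|'.join(kw.colloquial_patterns))
        self.llm_regex = re.compile('|'.join(kw.keywords), re.IGNORECASE)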
```
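
One way the two compiled patterns might be applied is sketched below. The pattern lists are hypothetical stand-ins for the contents of `tool.keywords`, and `is_relevant` is an illustrative helper rather than a method confirmed by the source.

```python
import re

# Hypothetical pattern lists standing in for tool.keywords
COLLOQUIAL_PATTERNS = [r"^哈+$", r"^2333+$", r"^(lol|haha)+$"]
KEYWORDS = ["LLM", "GPT", "大模型", "transformer"]

colloquial_regex = re.compile("|".join(COLLOQUIAL_PATTERNS))
llm_regex = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def is_relevant(text):
    """Keep a danmaku only if it mentions a keyword and is not colloquial filler."""
    text = text.strip()
    # Reject empty strings and anything matching a colloquial pattern
    if not text or colloquial_regex.search(text):
        return False
    # Keep the rest only when an LLM keyword appears (case-insensitive)
    return llm_regex.search(text) is not None


danmaku = ["哈哈哈", "llm推理好快", "GPT yyds", "前排", "2333"]
kept = [d for d in danmaku if is_relevant(d)]
print(kept)
```

Joining the patterns with `|` lets a single `search` call test every pattern at once, which keeps the per-danmaku cost low when the lists grow.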
3. Draw a word cloud from the statistics
* Use Python's stylecloud to draw a word cloud from the processed danmaku
```python
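from stylecloud import gen_stylecloud

gen_stylecloud(
    text=text,  # processed text
    size=1024,  # image size; larger is sharper
    font_path='msyh.ttc',  # path to a Chinese font (e.g. Microsoft YaHei)
    output_name='ai_danmu_stylecloud.png',  # output file name
    icon_name='fas fa-question-circle',  # Font Awesome icon used as the cloud shape
    custom_stopwords=['的', '了', '在', '是', '我', '有', '和', '机'],  # custom stop words
    palette='colorbrewer.qualitative.Set1_8',  # preset color palette
    # background_color='white',  # background color
    gradient='horizontal',  # direction of the color gradient
    max_font_size=200,  # maximum font size
    max_words=500,  # maximum number of words shown
)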
```