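# Danmaku Analysis and Mining for Videos on Large Language Model Applications

## Project Design and Tech Stack

**Tech stack:**

- Programming language: Python
- Crawler tool: requests
- Data processing: csv
- Word cloud visualization: Python stylecloud library
- Other tools: pytest, bs4, pandas, csv

**Core class and function relationships:**

```mermaid
graph TD
    A[Main program] --> B[Crawler module]
    A --> C[Data processing module]
    A --> D[Visualization module]
    B --> E[Network requests]
    B --> F[Data parsing]
    C --> G[Data cleaning]
    C --> H[Statistical analysis]
```

**Key algorithms:**

- Danmaku (bullet comment) deduplication: the `Counter` class merges duplicate danmaku and counts their frequency in one pass
- Keyword extraction: the crawled data is matched against a keywords file prepared in advance
- Ranking logic: entries are sorted by frequency

**Business logic design:**

1. Analyze the video structure of the target site (Bilibili) and design the crawler

* Obtain the base URLs

```python
# Attributes set in the crawler class's __init__
self.url_ref = 'https://search.bilibili.com/all?vt=83547368&keyword=LLM'
self.headers = {
    "Referer": self.url_ref,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
}
# Danmaku API endpoint, filled in with a video's cid
self.url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
# Search result page template, filled in with the page number and offset
self.url_page_base = "https://search.bilibili.com/all?vt=85151086&keyword=LLM&page={page}&o={offset}"
```

* Get the link to the LLM video search page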

![alt text](image-1.png) **'https://search.bilibili.com/all?vt=83547368&keyword=LLM'**

* Use the browser's developer tools (F12) to obtain the CSS selector for the video links

![alt text](image-3.png) **"#i_cecream > div > div:nth-child(2) > div.search-content--gray.search-content > div > div > div > div.video.i_wrapper.search-all-list > div > div:nth-child(1) > div > div.bili-video-card__wrap > a"**

* Get each video's cid from its link, then use the cid to fetch the danmaku through Bilibili's danmaku API endpoint: `url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"`

![alt text](image-4.png)

```txt
Extraction result:
Link URL: https://www.bilibili.com/video/BV12N411x7FL/
Link text: 19.2万21208:10
Got CID: 1308288574
```

![alt text](image-5.png) **The danmaku content can be accessed at the API URL with the cid filled in**
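The cid-to-danmaku step can be sketched as follows. This is a minimal illustration, assuming the endpoint returns an XML document in which each danmaku is a `<d>` element whose text is the comment; that layout is an assumption and is not confirmed by this README. `build_danmaku_url`, `parse_danmaku`, and `SAMPLE_XML` are hypothetical names and data introduced here, and the sample stands in for a real response so the sketch runs offline.

```python
import xml.etree.ElementTree as ET

# Endpoint template taken from the project; the cid is filled in per video
URL_INTERFACE_CID = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"

# Hypothetical sample response (assumed layout: one <d> element per danmaku)
SAMPLE_XML = """<i>
  <d p="12.5,1,25,16777215,0,0,0,0">LLM is great</d>
  <d p="30.0,1,25,16777215,0,0,0,0">hahaha</d>
  <d p="45.1,1,25,16777215,0,0,0,0">  </d>
</i>"""


def build_danmaku_url(cid):
    """Fill the cid into the danmaku API URL template."""
    return URL_INTERFACE_CID.format(cid=cid)


def parse_danmaku(xml_text):
    """Return the non-empty text of every <d> element in the XML."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.iter("d") if d.text and d.text.strip()]


print(build_danmaku_url(1308288574))
print(parse_danmaku(SAMPLE_XML))
```

In a real run, the XML would first be downloaded from the built URL (for example with `requests.get(url, headers=headers)`) and the response body passed to `parse_danmaku`.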

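2. Clean the crawled data and search it for LLM-related keywords

* Use the word lists provided by the `tool.keywords` module, and use `re` in word_filter to implement keyword search and removal of colloquial danmaku

```python
import re


class DanmakuFilter:
    """Danmaku filter: removes colloquial content and keeps technical discussion."""

    def __init__(self):
        # Interjections, colloquial expressions, and LLM keywords are defined in tool.keywords
        import tool.keywords as kw

        # Compile the regular expressions
        self.colloquial_regex = re.compile('|'.join(kw.colloquial_patterns))
        self.llm_regex = re.compile('|'.join(kw.keywords), re.IGNORECASE)
```

3. Draw a word cloud from the statistics

* Use Python's stylecloud to draw a word cloud from the processed danmaku

```python
from stylecloud import gen_stylecloud

gen_stylecloud(
    text=text,  # the processed text
    size=1024,  # image size; larger is sharper
    font_path='msyh.ttc',  # path to a Chinese font (e.g. Microsoft YaHei)
    output_name='ai_danmu_stylecloud.png',  # output file name
    icon_name='fas fa-question-circle',
    custom_stopwords=['的', '了', '在', '是', '我', '有', '和', '机'],  # custom stopwords
    palette='colorbrewer.qualitative.Set1_8',  # preset color palette
    # background_color='white',  # background color
    gradient='horizontal',  # direction of the color gradient
    max_font_size=200,  # maximum font size
    max_words=500,  # maximum number of words shown
)
```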
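The cleaning and ranking steps (regex filtering, then `Counter`-based deduplication and frequency sorting) can be sketched together as below. This is a rough illustration rather than the project's implementation: `COLLOQUIAL_PATTERNS` and `KEYWORDS` are hypothetical stand-ins for the lists in `tool.keywords`, and `filter_danmaku`, `rank_danmaku`, and the `top_n` default are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for tool.keywords.colloquial_patterns and tool.keywords.keywords
COLLOQUIAL_PATTERNS = [r"^哈+$", r"^[啊哦嗯]+$", r"^6+$"]
KEYWORDS = ["LLM", "GPT", "大模型", "prompt"]

colloquial_regex = re.compile("|".join(COLLOQUIAL_PATTERNS))
llm_regex = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def filter_danmaku(danmaku_list):
    """Drop empty or colloquial danmaku and keep those that mention a keyword."""
    kept = []
    for text in danmaku_list:
        text = text.strip()
        if not text or colloquial_regex.search(text):
            continue
        if llm_regex.search(text):
            kept.append(text)
    return kept


def rank_danmaku(danmaku_list, top_n=10):
    """Merge duplicates with Counter and return the most frequent entries."""
    return Counter(danmaku_list).most_common(top_n)


sample = ["哈哈哈", "GPT太强了", "gpt太强了", "GPT太强了", "666", "大模型落地应用", "今天天气不错"]
ranking = rank_danmaku(filter_danmaku(sample))
print(ranking)
```

Note that `Counter` treats `GPT太强了` and `gpt太强了` as different entries; normalizing case before counting would be a possible refinement if such variants should be merged.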