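Analysis and Mining of Danmaku (Bullet Comments) from Videos on Large Language Model Applications

Project Design and Tech Stack

Tech stack:

Programming language: Python
Scraping tool: requests
Data processing: csv
Word cloud visualization: the Python stylecloud library
Other tools: pytest, bs4, pandas, csv

Relationships between core classes and functions: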

graph TD
    A[Main program] --> B[Crawler module]
    A --> C[Data processing module]
    A --> D[Visualization module]
    B --> E[Network requests]
    B --> F[Data parsing]
    C --> G[Data cleaning]
    C --> H[Statistical analysis]

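Key algorithms:

Danmaku deduplication: the Counter class collapses identical danmaku into a single key and counts how often each one occurs.
Keyword extraction: a keywords file prepared in advance is used to pick the relevant entries out of the scraped data.
Ranking logic: entries are sorted by frequency, highest first.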
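
The deduplication and ranking steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample danmaku list is made up, and the real pipeline may normalize the text before counting.

```python
from collections import Counter

# Hypothetical sample data standing in for scraped danmaku lines.
danmaku = ["LLM is great", "GPT", "LLM is great", "GPT", "GPT", "fine-tuning"]

counts = Counter(danmaku)           # identical lines collapse into one key
unique_danmaku = list(counts)       # deduplicated, in first-seen order
top_ranked = counts.most_common(2)  # sorted by frequency, highest first

print(unique_danmaku)
print(top_ranked)
```

Counter keeps one key per distinct string, so deduplication and frequency counting happen in a single pass, and most_common provides the ranking.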

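Business logic design:

Analyze the video page structure of the target site (Bilibili) and design the crawler

Set up the base URLs (url_base)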

self.url_ref = 'https://search.bilibili.com/all?vt=83547368&keyword=LLM'
self.headers = {
    "Referer":
    self.url_ref,
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
}
self.url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
self.url_page_base = "https://search.bilibili.com/all?vt=85151086&keyword=LLM&page={page}&o={offset}"

Get the link to the LLM video search page

'https://search.bilibili.com/all?vt=83547368&keyword=LLM'

Use the browser developer tools (F12) to obtain the CSS selector for the video links

"#i_cecream > div > div:nth-child(2) > div.search-content--gray.search-content > div > div > div > div.video.i_wrapper.search-all-list > div > div:nth-child(1) > div > div.bili-video-card__wrap > a"
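
As a rough illustration of what that selector targets, the sketch below collects the link directly inside each div.bili-video-card__wrap element. It uses the standard library's html.parser instead of bs4 so that it runs without third-party packages. The sample HTML is hypothetical and much simpler than a real search page, and the handling of protocol-relative hrefs is an assumption.

```python
from html.parser import HTMLParser


class VideoLinkParser(HTMLParser):
    """Collect hrefs of <a> tags placed directly inside div.bili-video-card__wrap."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._wrap_depth = 0  # > 0 while inside a card wrapper div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._wrap_depth or "bili-video-card__wrap" in classes:
                self._wrap_depth += 1
        elif tag == "a" and self._wrap_depth == 1 and attrs.get("href"):
            href = attrs["href"]
            # Assumption: hrefs may be protocol-relative ("//www...").
            self.links.append("https:" + href if href.startswith("//") else href)

    def handle_endtag(self, tag):
        if tag == "div" and self._wrap_depth:
            self._wrap_depth -= 1


# Hypothetical, heavily simplified search-result markup.
sample_html = """
<div class="bili-video-card__wrap">
  <a href="//www.bilibili.com/video/BV12N411x7FL/">video card</a>
  <div class="bili-video-card__info"><a href="//example.invalid/uploader">uploader</a></div>
</div>
"""

parser = VideoLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Only the anchor at depth 1 inside the wrapper is kept, which approximates the "> a" child combinator at the end of the selector.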

Obtain each video's cid from its link, then use the cid to fetch the danmaku through Bilibili's danmaku API endpoint: url_interface_cid = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
```
Extraction result:
Link URL: https://www.bilibili.com/video/BV12N411x7FL/
Link text: 19.2万21208:10
CID obtained: 1308288574
```
The danmaku content can then be accessed by combining the endpoint URL with the cid.
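
A minimal sketch of this step follows. It assumes the endpoint returns XML in which each danmaku is a d element whose text is the comment body. The helper names build_danmaku_url and parse_danmaku_xml and the sample XML are hypothetical, and no network request is made here.

```python
import xml.etree.ElementTree as ET

URL_INTERFACE_CID = "https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"


def build_danmaku_url(cid):
    """Fill the cid into the danmaku endpoint template."""
    return URL_INTERFACE_CID.format(cid=cid)


def parse_danmaku_xml(xml_text):
    """Return the non-empty text of every <d> element (assumed response layout)."""
    root = ET.fromstring(xml_text)
    return [d.text.strip() for d in root.iter("d") if d.text and d.text.strip()]


# Hypothetical response body used only to exercise the parser.
sample_xml = (
    '<i>'
    '<d p="1.0,1,25,16777215">LLM is great</d>'
    '<d p="2.0,1,25,16777215">   </d>'
    '<d p="3.5,1,25,16777215">GPT</d>'
    '</i>'
)

url = build_danmaku_url(1308288574)
comments = parse_danmaku_xml(sample_xml)
print(url)
print(comments)
```

In the crawler itself, the body passed to the parser would come from a requests call made with the headers shown earlier.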

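Clean the scraped data and use keywords to retrieve LLM-related danmaku

The word lists are provided by tool.keywords.py; word_filter uses re to implement keyword matching and to remove colloquial danmaku.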

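import re


class DanmakuFilter:
    """Danmaku filter: drops colloquial content and keeps technical discussion."""

    def __init__(self):
        # Colloquial patterns and LLM keywords are defined in tool.keywords
        import tool.keywords as kw
        # Compile the regular expressions once, up front
        self.colloquial_regex = re.compile('|'.join(kw.colloquial_patterns))
        self.llm_regex = re.compile('|'.join(kw.keywords), re.IGNORECASE)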
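
How the two compiled patterns might be applied is sketched below. The pattern lists are hypothetical stand-ins for the contents of tool.keywords, which this README does not show, and is_relevant is an illustrative helper rather than a method of the original class. Keywords are passed through re.escape here on the assumption that they are literal strings.

```python
import re

# Hypothetical stand-ins for kw.colloquial_patterns and kw.keywords.
colloquial_patterns = [r"^(ha)+$", r"^6+$"]
keywords = ["LLM", "GPT", "large language model"]

colloquial_regex = re.compile("|".join(colloquial_patterns))
llm_regex = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def is_relevant(text):
    """Keep a danmaku only if it is not colloquial and mentions an LLM keyword."""
    text = text.strip()
    if not text or colloquial_regex.search(text):
        return False
    return llm_regex.search(text) is not None


samples = ["hahaha", "666", "llm inference is fast", "nice video", "GPT vs LLM"]
kept = [s for s in samples if is_relevant(s)]
print(kept)
```

Checking the colloquial pattern first lets filler comments be discarded before the keyword search runs.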

Draw a word cloud from the statistics

Python's stylecloud library is used to draw a word cloud from the processed danmaku.

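from stylecloud import gen_stylecloud

gen_stylecloud(
    text=text,  # the processed text
    size=1024,  # image size; larger values give a sharper image
    font_path='msyh.ttc',  # path to a font with Chinese glyphs (e.g. Microsoft YaHei)
    output_name='ai_danmu_stylecloud.png',  # output file name
    icon_name='fas fa-question-circle',  # icon used as the cloud's shape
    custom_stopwords=['的', '了', '在', '是', '我', '有', '和', '机'],  # custom stop words
    palette='colorbrewer.qualitative.Set1_8',  # predefined color palette
    # background_color='white',  # background color
    gradient='horizontal',  # direction of the color gradient
    max_font_size=200,  # maximum font size
    max_words=500,  # maximum number of words shown
)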
