{
"cells": [
{
"cell_type": "raw",
"id": "69e76aa7-2c5d-4114-a302-85e17cc83e2c",
"metadata": {},
"source": [
"本文旨在通过一个案例(读取 data 目录下 100 篇小说文本,统计词频并输出前 10 高频词)来说明结构化编程和封装方法如何提升代码工程质量。\n",
"教案将逐步展示不同结构化方法和封装技术的应用,并分析其对代码可读性、可维护性、可扩展性和复用性的提升。"
]
},
{
"cell_type": "markdown",
"id": "b9a9a366-7fd3-422b-b3bc-b0bc00374da6",
"metadata": {},
"source": [
"# 教学目标\n",
"- 掌握封装方法(函数、类、模块)在代码组织中的作用。"
]
},
{
"cell_type": "markdown",
"id": "1387e026-c978-4217-9015-ab0e047c01a0",
"metadata": {},
"source": [
"## 第一部分:基础实现(无结构化、无封装)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "33803186-d890-4cd7-9636-8920fcb86e14",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"files = os.listdir('data')\n",
"word_count = {}\n",
"for file in files:\n",
" with open('data/' + file, 'r', encoding='utf-8') as f:\n",
" text = f.read()\n",
" words = text.split() # 假设简单按空格分词\n",
" for word in words:\n",
" if word in word_count:\n",
" word_count[word] += 1\n",
" else:\n",
" word_count[word] = 1\n",
"\n",
"# 排序并输出前10\n",
"sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)\n",
"for i in range(10):\n",
" print(sorted_words[i])"
]
},
{
"cell_type": "markdown",
"id": "471351e7-8645-4690-973a-7d8de53bda5f",
"metadata": {},
"source": [
"### 问题分析\n",
"\n",
"- 可读性差:没有清晰的功能划分,代码逻辑混杂,难以阅读理解维护。\n",
"- 扩展性差:如果需要更改分词逻辑、文件路径或输出格式,需修改多处代码。\n",
"- 容错性差:未处理文件读取失败、空文件等问题。\n",
"- 复用性低:逻辑无法直接复用在其他类似任务中。"
]
},
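{
"cell_type": "markdown",
"id": "f1a2b3c4-d5e6-4789-9abc-def012345601",
"metadata": {},
"source": [
"To make the fault-tolerance point concrete, here is a minimal sketch (the file name is hypothetical) of the guard the naive loop is missing; without it, a single unreadable entry aborts the whole run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1a2b3c4-d5e6-4789-9abc-def012345602",
"metadata": {},
"outputs": [],
"source": [
"# One bad entry (missing file, subdirectory, wrong encoding) crashes the\n",
"# unguarded loop above. Wrapping the read keeps the run alive:\n",
"try:\n",
"    with open('data/does_not_exist.txt', 'r', encoding='utf-8') as f:  # hypothetical path\n",
"        text = f.read()\n",
"except OSError as e:\n",
"    print(f\"Skipping unreadable file: {e}\")"
]
},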
{
"cell_type": "markdown",
"id": "a5881283-c295-4433-8edd-f915201a5f43",
"metadata": {},
"source": [
"## 第二部分:引入函数封装\n",
"\n",
"提炼出若干函数,减少代码的复杂性,提高可读性和可维护性。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7beadc81-f939-4ac5-b885-407c6810b7de",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"def read_file(file_path):\n",
" \"\"\"读取单个文件内容\"\"\"\n",
" try:\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" return f.read()\n",
" except Exception as e:\n",
" print(f\"Error reading {file_path}: {e}\")\n",
" return \"\"\n",
"\n",
"def get_words(text):\n",
" \"\"\"简单分词(按空格)\"\"\"\n",
" return text.split()\n",
"\n",
"def count_words(words):\n",
" \"\"\"统计词频\"\"\"\n",
" word_count = {}\n",
" for word in words:\n",
" word_count[word] = word_count.get(word, 0) + 1\n",
" return word_count\n",
"\n",
"def get_top_n(word_count, n=10):\n",
" \"\"\"获取前 N 高频词\"\"\"\n",
" return sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:n]\n",
"\n",
"def main():\n",
" \"\"\"主函数,控制流程\"\"\"\n",
" word_count = {}\n",
" data_dir = 'data'\n",
" \n",
" # 顺序结构:按步骤读取文件、处理文本\n",
" for file in os.listdir(data_dir):\n",
" file_path = os.path.join(data_dir, file)\n",
" # 选择结构:检查文件是否为 txt\n",
" if file_path.endswith('.txt'):\n",
" text = read_file(file_path)\n",
" # 循环结构:处理每个文件的词\n",
" words = get_words(text)\n",
" file_word_count = count_words(words)\n",
" # 合并词频\n",
" for word, count in file_word_count.items():\n",
" word_count[word] = word_count.get(word, 0) + count\n",
" \n",
" # 输出结果\n",
" top_words = get_top_n(word_count)\n",
" for word, count in top_words:\n",
" print(f\"{word}: {count}\")\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"id": "4f7218a3-43d2-4159-9854-9880020c42fc",
"metadata": {},
"source": [
"### 改进分析\n",
" - 逻辑分层main() 函数清晰定义了程序执行步骤(读取文件 -> 分词 -> 统计 -> 输出)。\n",
" - 模块化将功能拆分为函数read_file、get_words、count_words、get_top_n提高代码复用性和可读性。\n",
" - 错误处理:增加 try-except 处理文件读取异常。\n",
" - 工程质量提升:\n",
" - 可读性:函数命名本身就帮助理解代码,逻辑分块。\n",
" - 可维护性:修改某部分功能(如分词逻辑)只需改对应函数。\n",
" - 复用性:函数可复用在其他类似任务中。"
]
},
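{
"cell_type": "markdown",
"id": "a7b8c9d0-e1f2-4a3b-8c4d-5e6f70819202",
"metadata": {},
"source": [
"As a quick check of the reusability claim, the sketch below reuses get_words, count_words, and get_top_n on an in-memory string (the sample sentence is made up); no file I/O is involved:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0-e1f2-4a3b-8c4d-5e6f70819203",
"metadata": {},
"outputs": [],
"source": [
"# Reusing the same functions on ad-hoc text instead of files:\n",
"sample = \"to be or not to be that is the question\"\n",
"print(get_top_n(count_words(get_words(sample)), n=3))\n",
"# Expected: [('to', 2), ('be', 2), ('or', 1)]"
]
},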
{
"cell_type": "markdown",
"id": "50737966-57c9-4daf-ac3b-6a1c73b18136",
"metadata": {},
"source": [
"## 第三部分:引入类封装\n",
"\n",
"通过类封装功能,进一步提高代码的模块化、可扩展性和复用性。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81aa7f9c-de28-4a7a-8ba1-130c3e5e4f7f",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import jieba\n",
"from collections import Counter\n",
"\n",
"class TextAnalyzer:\n",
" \"\"\"文本分析类,封装词频统计功能\"\"\"\n",
" def __init__(self, data_dir='data', top_n=10):\n",
" self.data_dir = data_dir\n",
" self.top_n = top_n\n",
" self.word_count = Counter()\n",
"\n",
" def read_file(self, file_path):\n",
" \"\"\"读取文件内容\"\"\"\n",
" try:\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" return f.read()\n",
" except Exception as e:\n",
" print(f\"Error reading {file_path}: {e}\")\n",
" return \"\"\n",
"\n",
" def tokenize(self, text):\n",
" \"\"\"使用 jieba 进行中文分词\"\"\"\n",
" return jieba.lcut(text)\n",
"\n",
" def process_file(self, file_path):\n",
" \"\"\"处理单个文件\"\"\"\n",
" if file_path.endswith('.txt'):\n",
" text = self.read_file(file_path)\n",
" words = self.tokenize(text)\n",
" self.word_count.update(words)\n",
"\n",
" def process_directory(self):\n",
" \"\"\"处理目录下所有文件\"\"\"\n",
" for file in os.listdir(self.data_dir):\n",
" file_path = os.path.join(self.data_dir, file)\n",
" self.process_file(file_path)\n",
"\n",
" def get_top_words(self):\n",
" \"\"\"获取前 N 高频词\"\"\"\n",
" return self.word_count.most_common(self.top_n)\n",
"\n",
" def run(self):\n",
" \"\"\"执行词频统计\"\"\"\n",
" self.process_directory()\n",
" top_words = self.get_top_words()\n",
" for word, count in top_words:\n",
" print(f\"{word}: {count}\")\n",
"\n",
"def main():\n",
" analyzer = TextAnalyzer(data_dir='data', top_n=10)\n",
" analyzer.run()\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"id": "62e780d4-94de-4830-89c2-ab2c96500fc5",
"metadata": {},
"source": [
"### 改进分析\n",
"- 面向对象封装:\n",
" - 使用 TextAnalyzer 类将所有功能封装为一个对象,数据(如 word_count和方法如 tokenize绑定在一起。\n",
" - 通过 __init__ 提供配置(如 data_dir 和 top_n提高灵活性。\n",
" \n",
"- 模块化:类方法分工明确(如 read_file、tokenize、process_file便于扩展。\n",
"- 工程质量提升:\n",
" - 可扩展性:可通过继承 TextAnalyzer 添加新功能(如支持其他分词器或文件格式)。\n",
" - 复用性:类可实例化多次,用于不同目录或参数。\n",
" - 可维护性:逻辑集中在类中,修改相对安全。"
]
},
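{
"cell_type": "markdown",
"id": "b1c2d3e4-f5a6-4b7c-8d9e-0f1a2b3c4d50",
"metadata": {},
"source": [
"To illustrate the extensibility claim, here is a minimal sketch of a hypothetical subclass (the name SearchModeAnalyzer is made up) that overrides tokenize to use jieba's search-engine mode:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c2d3e4-f5a6-4b7c-8d9e-0f1a2b3c4d51",
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"\n",
"class SearchModeAnalyzer(TextAnalyzer):\n",
"    \"\"\"Hypothetical subclass: finer-grained tokens via jieba's search mode.\"\"\"\n",
"    def tokenize(self, text):\n",
"        # lcut_for_search additionally splits long words into shorter ones\n",
"        return jieba.lcut_for_search(text)\n",
"\n",
"# Same interface as the parent class:\n",
"# SearchModeAnalyzer(data_dir='data', top_n=10).run()"
]
},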
{
"cell_type": "markdown",
"id": "9b4e17c4-f47e-4245-b3d9-e40fde0a2e04",
"metadata": {},
"source": [
"# 第四部分:引入文件模块封装\n",
"将代码进一步模块化到不同文件,引入配置文件和停用词过滤。"
]
},
{
"cell_type": "raw",
"id": "aadb5aea-8cc5-4a0f-9f5b-7eab28e90f1a",
"metadata": {},
"source": [
"目录结构\n",
"\n",
"project/\n",
"├── data/ # 小说文本目录\n",
"├── config.yaml # 配置文件\n",
"├── stop_words.txt # 停用词文件\n",
"├── text_analyzer.py # 分析模块\n",
"├── main.py # 主程序"
]
},
{
"cell_type": "raw",
"id": "2de4767b-8928-4f3f-8c8b-3c3cba2bc98a",
"metadata": {},
"source": [
"# config.yaml\n",
"\n",
"data_dir: data\n",
"top_n: 10\n",
"stop_words_file: stop_words.txt\n",
"output_file: output.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b442d61-c937-4757-b7b4-b6fc047c3529",
"metadata": {},
"outputs": [],
"source": [
"# text_analyzer.py\n",
"\n",
"import os\n",
"import jieba\n",
"from collections import Counter\n",
"import yaml\n",
"\n",
"class TextAnalyzer:\n",
" def __init__(self, config_path='config.yaml'):\n",
" with open(config_path, 'r', encoding='utf-8') as f:\n",
" config = yaml.safe_load(f)\n",
" self.data_dir = config['data_dir']\n",
" self.top_n = config['top_n']\n",
" self.stop_words_file = config['stop_words_file']\n",
" self.output_file = config['output_file']\n",
" self.word_count = Counter()\n",
" self.stop_words = self.load_stop_words()\n",
"\n",
" def load_stop_words(self):\n",
" \"\"\"加载停用词\"\"\"\n",
" try:\n",
" with open(self.stop_words_file, 'r', encoding='utf-8') as f:\n",
" return set(line.strip() for line in f if line.strip())\n",
" except Exception as e:\n",
" print(f\"Error loading stop words: {e}\")\n",
" return set()\n",
"\n",
" def read_file(self, file_path):\n",
" \"\"\"读取文件内容\"\"\"\n",
" try:\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" return f.read()\n",
" except Exception as e:\n",
" print(f\"Error reading {file_path}: {e}\")\n",
" return \"\"\n",
"\n",
" def tokenize(self, text):\n",
" \"\"\"中文分词并过滤停用词\"\"\"\n",
" words = jieba.lcut(text)\n",
" return [word for word in words if word not in self.stop_words]\n",
"\n",
" def process_file(self, file_path):\n",
" \"\"\"处理单个文件\"\"\"\n",
" if file_path.endswith('.txt'):\n",
" text = self.read_file(file_path)\n",
" words = self.tokenize(text)\n",
" self.word_count.update(words)\n",
"\n",
" def process_directory(self):\n",
" \"\"\"处理目录下所有文件\"\"\"\n",
" for file in os.listdir(self.data_dir):\n",
" file_path = os.path.join(self.data_dir, file)\n",
" self.process_file(file_path)\n",
"\n",
" def get_top_words(self):\n",
" \"\"\"获取前 N 高频词\"\"\"\n",
" return self.word_count.most_common(self.top_n)\n",
"\n",
" def save_results(self, top_words):\n",
" \"\"\"保存结果到文件\"\"\"\n",
" with open(self.output_file, 'w', encoding='utf-8') as f:\n",
" for word, count in top_words:\n",
" f.write(f\"{word}: {count}\\n\")\n",
"\n",
" def run(self):\n",
" \"\"\"执行词频统计并保存结果\"\"\"\n",
" self.process_directory()\n",
" top_words = self.get_top_words()\n",
" self.save_results(top_words)\n",
" for word, count in top_words:\n",
" print(f\"{word}: {count}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22f58992-0108-4c90-894d-e756e7301a5a",
"metadata": {},
"outputs": [],
"source": [
"# main.py\n",
"\n",
"from text_analyzer import TextAnalyzer\n",
"\n",
"def main():\n",
" analyzer = TextAnalyzer()\n",
" analyzer.run()\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"id": "18d27410-8923-4662-a6b7-8e027609506e",
"metadata": {},
"source": [
"## 改进分析\n",
"\n",
"- 模块化:将分析逻辑放入 text_analyzer.py主程序 main.py 仅负责调用,符合工程化项目结构。\n",
"- 配置文件:通过 config.yaml 配置参数,增强灵活性,无需修改代码即可更改目录、输出文件等。\n",
"- 输出到文件:增加 save_results 方法,支持结果持久化。\n",
"- 工程质量提升:\n",
" - 可维护性:配置文件和模块化分离了配置与逻辑,修改配置无需动代码。 \n",
" - 复用性:模块可导入到其他项目,类可重复实例化。"
]
},
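{
"cell_type": "markdown",
"id": "c2d3e4f5-a6b7-4c8d-9e0f-1a2b3c4d5e60",
"metadata": {},
"source": [
"A sketch of that reuse, assuming text_analyzer.py is on the import path; the second config file name (config_dev.yaml) is hypothetical:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2d3e4f5-a6b7-4c8d-9e0f-1a2b3c4d5e61",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical reuse from another script or project:\n",
"from text_analyzer import TextAnalyzer\n",
"\n",
"prod = TextAnalyzer(config_path='config.yaml')\n",
"dev = TextAnalyzer(config_path='config_dev.yaml')  # hypothetical second config\n",
"# Each instance keeps its own word_count, stop words, and output settings."
]
},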
{
"cell_type": "markdown",
"id": "10876929-69f9-43bf-ba2d-a5d7bb11f22b",
"metadata": {},
"source": [
"### 封装的总节\n",
"\n",
"封装方法:\n",
"- 模块化:函数划分逻辑,降低耦合。\n",
"- 函数封装:将重复逻辑封装为函数,提高复用性。\n",
"- 类封装:将数据和方法绑定,增强代码组织性和扩展性。\n",
"- 文件封装:通过文件模块化,符合工程化开发规范。\n",
"\n",
"工程质量提升:\n",
"- 分离配置与逻辑,降低维护成本。\n",
"- 模块化和面向对象设计支持功能扩展。\n",
"- 错误处理提高程序鲁棒性。"
]
},
{
"cell_type": "raw",
"id": "60ba30d8-d8c2-4183-996e-376ff71716bf",
"metadata": {},
"source": [
"## 另外一种文件模块化设计(分层架构)示例\n",
"\n",
"将代码拆分为独立模块,每个模块仅负责单一职责:\n",
" - 数据读取层:遍历目录、读取文件内容\n",
" - 数据处理层:文本清洗、分词、统计词频\n",
" - 结果输出层排序并输出前10高频词\n",
"\n",
"目录结构:\n",
"project/\n",
"├── data_loader.py # 数据读取模块\n",
"├── text_processor.py # 数据处理模块\n",
"├── output_handler.py # 结果输出模块\n",
"└── main.py # 主程序入口"
]
},
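{
"cell_type": "markdown",
"id": "d3e4f5a6-b7c8-4d9e-8f1a-2b3c4d5e6f70",
"metadata": {},
"source": [
"A minimal sketch of what the three layers might look like (module boundaries shown as comments; the function names are illustrative, not prescribed by the lesson):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3e4f5a6-b7c8-4d9e-8f1a-2b3c4d5e6f71",
"metadata": {},
"outputs": [],
"source": [
"# data_loader.py -- data access layer (sketch)\n",
"import os\n",
"\n",
"def load_texts(data_dir):\n",
"    \"\"\"Yield the contents of every .txt file in data_dir.\"\"\"\n",
"    for name in os.listdir(data_dir):\n",
"        path = os.path.join(data_dir, name)\n",
"        if path.endswith('.txt'):\n",
"            with open(path, 'r', encoding='utf-8') as f:\n",
"                yield f.read()\n",
"\n",
"# text_processor.py -- processing layer (sketch)\n",
"import jieba\n",
"from collections import Counter\n",
"\n",
"def count_frequencies(texts, stop_words=frozenset()):\n",
"    \"\"\"Tokenize each text and accumulate word frequencies.\"\"\"\n",
"    counts = Counter()\n",
"    for text in texts:\n",
"        counts.update(w for w in jieba.lcut(text) if w not in stop_words)\n",
"    return counts\n",
"\n",
"# output_handler.py -- output layer (sketch)\n",
"def print_top(counts, n=10):\n",
"    \"\"\"Print the n most frequent words.\"\"\"\n",
"    for word, count in counts.most_common(n):\n",
"        print(f\"{word}: {count}\")\n",
"\n",
"# main.py -- wires the layers together:\n",
"# print_top(count_frequencies(load_texts('data')), n=10)"
]
},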
{
"cell_type": "markdown",
"id": "517759ac-c4cf-402e-86f1-a9fae0d88bbb",
"metadata": {},
"source": [
"# 第七部分:运行说明\n",
"\n",
"环境准备:\n",
"- 安装 Python 3.8+。\n",
"- 安装依赖pip install jieba pyyaml。\n",
"- 准备 data 目录,放入 100 个 txt 文件。\n",
"- 创建 stop_words.txt 和 config.yaml。"
]
},
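{
"cell_type": "markdown",
"id": "e4f5a6b7-c8d9-4e0f-8a1b-3c4d5e6f7a80",
"metadata": {},
"source": [
"For reference, stop_words.txt holds one stop word per line. Below is a minimal sample using common Chinese function words (the first line is just a label, as in the config.yaml cell above; the actual list is up to you):"
]
},
{
"cell_type": "raw",
"id": "e4f5a6b7-c8d9-4e0f-8a1b-3c4d5e6f7a81",
"metadata": {},
"source": [
"# stop_words.txt\n",
"\n",
"的\n",
"了\n",
"是\n",
"在\n",
"和"
]
},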
{
"cell_type": "markdown",
"id": "a7e1836b-42a1-45f9-bf8c-2e04a38744e4",
"metadata": {},
"source": [
"通过从无结构到结构化,再到面向对象和模块化的逐步优化,展示了结构化编程和封装方法如何显著提升代码工程质量。最终实现不仅满足了词频统计需求,还具备高可读性、可维护性、可扩展性和复用性,适合实际工程应用。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}