{ "cells": [ { "cell_type": "raw", "id": "69e76aa7-2c5d-4114-a302-85e17cc83e2c", "metadata": {}, "source": [ "本文旨在通过一个案例（读取 data 目录下 100 篇小说文本，统计词频并输出前 10 高频词）来说明结构化编程和封装方法如何提升代码工程质量。\n", "教案将逐步展示不同结构化方法和封装技术的应用，并分析其对代码可读性、可维护性、可扩展性和复用性的提升。" ] }, { "cell_type": "markdown", "id": "b9a9a366-7fd3-422b-b3bc-b0bc00374da6", "metadata": {}, "source": [ "# 教学目标\n", "- 掌握封装方法（函数、类、模块）在代码组织中的作用。" ] }, { "cell_type": "markdown", "id": "1387e026-c978-4217-9015-ab0e047c01a0", "metadata": {}, "source": [ "## 第一部分：基础实现（无结构化、无封装）" ] }, { "cell_type": "code", "execution_count": null, "id": "33803186-d890-4cd7-9636-8920fcb86e14", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "files = os.listdir('data')\n", "word_count = {}\n", "for file in files:\n", " with open('data/' + file, 'r', encoding='utf-8') as f:\n", " text = f.read()\n", " words = text.split() # 假设简单按空格分词\n", " for word in words:\n", " if word in word_count:\n", " word_count[word] += 1\n", " else:\n", " word_count[word] = 1\n", "\n", "# 排序并输出前10\n", "sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)\n", "for i in range(10):\n", " print(sorted_words[i])" ] }, { "cell_type": "markdown", "id": "471351e7-8645-4690-973a-7d8de53bda5f", "metadata": {}, "source": [ "### 问题分析\n", "\n", "- 可读性差：没有清晰的功能划分，代码逻辑混杂，难以阅读理解维护。\n", "- 扩展性差：如果需要更改分词逻辑、文件路径或输出格式，需修改多处代码。\n", "- 容错性差：未处理文件读取失败、空文件等问题。\n", "- 复用性低：逻辑无法直接复用在其他类似任务中。" ] }, { "cell_type": "markdown", "id": "a5881283-c295-4433-8edd-f915201a5f43", "metadata": {}, "source": [ "## 第二部分：引入函数封装\n", "\n", "提炼出若干函数，减少代码的复杂性，提高可读性和可维护性。" ] }, { "cell_type": "code", "execution_count": null, "id": "7beadc81-f939-4ac5-b885-407c6810b7de", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "def read_file(file_path):\n", " \"\"\"读取单个文件内容\"\"\"\n", " try:\n", " with open(file_path, 'r', encoding='utf-8') as f:\n", " return f.read()\n", " except Exception as e:\n", " print(f\"Error reading {file_path}: {e}\")\n", " return \"\"\n", "\n", "def get_words(text):\n", " \"\"\"简单分词（按空格）\"\"\"\n", " return text.split()\n", "\n", "def count_words(words):\n", " \"\"\"统计词频\"\"\"\n", " word_count = {}\n", " for word in words:\n", " word_count[word] = word_count.get(word, 0) + 1\n", " return word_count\n", "\n", "def get_top_n(word_count, n=10):\n", " \"\"\"获取前 N 高频词\"\"\"\n", " return sorted(word_count.items(), key=lambda x: x[1], reverse=True)[:n]\n", "\n", "def main():\n", " \"\"\"主函数，控制流程\"\"\"\n", " word_count = {}\n", " data_dir = 'data'\n", " \n", " # 顺序结构：按步骤读取文件、处理文本\n", " for file in os.listdir(data_dir):\n", " file_path = os.path.join(data_dir, file)\n", " # 选择结构：检查文件是否为 txt\n", " if file_path.endswith('.txt'):\n", " text = read_file(file_path)\n", " # 循环结构：处理每个文件的词\n", " words = get_words(text)\n", " file_word_count = count_words(words)\n", " # 合并词频\n", " for word, count in file_word_count.items():\n", " word_count[word] = word_count.get(word, 0) + count\n", " \n", " # 输出结果\n", " top_words = get_top_n(word_count)\n", " for word, count in top_words:\n", " print(f\"{word}: {count}\")\n", "\n", "if __name__ == '__main__':\n", " main()" ] }, { "cell_type": "markdown", "id": "4f7218a3-43d2-4159-9854-9880020c42fc", "metadata": {}, "source": [ "### 改进分析\n", " - 逻辑分层：main() 函数清晰定义了程序执行步骤（读取文件 -> 分词 -> 统计 -> 输出）。\n", " - 模块化：将功能拆分为函数（read_file、get_words、count_words、get_top_n），提高代码复用性和可读性。\n", " - 错误处理：增加 try-except 处理文件读取异常。\n", " - 工程质量提升：\n", " - 可读性：函数命名本身就帮助理解代码，逻辑分块。\n", " - 可维护性：修改某部分功能（如分词逻辑）只需改对应函数。\n", " - 复用性：函数可复用在其他类似任务中。" ] }, { "cell_type": "markdown", "id": "50737966-57c9-4daf-ac3b-6a1c73b18136", "metadata": {}, "source": [ "## 第三部分：引入类封装\n", "\n", "通过类封装功能，进一步提高代码的模块化、可扩展性和复用性。" ] }, { "cell_type": "code", "execution_count": null, "id": "81aa7f9c-de28-4a7a-8ba1-130c3e5e4f7f", "metadata": {}, "outputs": [], "source": [ "import os\n", "import jieba\n", "from collections import Counter\n", "\n", "class TextAnalyzer:\n", " \"\"\"文本分析类，封装词频统计功能\"\"\"\n", " def __init__(self, data_dir='data', top_n=10):\n", " self.data_dir = data_dir\n", " self.top_n = top_n\n", " self.word_count = Counter()\n", "\n", " def read_file(self, file_path):\n", " \"\"\"读取文件内容\"\"\"\n", " try:\n", " with open(file_path, 'r', encoding='utf-8') as f:\n", " return f.read()\n", " except Exception as e:\n", " print(f\"Error reading {file_path}: {e}\")\n", " return \"\"\n", "\n", " def tokenize(self, text):\n", " \"\"\"使用 jieba 进行中文分词\"\"\"\n", " return jieba.lcut(text)\n", "\n", " def process_file(self, file_path):\n", " \"\"\"处理单个文件\"\"\"\n", " if file_path.endswith('.txt'):\n", " text = self.read_file(file_path)\n", " words = self.tokenize(text)\n", " self.word_count.update(words)\n", "\n", " def process_directory(self):\n", " \"\"\"处理目录下所有文件\"\"\"\n", " for file in os.listdir(self.data_dir):\n", " file_path = os.path.join(self.data_dir, file)\n", " self.process_file(file_path)\n", "\n", " def get_top_words(self):\n", " \"\"\"获取前 N 高频词\"\"\"\n", " return self.word_count.most_common(self.top_n)\n", "\n", " def run(self):\n", " \"\"\"执行词频统计\"\"\"\n", " self.process_directory()\n", " top_words = self.get_top_words()\n", " for word, count in top_words:\n", " print(f\"{word}: {count}\")\n", "\n", "def main():\n", " analyzer = TextAnalyzer(data_dir='data', top_n=10)\n", " analyzer.run()\n", "\n", "if __name__ == '__main__':\n", " main()" ] }, { "cell_type": "markdown", "id": "62e780d4-94de-4830-89c2-ab2c96500fc5", "metadata": {}, "source": [ "### 改进分析\n", "- 面向对象封装：\n", " - 使用 TextAnalyzer 类将所有功能封装为一个对象，数据（如 word_count）和方法（如 tokenize）绑定在一起。\n", " - 通过 __init__ 提供配置（如 data_dir 和 top_n），提高灵活性。\n", " \n", "- 模块化：类方法分工明确（如 read_file、tokenize、process_file），便于扩展。\n", "- 工程质量提升：\n", " - 可扩展性：可通过继承 TextAnalyzer 添加新功能（如支持其他分词器或文件格式）。\n", " - 复用性：类可实例化多次，用于不同目录或参数。\n", " - 可维护性：逻辑集中在类中，修改相对安全。" ] }, { "cell_type": "markdown", "id": "9b4e17c4-f47e-4245-b3d9-e40fde0a2e04", "metadata": {}, "source": [ "# 第四部分：引入文件模块封装\n", "将代码进一步模块化到不同文件，引入配置文件和停用词过滤。" ] }, { "cell_type": "raw", "id": "aadb5aea-8cc5-4a0f-9f5b-7eab28e90f1a", "metadata": {}, "source": [ "目录结构\n", "\n", "project/\n", "├── data/ # 小说文本目录\n", "├── config.yaml # 配置文件\n", "├── stop_words.txt # 停用词文件\n", "├── text_analyzer.py # 分析模块\n", "├── main.py # 主程序" ] }, { "cell_type": "raw", "id": "2de4767b-8928-4f3f-8c8b-3c3cba2bc98a", "metadata": {}, "source": [ "# config.yaml\n", "\n", "data_dir: data\n", "top_n: 10\n", "stop_words_file: stop_words.txt\n", "output_file: output.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "9b442d61-c937-4757-b7b4-b6fc047c3529", "metadata": {}, "outputs": [], "source": [ "# text_analyzer.py\n", "\n", "import os\n", "import jieba\n", "from collections import Counter\n", "import yaml\n", "\n", "class TextAnalyzer:\n", " def __init__(self, config_path='config.yaml'):\n", " with open(config_path, 'r', encoding='utf-8') as f:\n", " config = yaml.safe_load(f)\n", " self.data_dir = config['data_dir']\n", " self.top_n = config['top_n']\n", " self.stop_words_file = config['stop_words_file']\n", " self.output_file = config['output_file']\n", " self.word_count = Counter()\n", " self.stop_words = self.load_stop_words()\n", "\n", " def load_stop_words(self):\n", " \"\"\"加载停用词\"\"\"\n", " try:\n", " with open(self.stop_words_file, 'r', encoding='utf-8') as f:\n", " return set(line.strip() for line in f if line.strip())\n", " except Exception as e:\n", " print(f\"Error loading stop words: {e}\")\n", " return set()\n", "\n", " def read_file(self, file_path):\n", " \"\"\"读取文件内容\"\"\"\n", " try:\n", " with open(file_path, 'r', encoding='utf-8') as f:\n", " return f.read()\n", " except Exception as e:\n", " print(f\"Error reading {file_path}: {e}\")\n", " return \"\"\n", "\n", " def tokenize(self, text):\n", " \"\"\"中文分词并过滤停用词\"\"\"\n", " words = jieba.lcut(text)\n", " return [word for word in words if word not in self.stop_words]\n", "\n", " def process_file(self, file_path):\n", " \"\"\"处理单个文件\"\"\"\n", " if file_path.endswith('.txt'):\n", " text = self.read_file(file_path)\n", " words = self.tokenize(text)\n", " self.word_count.update(words)\n", "\n", " def process_directory(self):\n", " \"\"\"处理目录下所有文件\"\"\"\n", " for file in os.listdir(self.data_dir):\n", " file_path = os.path.join(self.data_dir, file)\n", " self.process_file(file_path)\n", "\n", " def get_top_words(self):\n", " \"\"\"获取前 N 高频词\"\"\"\n", " return self.word_count.most_common(self.top_n)\n", "\n", " def save_results(self, top_words):\n", " \"\"\"保存结果到文件\"\"\"\n", " with open(self.output_file, 'w', encoding='utf-8') as f:\n", " for word, count in top_words:\n", " f.write(f\"{word}: {count}\\n\")\n", "\n", " def run(self):\n", " \"\"\"执行词频统计并保存结果\"\"\"\n", " self.process_directory()\n", " top_words = self.get_top_words()\n", " self.save_results(top_words)\n", " for word, count in top_words:\n", " print(f\"{word}: {count}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "22f58992-0108-4c90-894d-e756e7301a5a", "metadata": {}, "outputs": [], "source": [ "# main.py\n", "\n", "from text_analyzer import TextAnalyzer\n", "\n", "def main():\n", " analyzer = TextAnalyzer()\n", " analyzer.run()\n", "\n", "if __name__ == '__main__':\n", " main()" ] }, { "cell_type": "markdown", "id": "18d27410-8923-4662-a6b7-8e027609506e", "metadata": {}, "source": [ "## 改进分析\n", "\n", "- 模块化：将分析逻辑放入 text_analyzer.py，主程序 main.py 仅负责调用，符合工程化项目结构。\n", "- 配置文件：通过 config.yaml 配置参数，增强灵活性，无需修改代码即可更改目录、输出文件等。\n", "- 输出到文件：增加 save_results 方法，支持结果持久化。\n", "- 工程质量提升：\n", " - 可维护性：配置文件和模块化分离了配置与逻辑，修改配置无需动代码。 \n", " - 复用性：模块可导入到其他项目，类可重复实例化。" ] }, { "cell_type": "markdown", "id": "10876929-69f9-43bf-ba2d-a5d7bb11f22b", "metadata": {}, "source": [ "### 封装的总节\n", "\n", "封装方法：\n", "- 模块化：函数划分逻辑，降低耦合。\n", "- 函数封装：将重复逻辑封装为函数，提高复用性。\n", "- 类封装：将数据和方法绑定，增强代码组织性和扩展性。\n", "- 文件封装：通过文件模块化，符合工程化开发规范。\n", "\n", "工程质量提升：\n", "- 分离配置与逻辑，降低维护成本。\n", "- 模块化和面向对象设计支持功能扩展。\n", "- 错误处理提高程序鲁棒性。" ] }, { "cell_type": "raw", "id": "60ba30d8-d8c2-4183-996e-376ff71716bf", "metadata": {}, "source": [ "## 另外一种文件模块化设计（分层架构）示例\n", "\n", "将代码拆分为独立模块，每个模块仅负责单一职责：\n", " - 数据读取层：遍历目录、读取文件内容\n", " - 数据处理层：文本清洗、分词、统计词频\n", " - 结果输出层：排序并输出前10高频词\n", "\n", "目录结构：\n", "project/\n", "├── data_loader.py # 数据读取模块\n", "├── text_processor.py # 数据处理模块\n", "├── output_handler.py # 结果输出模块\n", "└── main.py # 主程序入口" ] }, { "cell_type": "markdown", "id": "517759ac-c4cf-402e-86f1-a9fae0d88bbb", "metadata": {}, "source": [ "# 第七部分：运行说明\n", "\n", "环境准备：\n", "- 安装 Python 3.8+。\n", "- 安装依赖：pip install jieba pyyaml。\n", "- 准备 data 目录，放入 100 个 txt 文件。\n", "- 创建 stop_words.txt 和 config.yaml。" ] }, { "cell_type": "markdown", "id": "a7e1836b-42a1-45f9-bf8c-2e04a38744e4", "metadata": {}, "source": [ "通过从无结构到结构化，再到面向对象和模块化的逐步优化，展示了结构化编程和封装方法如何显著提升代码工程质量。最终实现不仅满足了词频统计需求，还具备高可读性、可维护性、可扩展性和复用性，适合实际工程应用。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }