{ "cells": [ { "cell_type": "raw", "id": "eccfe49f-de35-4241-90e3-a7095940b61a", "metadata": {}, "source": [ "设计模式提供高频重复出现的需求的最佳解决方案。以下介绍适合词频统计案例的设计模式：策略模式、观察者模式、工厂模式。" ] }, { "cell_type": "markdown", "id": "c186171f-d1f2-433e-a3eb-b266e2909a2c", "metadata": {}, "source": [ "## 策略模式（动态选择分词策略）\n", "\n", "策略模式允许动态切换算法（如分词器），比元编程简单。" ] }, { "cell_type": "code", "execution_count": null, "id": "97c865cb-0b5a-4fa1-aa74-5ba2e65e7436", "metadata": {}, "outputs": [], "source": [ "from abc import ABC, abstractmethod\n", "\n", "class Tokenizer(ABC):\n", " \"\"\"分词器接口\"\"\"\n", " @abstractmethod\n", " def tokenize(self, text: str, stop_words: set) -> List[str]:\n", " pass\n", "\n", "class JiebaTokenizer(Tokenizer):\n", " \"\"\"jieba 分词器\"\"\"\n", " def tokenize(self, text: str, stop_words: set) -> List[str]:\n", " return [w for w in jieba.lcut(text) if w not in stop_words]\n", "\n", "class SimpleTokenizer(Tokenizer):\n", " \"\"\"简单分词器\"\"\"\n", " def tokenize(self, text: str, stop_words: set) -> List[str]:\n", " return [w for w in text.split() if w not in stop_words]\n", "\n", "class TextAnalyzer:\n", " def __init__(self, config_path='config.yaml'):\n", " with open(config_path, 'r', encoding='utf-8') as f:\n", " config = yaml.safe_load(f)\n", " self.data_dir = config['data_dir']\n", " self.top_n = config['top_n']\n", " self.stop_words_file = config['stop_words_file']\n", " self.output_file = config['output_file']\n", " self.stop_words = self.load_stop_words()\n", " self.word_count = Counter()\n", " # 动态选择分词器\n", " tokenizer_name = config.get('tokenizer', 'jieba')\n", " self.tokenizer = {'jieba': JiebaTokenizer(), 'simple': SimpleTokenizer()}[tokenizer_name]\n", "\n", " def tokenize(self, text: str) -> List[str]:\n", " \"\"\"使用策略分词\"\"\"\n", " return self.tokenizer.tokenize(text, self.stop_words)\n", "\n", " # 其余方法同上" ] }, { "cell_type": "markdown", "id": "5435ebc3-d3b0-4475-8bd5-cb45fb51638c", "metadata": {}, "source": [ "工程质量提升：\n", "- 可扩展性：添加新分词器只需实现 Tokenizer 接口。\n", "- 可维护性：分词逻辑与主类分离，修改更独立。\n", "\n", "适用场景：适合需要动态切换算法的场景。" ] }, { "cell_type": "markdown", "id": "fbf53455-558c-40fb-8718-446dec989b5d", "metadata": {}, "source": [ "## 观察者模式（结果输出解耦）\n", "\n", "观察者模式可用于解耦结果输出逻辑（如打印、保存文件、发送通知）。" ] }, { "cell_type": "code", "execution_count": null, "id": "d7a2bd4c-df73-4800-b45b-9b6c73d28d7b", "metadata": {}, "outputs": [], "source": [ "class OutputObserver(ABC):\n", " \"\"\"输出观察者接口\"\"\"\n", " @abstractmethod\n", " def update(self, top_words: List[Tuple[str, int]]):\n", " pass\n", "\n", "class ConsoleOutput(OutputObserver):\n", " \"\"\"控制台输出\"\"\"\n", " def update(self, top_words: List[Tuple[str, int]]):\n", " for word, count in top_words:\n", " print(f\"{word}: {count}\")\n", "\n", "class FileOutput(OutputObserver):\n", " \"\"\"文件输出\"\"\"\n", " def __init__(self, output_file: str):\n", " self.output_file = output_file\n", "\n", " def update(self, top_words: List[Tuple[str, int]]):\n", " with open(self.output_file, 'w', encoding='utf-8') as f:\n", " for word, count in top_words:\n", " f.write(f\"{word}: {count}\\n\")\n", "\n", "class TextAnalyzer:\n", " def __init__(self, config_path='config.yaml'):\n", " with open(config_path, 'r', encoding='utf-8') as f:\n", " config = yaml.safe_load(f)\n", " self.data_dir = config['data_dir']\n", " self.top_n = config['top_n']\n", " self.stop_words_file = config['stop_words_file']\n", " self.output_file = config['output_file']\n", " self.stop_words = self.load_stop_words()\n", " self.word_count = Counter()\n", " self.observers = [ConsoleOutput(), FileOutput(self.output_file)]\n", "\n", " def add_observer(self, observer: OutputObserver):\n", " \"\"\"添加观察者\"\"\"\n", " self.observers.append(observer)\n", "\n", " def notify_observers(self, top_words: List[Tuple[str, int]]):\n", " \"\"\"通知所有观察者\"\"\"\n", " for observer in self.observers:\n", " observer.update(top_words)\n", "\n", " def run(self):\n", " \"\"\"执行词频统计并通知观察者\"\"\"\n", " self.process_directory()\n", " top_words = self.get_top_words()\n", " self.notify_observers(top_words)\n", "\n", " # 其余方法同上" ] }, { "cell_type": "markdown", "id": "02b5cfba-431c-4a01-a454-099e4f41922c", "metadata": {}, "source": [ "### 分析\n", "\n", "工程质量提升：\n", " - 可扩展性：添加新输出方式只需实现 OutputObserver 接口。\n", " - 解耦性：输出逻辑与统计逻辑分离，修改输出不影响核心功能。\n", "\n", "适用场景：适合需要多种输出或通知的场景。\n", "\n", "局限性：观察者模式增加代码复杂性，适合复杂输出需求。" ] }, { "cell_type": "markdown", "id": "11669305-8cd5-4317-afd5-e85c3f0a5a81", "metadata": {}, "source": [ "## 工厂模式（动态创建分词器）\n", "\n", "工厂模式可用于动态创建分词器，简化策略模式中的初始化逻辑。" ] }, { "cell_type": "code", "execution_count": null, "id": "2fa50633-de22-40c8-912d-3ded5ebcedfc", "metadata": {}, "outputs": [], "source": [ "class TokenizerFactory:\n", " \"\"\"分词器工厂\"\"\"\n", " @staticmethod\n", " def create_tokenizer(name: str) -> Tokenizer:\n", " tokenizers = {\n", " 'jieba': JiebaTokenizer(),\n", " 'simple': SimpleTokenizer()\n", " }\n", " return tokenizers.get(name, JiebaTokenizer())\n", "\n", "class TextAnalyzer:\n", " def __init__(self, config_path='config.yaml'):\n", " with open(config_path, 'r', encoding='utf-8') as f:\n", " config = yaml.safe_load(f)\n", " self.data_dir = config['data_dir']\n", " self.top_n = config['top_n']\n", " self.stop_words_file = config['stop_words_file']\n", " self.output_file = config['output_file']\n", " self.stop_words = self.load_stop_words()\n", " self.word_count = Counter()\n", " self.tokenizer = TokenizerFactory.create_tokenizer(config.get('tokenizer', 'jieba'))\n", "\n", " # 其余方法同上" ] }, { "cell_type": "markdown", "id": "a4db7046-dfe2-4bd8-81d1-49a42e2eeb5c", "metadata": {}, "source": [ "### 分析\n", "\n", "工程质量提升：\n", " - 可维护性：分词器创建逻辑集中于工厂，易于修改。\n", " - 可扩展性：添加新分词器只需更新工厂方法。\n", "\n", "适用场景：适合需要动态创建对象的场景。\n", "\n", "局限性：对于简单场景，工厂模式可能略显冗余。" ] }, { "cell_type": "markdown", "id": "07158f09-703e-4abb-ac1a-881ba1b3b26d", "metadata": {}, "source": [ "## 附：元编程\n", "\n", "元编程允许动态修改类或函数行为，可用于动态配置分词器或输出格式。案例中，可通过元编程动态注册分词器。" ] }, { "cell_type": "code", "execution_count": null, "id": "4394008c-88da-44bd-aa0d-f1b7a6dbc7d6", "metadata": {}, "outputs": [], "source": [ "class TokenizerRegistry(type):\n", " \"\"\"元类：动态注册分词器\"\"\"\n", " tokenizers = {}\n", "\n", " def register_tokenizer(cls, name):\n", " def decorator(func):\n", " cls.tokenizers[name] = func\n", " return func\n", " return decorator\n", "\n", "class TextAnalyzer(metaclass=TokenizerRegistry):\n", " def __init__(self, config_path='config.yaml'):\n", " with open(config_path, 'r', encoding='utf-8') as f:\n", " config = yaml.safe_load(f)\n", " self.data_dir = config['data_dir']\n", " self.top_n = config['top_n']\n", " self.stop_words_file = config['stop_words_file']\n", " self.output_file = config['output_file']\n", " self.stop_words = self.load_stop_words()\n", " self.word_count = Counter()\n", " self.tokenizer_name = config.get('tokenizer', 'jieba')\n", "\n", " @classmethod\n", " def register_tokenizer(cls, name):\n", " return cls.__class__.register_tokenizer(name)\n", "\n", " def tokenize(self, text: str) -> List[str]:\n", " \"\"\"动态调用分词器\"\"\"\n", " tokenizer = self.__class__.tokenizers.get(self.tokenizer_name)\n", " return tokenizer(self, text)\n", "\n", " @register_tokenizer('jieba')\n", " def jieba_tokenizer(self, text: str) -> List[str]:\n", " \"\"\"jieba 分词\"\"\"\n", " return [w for w in jieba.lcut(text) if w not in self.stop_words]\n", "\n", " @register_tokenizer('simple')\n", " def simple_tokenizer(self, text: str) -> List[str]:\n", " \"\"\"简单分词（按空格）\"\"\"\n", " return [w for w in text.split() if w not in self.stop_words]\n", "\n", " # 其余方法（load_stop_words, process_file, etc.）同上" ] }, { "cell_type": "markdown", "id": "30ba75ea-f769-4f90-9075-27670db9ada4", "metadata": {}, "source": [ "### 分析\n", "\n", "工程质量提升：\n", "- 可扩展性：新分词器只需添加新方法并注册，无需修改核心部分。\n", "- 灵活性：通过配置文件动态选择分词器。\n", "\n", "适用场景：适合需要动态配置或插件化系统的场景。\n", "\n", "局限性：元编程增加代码复杂性，需要团队整体技术能力支持。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }