CodePattern/D Plus/03 设计模式.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "eccfe49f-de35-4241-90e3-a7095940b61a",
   "metadata": {},
   "source": [
    "设计模式提供高频重复出现的需求的最佳解决方案。以下介绍适合词频统计案例的设计模式：策略模式、观察者模式、工厂模式。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c186171f-d1f2-433e-a3eb-b266e2909a2c",
   "metadata": {},
   "source": [
    "##  策略模式（动态选择分词策略）\n",
    "\n",
    "策略模式允许动态切换算法（如分词器），比元编程简单。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97c865cb-0b5a-4fa1-aa74-5ba2e65e7436",
   "metadata": {},
   "outputs": [],
   "source": [
    "from abc import ABC, abstractmethod\n",
    "\n",
    "class Tokenizer(ABC):\n",
    "    \"\"\"分词器接口\"\"\"\n",
    "    @abstractmethod\n",
    "    def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
    "        pass\n",
    "\n",
    "class JiebaTokenizer(Tokenizer):\n",
    "    \"\"\"jieba 分词器\"\"\"\n",
    "    def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
    "        return [w for w in jieba.lcut(text) if w not in stop_words]\n",
    "\n",
    "class SimpleTokenizer(Tokenizer):\n",
    "    \"\"\"简单分词器\"\"\"\n",
    "    def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
    "        return [w for w in text.split() if w not in stop_words]\n",
    "\n",
    "class TextAnalyzer:\n",
    "    def __init__(self, config_path='config.yaml'):\n",
    "        with open(config_path, 'r', encoding='utf-8') as f:\n",
    "            config = yaml.safe_load(f)\n",
    "        self.data_dir = config['data_dir']\n",
    "        self.top_n = config['top_n']\n",
    "        self.stop_words_file = config['stop_words_file']\n",
    "        self.output_file = config['output_file']\n",
    "        self.stop_words = self.load_stop_words()\n",
    "        self.word_count = Counter()\n",
    "        # 动态选择分词器\n",
    "        tokenizer_name = config.get('tokenizer', 'jieba')\n",
    "        self.tokenizer = {'jieba': JiebaTokenizer(), 'simple': SimpleTokenizer()}[tokenizer_name]\n",
    "\n",
    "    def tokenize(self, text: str) -> List[str]:\n",
    "        \"\"\"使用策略分词\"\"\"\n",
    "        return self.tokenizer.tokenize(text, self.stop_words)\n",
    "\n",
    "    # 其余方法同上"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5435ebc3-d3b0-4475-8bd5-cb45fb51638c",
   "metadata": {},
   "source": [
    "工程质量提升：\n",
    "- 可扩展性：添加新分词器只需实现 Tokenizer 接口。\n",
    "- 可维护性：分词逻辑与主类分离，修改更独立。\n",
    "\n",
    "适用场景：适合需要动态切换算法的场景。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbf53455-558c-40fb-8718-446dec989b5d",
   "metadata": {},
   "source": [
    "## 观察者模式（结果输出解耦）\n",
    "\n",
    "观察者模式可用于解耦结果输出逻辑（如打印、保存文件、发送通知）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7a2bd4c-df73-4800-b45b-9b6c73d28d7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "class OutputObserver(ABC):\n",
    "    \"\"\"输出观察者接口\"\"\"\n",
    "    @abstractmethod\n",
    "    def update(self, top_words: List[Tuple[str, int]]):\n",
    "        pass\n",
    "\n",
    "class ConsoleOutput(OutputObserver):\n",
    "    \"\"\"控制台输出\"\"\"\n",
    "    def update(self, top_words: List[Tuple[str, int]]):\n",
    "        for word, count in top_words:\n",
    "            print(f\"{word}: {count}\")\n",
    "\n",
    "class FileOutput(OutputObserver):\n",
    "    \"\"\"文件输出\"\"\"\n",
    "    def __init__(self, output_file: str):\n",
    "        self.output_file = output_file\n",
    "\n",
    "    def update(self, top_words: List[Tuple[str, int]]):\n",
    "        with open(self.output_file, 'w', encoding='utf-8') as f:\n",
    "            for word, count in top_words:\n",
    "                f.write(f\"{word}: {count}\\n\")\n",
    "\n",
    "class TextAnalyzer:\n",
    "    def __init__(self, config_path='config.yaml'):\n",
    "        with open(config_path, 'r', encoding='utf-8') as f:\n",
    "            config = yaml.safe_load(f)\n",
    "        self.data_dir = config['data_dir']\n",
    "        self.top_n = config['top_n']\n",
    "        self.stop_words_file = config['stop_words_file']\n",
    "        self.output_file = config['output_file']\n",
    "        self.stop_words = self.load_stop_words()\n",
    "        self.word_count = Counter()\n",
    "        self.observers = [ConsoleOutput(), FileOutput(self.output_file)]\n",
    "\n",
    "    def add_observer(self, observer: OutputObserver):\n",
    "        \"\"\"添加观察者\"\"\"\n",
    "        self.observers.append(observer)\n",
    "\n",
    "    def notify_observers(self, top_words: List[Tuple[str, int]]):\n",
    "        \"\"\"通知所有观察者\"\"\"\n",
    "        for observer in self.observers:\n",
    "            observer.update(top_words)\n",
    "\n",
    "    def run(self):\n",
    "        \"\"\"执行词频统计并通知观察者\"\"\"\n",
    "        self.process_directory()\n",
    "        top_words = self.get_top_words()\n",
    "        self.notify_observers(top_words)\n",
    "\n",
    "    # 其余方法同上"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02b5cfba-431c-4a01-a454-099e4f41922c",
   "metadata": {},
   "source": [
    "### 分析\n",
    "\n",
    "工程质量提升：\n",
    "  - 可扩展性：添加新输出方式只需实现 OutputObserver 接口。\n",
    "  - 解耦性：输出逻辑与统计逻辑分离，修改输出不影响核心功能。\n",
    "\n",
    "适用场景：适合需要多种输出或通知的场景。\n",
    "\n",
    "局限性：观察者模式增加代码复杂性，适合复杂输出需求。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11669305-8cd5-4317-afd5-e85c3f0a5a81",
   "metadata": {},
   "source": [
    "## 工厂模式（动态创建分词器）\n",
    "\n",
    "工厂模式可用于动态创建分词器，简化策略模式中的初始化逻辑。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fa50633-de22-40c8-912d-3ded5ebcedfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "class TokenizerFactory:\n",
    "    \"\"\"分词器工厂\"\"\"\n",
    "    @staticmethod\n",
    "    def create_tokenizer(name: str) -> Tokenizer:\n",
    "        tokenizers = {\n",
    "            'jieba': JiebaTokenizer(),\n",
    "            'simple': SimpleTokenizer()\n",
    "        }\n",
    "        return tokenizers.get(name, JiebaTokenizer())\n",
    "\n",
    "class TextAnalyzer:\n",
    "    def __init__(self, config_path='config.yaml'):\n",
    "        with open(config_path, 'r', encoding='utf-8') as f:\n",
    "            config = yaml.safe_load(f)\n",
    "        self.data_dir = config['data_dir']\n",
    "        self.top_n = config['top_n']\n",
    "        self.stop_words_file = config['stop_words_file']\n",
    "        self.output_file = config['output_file']\n",
    "        self.stop_words = self.load_stop_words()\n",
    "        self.word_count = Counter()\n",
    "        self.tokenizer = TokenizerFactory.create_tokenizer(config.get('tokenizer', 'jieba'))\n",
    "\n",
    "    # 其余方法同上"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4db7046-dfe2-4bd8-81d1-49a42e2eeb5c",
   "metadata": {},
   "source": [
    "### 分析\n",
    "\n",
    "工程质量提升：\n",
    "   - 可维护性：分词器创建逻辑集中于工厂，易于修改。\n",
    "   - 可扩展性：添加新分词器只需更新工厂方法。\n",
    "\n",
    "适用场景：适合需要动态创建对象的场景。\n",
    "\n",
    "局限性：对于简单场景，工厂模式可能略显冗余。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07158f09-703e-4abb-ac1a-881ba1b3b26d",
   "metadata": {},
   "source": [
    "## 附：元编程\n",
    "\n",
    "元编程允许动态修改类或函数行为，可用于动态配置分词器或输出格式。案例中，可通过元编程动态注册分词器。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4394008c-88da-44bd-aa0d-f1b7a6dbc7d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "class TokenizerRegistry(type):\n",
    "    \"\"\"元类：动态注册分词器\"\"\"\n",
    "    tokenizers = {}\n",
    "\n",
    "    def register_tokenizer(cls, name):\n",
    "        def decorator(func):\n",
    "            cls.tokenizers[name] = func\n",
    "            return func\n",
    "        return decorator\n",
    "\n",
    "class TextAnalyzer(metaclass=TokenizerRegistry):\n",
    "    def __init__(self, config_path='config.yaml'):\n",
    "        with open(config_path, 'r', encoding='utf-8') as f:\n",
    "            config = yaml.safe_load(f)\n",
    "        self.data_dir = config['data_dir']\n",
    "        self.top_n = config['top_n']\n",
    "        self.stop_words_file = config['stop_words_file']\n",
    "        self.output_file = config['output_file']\n",
    "        self.stop_words = self.load_stop_words()\n",
    "        self.word_count = Counter()\n",
    "        self.tokenizer_name = config.get('tokenizer', 'jieba')\n",
    "\n",
    "    @classmethod\n",
    "    def register_tokenizer(cls, name):\n",
    "        return cls.__class__.register_tokenizer(name)\n",
    "\n",
    "    def tokenize(self, text: str) -> List[str]:\n",
    "        \"\"\"动态调用分词器\"\"\"\n",
    "        tokenizer = self.__class__.tokenizers.get(self.tokenizer_name)\n",
    "        return tokenizer(self, text)\n",
    "\n",
    "    @register_tokenizer('jieba')\n",
    "    def jieba_tokenizer(self, text: str) -> List[str]:\n",
    "        \"\"\"jieba 分词\"\"\"\n",
    "        return [w for w in jieba.lcut(text) if w not in self.stop_words]\n",
    "\n",
    "    @register_tokenizer('simple')\n",
    "    def simple_tokenizer(self, text: str) -> List[str]:\n",
    "        \"\"\"简单分词（按空格）\"\"\"\n",
    "        return [w for w in text.split() if w not in self.stop_words]\n",
    "\n",
    "    # 其余方法（load_stop_words, process_file, etc.）同上"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30ba75ea-f769-4f90-9075-27670db9ada4",
   "metadata": {},
   "source": [
    "### 分析\n",
    "\n",
    "工程质量提升：\n",
    "- 可扩展性：新分词器只需添加新方法并注册，无需修改核心部分。\n",
    "- 灵活性：通过配置文件动态选择分词器。\n",
    "\n",
    "适用场景：适合需要动态配置或插件化系统的场景。\n",
    "\n",
    "局限性：元编程增加代码复杂性，需要团队整体技术能力支持 。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}