{
"cells": [
{
"cell_type": "raw",
"id": "eccfe49f-de35-4241-90e3-a7095940b61a",
"metadata": {},
"source": [
"设计模式提供高频重复出现的需求的最佳解决方案。以下介绍适合词频统计案例的设计模式:策略模式、观察者模式、工厂模式。"
]
},
{
"cell_type": "markdown",
"id": "c186171f-d1f2-433e-a3eb-b266e2909a2c",
"metadata": {},
"source": [
"## 策略模式(动态选择分词策略)\n",
"\n",
"策略模式允许动态切换算法(如分词器),比元编程简单。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97c865cb-0b5a-4fa1-aa74-5ba2e65e7436",
"metadata": {},
"outputs": [],
"source": [
"from abc import ABC, abstractmethod\n",
"\n",
"class Tokenizer(ABC):\n",
" \"\"\"分词器接口\"\"\"\n",
" @abstractmethod\n",
" def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
" pass\n",
"\n",
"class JiebaTokenizer(Tokenizer):\n",
" \"\"\"jieba 分词器\"\"\"\n",
" def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
" return [w for w in jieba.lcut(text) if w not in stop_words]\n",
"\n",
"class SimpleTokenizer(Tokenizer):\n",
" \"\"\"简单分词器\"\"\"\n",
" def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
" return [w for w in text.split() if w not in stop_words]\n",
"\n",
"class TextAnalyzer:\n",
" def __init__(self, config_path='config.yaml'):\n",
" with open(config_path, 'r', encoding='utf-8') as f:\n",
" config = yaml.safe_load(f)\n",
" self.data_dir = config['data_dir']\n",
" self.top_n = config['top_n']\n",
" self.stop_words_file = config['stop_words_file']\n",
" self.output_file = config['output_file']\n",
" self.stop_words = self.load_stop_words()\n",
" self.word_count = Counter()\n",
" # 动态选择分词器\n",
" tokenizer_name = config.get('tokenizer', 'jieba')\n",
" self.tokenizer = {'jieba': JiebaTokenizer(), 'simple': SimpleTokenizer()}[tokenizer_name]\n",
"\n",
" def tokenize(self, text: str) -> List[str]:\n",
" \"\"\"使用策略分词\"\"\"\n",
" return self.tokenizer.tokenize(text, self.stop_words)\n",
"\n",
" # 其余方法同上"
]
},
{
"cell_type": "markdown",
"id": "5435ebc3-d3b0-4475-8bd5-cb45fb51638c",
"metadata": {},
"source": [
"工程质量提升:\n",
"- 可扩展性:添加新分词器只需实现 Tokenizer 接口。\n",
"- 可维护性:分词逻辑与主类分离,修改更独立。\n",
"\n",
"适用场景:适合需要动态切换算法的场景。"
]
},
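{
"cell_type": "markdown",
"id": "3f8a1c2e-9b4d-4e6a-8c1f-2d5e7a9b0c3d",
"metadata": {},
"source": [
"To sketch the extensibility claim: a new strategy only has to implement `Tokenizer.tokenize`. The `RegexTokenizer` below is a hypothetical example, not part of the case study."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b2d4f6a-1e3c-4a5b-9d8e-0f1a2b3c4d5e",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"class RegexTokenizer(Tokenizer):\n",
"    \"\"\"Hypothetical strategy: extract word characters with a regex.\"\"\"\n",
"    def tokenize(self, text: str, stop_words: set) -> List[str]:\n",
"        return [w for w in re.findall(r'\\\\w+', text) if w not in stop_words]\n",
"\n",
"# Strategies are interchangeable without touching TextAnalyzer:\n",
"tokenizer: Tokenizer = RegexTokenizer()\n",
"print(tokenizer.tokenize('hello, world! hello', {'world'}))  # ['hello', 'hello']"
]
},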
{
"cell_type": "markdown",
"id": "fbf53455-558c-40fb-8718-446dec989b5d",
"metadata": {},
"source": [
"## 观察者模式(结果输出解耦)\n",
"\n",
"观察者模式可用于解耦结果输出逻辑(如打印、保存文件、发送通知)。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7a2bd4c-df73-4800-b45b-9b6c73d28d7b",
"metadata": {},
"outputs": [],
"source": [
"class OutputObserver(ABC):\n",
" \"\"\"输出观察者接口\"\"\"\n",
" @abstractmethod\n",
" def update(self, top_words: List[Tuple[str, int]]):\n",
" pass\n",
"\n",
"class ConsoleOutput(OutputObserver):\n",
" \"\"\"控制台输出\"\"\"\n",
" def update(self, top_words: List[Tuple[str, int]]):\n",
" for word, count in top_words:\n",
" print(f\"{word}: {count}\")\n",
"\n",
"class FileOutput(OutputObserver):\n",
" \"\"\"文件输出\"\"\"\n",
" def __init__(self, output_file: str):\n",
" self.output_file = output_file\n",
"\n",
" def update(self, top_words: List[Tuple[str, int]]):\n",
" with open(self.output_file, 'w', encoding='utf-8') as f:\n",
" for word, count in top_words:\n",
" f.write(f\"{word}: {count}\\n\")\n",
"\n",
"class TextAnalyzer:\n",
" def __init__(self, config_path='config.yaml'):\n",
" with open(config_path, 'r', encoding='utf-8') as f:\n",
" config = yaml.safe_load(f)\n",
" self.data_dir = config['data_dir']\n",
" self.top_n = config['top_n']\n",
" self.stop_words_file = config['stop_words_file']\n",
" self.output_file = config['output_file']\n",
" self.stop_words = self.load_stop_words()\n",
" self.word_count = Counter()\n",
" self.observers = [ConsoleOutput(), FileOutput(self.output_file)]\n",
"\n",
" def add_observer(self, observer: OutputObserver):\n",
" \"\"\"添加观察者\"\"\"\n",
" self.observers.append(observer)\n",
"\n",
" def notify_observers(self, top_words: List[Tuple[str, int]]):\n",
" \"\"\"通知所有观察者\"\"\"\n",
" for observer in self.observers:\n",
" observer.update(top_words)\n",
"\n",
" def run(self):\n",
" \"\"\"执行词频统计并通知观察者\"\"\"\n",
" self.process_directory()\n",
" top_words = self.get_top_words()\n",
" self.notify_observers(top_words)\n",
"\n",
" # 其余方法同上"
]
},
{
"cell_type": "markdown",
"id": "02b5cfba-431c-4a01-a454-099e4f41922c",
"metadata": {},
"source": [
"### 分析\n",
"\n",
"工程质量提升:\n",
" - 可扩展性:添加新输出方式只需实现 OutputObserver 接口。\n",
" - 解耦性:输出逻辑与统计逻辑分离,修改输出不影响核心功能。\n",
"\n",
"适用场景:适合需要多种输出或通知的场景。\n",
"\n",
"局限性:观察者模式增加代码复杂性,适合复杂输出需求。"
]
},
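{
"cell_type": "markdown",
"id": "8c3e5f7a-2b4d-4c6e-8a0f-1d3e5f7a9b1c",
"metadata": {},
"source": [
"As a sketch of that extensibility: a hypothetical `JsonFileOutput` observer (not part of the case study) plugs in through `add_observer` without touching the statistics code."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a7c9e1b-3d5f-4a6b-8c0d-2e4f6a8b0c2d",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"class JsonFileOutput(OutputObserver):\n",
"    \"\"\"Hypothetical observer: writes the top words as JSON.\"\"\"\n",
"    def __init__(self, output_file: str):\n",
"        self.output_file = output_file\n",
"\n",
"    def update(self, top_words: List[Tuple[str, int]]):\n",
"        with open(self.output_file, 'w', encoding='utf-8') as f:\n",
"            json.dump(dict(top_words), f, ensure_ascii=False, indent=2)\n",
"\n",
"# Usage (assumes a valid config.yaml; the file name is illustrative):\n",
"# analyzer = TextAnalyzer()\n",
"# analyzer.add_observer(JsonFileOutput('top_words.json'))\n",
"# analyzer.run()"
]
},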
{
"cell_type": "markdown",
"id": "11669305-8cd5-4317-afd5-e85c3f0a5a81",
"metadata": {},
"source": [
"## 工厂模式(动态创建分词器)\n",
"\n",
"工厂模式可用于动态创建分词器,简化策略模式中的初始化逻辑。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fa50633-de22-40c8-912d-3ded5ebcedfc",
"metadata": {},
"outputs": [],
"source": [
"class TokenizerFactory:\n",
" \"\"\"分词器工厂\"\"\"\n",
" @staticmethod\n",
" def create_tokenizer(name: str) -> Tokenizer:\n",
" tokenizers = {\n",
" 'jieba': JiebaTokenizer(),\n",
" 'simple': SimpleTokenizer()\n",
" }\n",
" return tokenizers.get(name, JiebaTokenizer())\n",
"\n",
"class TextAnalyzer:\n",
" def __init__(self, config_path='config.yaml'):\n",
" with open(config_path, 'r', encoding='utf-8') as f:\n",
" config = yaml.safe_load(f)\n",
" self.data_dir = config['data_dir']\n",
" self.top_n = config['top_n']\n",
" self.stop_words_file = config['stop_words_file']\n",
" self.output_file = config['output_file']\n",
" self.stop_words = self.load_stop_words()\n",
" self.word_count = Counter()\n",
" self.tokenizer = TokenizerFactory.create_tokenizer(config.get('tokenizer', 'jieba'))\n",
"\n",
" # 其余方法同上"
]
},
{
"cell_type": "markdown",
"id": "a4db7046-dfe2-4bd8-81d1-49a42e2eeb5c",
"metadata": {},
"source": [
"### 分析\n",
"\n",
"工程质量提升:\n",
" - 可维护性:分词器创建逻辑集中于工厂,易于修改。\n",
" - 可扩展性:添加新分词器只需更新工厂方法。\n",
"\n",
"适用场景:适合需要动态创建对象的场景。\n",
"\n",
"局限性:对于简单场景,工厂模式可能略显冗余。"
]
},
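{
"cell_type": "markdown",
"id": "6e8a0c2d-4f6a-4b8c-9e1f-3a5b7c9d1e3f",
"metadata": {},
"source": [
"A minimal usage sketch: callers go through the factory and never name a concrete class, and unknown names fall back to the default strategy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d1f3a5b-7c9e-4d0f-8b2c-4e6a8c0d2f4a",
"metadata": {},
"outputs": [],
"source": [
"tok = TokenizerFactory.create_tokenizer('simple')\n",
"print(tok.tokenize('a quick brown fox', {'a'}))  # ['quick', 'brown', 'fox']\n",
"\n",
"# An unrecognized name yields the default tokenizer:\n",
"fallback = TokenizerFactory.create_tokenizer('unknown')\n",
"print(type(fallback).__name__)  # JiebaTokenizer"
]
},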
{
"cell_type": "markdown",
"id": "07158f09-703e-4abb-ac1a-881ba1b3b26d",
"metadata": {},
"source": [
"## 附:元编程\n",
"\n",
"元编程允许动态修改类或函数行为,可用于动态配置分词器或输出格式。案例中,可通过元编程动态注册分词器。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4394008c-88da-44bd-aa0d-f1b7a6dbc7d6",
"metadata": {},
"outputs": [],
"source": [
"class TokenizerRegistry(type):\n",
" \"\"\"元类:动态注册分词器\"\"\"\n",
" tokenizers = {}\n",
"\n",
" def register_tokenizer(cls, name):\n",
" def decorator(func):\n",
" cls.tokenizers[name] = func\n",
" return func\n",
" return decorator\n",
"\n",
"class TextAnalyzer(metaclass=TokenizerRegistry):\n",
" def __init__(self, config_path='config.yaml'):\n",
" with open(config_path, 'r', encoding='utf-8') as f:\n",
" config = yaml.safe_load(f)\n",
" self.data_dir = config['data_dir']\n",
" self.top_n = config['top_n']\n",
" self.stop_words_file = config['stop_words_file']\n",
" self.output_file = config['output_file']\n",
" self.stop_words = self.load_stop_words()\n",
" self.word_count = Counter()\n",
" self.tokenizer_name = config.get('tokenizer', 'jieba')\n",
"\n",
" @classmethod\n",
" def register_tokenizer(cls, name):\n",
" return cls.__class__.register_tokenizer(name)\n",
"\n",
" def tokenize(self, text: str) -> List[str]:\n",
" \"\"\"动态调用分词器\"\"\"\n",
" tokenizer = self.__class__.tokenizers.get(self.tokenizer_name)\n",
" return tokenizer(self, text)\n",
"\n",
" @register_tokenizer('jieba')\n",
" def jieba_tokenizer(self, text: str) -> List[str]:\n",
" \"\"\"jieba 分词\"\"\"\n",
" return [w for w in jieba.lcut(text) if w not in self.stop_words]\n",
"\n",
" @register_tokenizer('simple')\n",
" def simple_tokenizer(self, text: str) -> List[str]:\n",
" \"\"\"简单分词(按空格)\"\"\"\n",
" return [w for w in text.split() if w not in self.stop_words]\n",
"\n",
" # 其余方法load_stop_words, process_file, etc.)同上"
]
},
{
"cell_type": "markdown",
"id": "30ba75ea-f769-4f90-9075-27670db9ada4",
"metadata": {},
"source": [
"### 分析\n",
"\n",
"工程质量提升:\n",
"- 可扩展性:新分词器只需添加新方法并注册,无需修改核心部分。\n",
"- 灵活性:通过配置文件动态选择分词器。\n",
"\n",
"适用场景:适合需要动态配置或插件化系统的场景。\n",
"\n",
"局限性:元编程增加代码复杂性,需要团队整体技术能力支持 。"
]
}
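,
{
"cell_type": "markdown",
"id": "1b3d5f7a-9c1e-4f2a-8d4b-6e8a0c2e4f6b",
"metadata": {},
"source": [
"To sketch the plugin claim: because the registry lives on the metaclass, a new tokenizer can be registered even after the class is defined. The `chars` tokenizer below is a hypothetical example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c4e6a8b-0d2f-4a3b-9e5c-7f9b1d3f5a7c",
"metadata": {},
"outputs": [],
"source": [
"@register_tokenizer('chars')\n",
"def char_tokenizer(self, text: str) -> List[str]:\n",
"    \"\"\"Hypothetical plugin: one token per non-whitespace, non-stop-word character.\"\"\"\n",
"    return [c for c in text if not c.isspace() and c not in self.stop_words]\n",
"\n",
"print('chars' in TokenizerRegistry.tokenizers)  # True\n",
"# With tokenizer: chars in config.yaml, TextAnalyzer.tokenize now dispatches here."
]
}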
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}