Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

This is an official implementation of “Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images” with PyTorch, accepted by CVPR 2024 (Highlight).

Paper Link

If our work is helpful for your research, please consider citing:

@inproceedings{huang2024adapting,
  title={Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images},
  author={Huang, Chaoqin and Jiang, Aofan and Feng, Jinghao and Zhang, Ya and Wang, Xinchao and Wang, Yanfeng},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Abstract: Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero-/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, and 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively.

Keywords: Anomaly Detection, Medical Images

Get Started

Environment

  • python >= 3.8.5
  • pytorch >= 1.10.0
  • torchvision >= 0.11.1
  • numpy >= 1.19.2
  • scipy >= 1.5.2
  • kornia >= 0.6.1
  • pandas >= 1.1.3
  • opencv-python >= 4.5.4
  • pillow
  • tqdm
  • ftfy
  • regex
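The version constraints above can be collected into a `requirements.txt` (a sketch assembled from the list; the authors' exact pins may differ, and `pytorch` installs as the `torch` package):

```
torch>=1.10.0
torchvision>=0.11.1
numpy>=1.19.2
scipy>=1.5.2
kornia>=0.6.1
pandas>=1.1.3
opencv-python>=4.5.4
pillow
tqdm
ftfy
regex
```

Install with `pip install -r requirements.txt` inside a Python >= 3.8.5 environment.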

Pretrained model

Medical Anomaly Detection Benchmark

  1. (optional) Follow BMAD to apply for permission to download the relevant datasets. After extracting the data, reorganize the benchmark according to the guidelines provided in our Appendix A.

  2. Alternatively, we provide the pre-processed benchmark. Please download the following dataset.

  3. Place the archives in the `data` directory at the project root and unzip them:

    tar -xvf Liver.tar.gz
    tar -xvf Brain.tar.gz
    tar -xvf Histopathology_AD.tar.gz
    tar -xvf Retina_RESC.tar.gz
    tar -xvf Retina_OCT2017.tar.gz
    tar -xvf Chest.tar.gz
    

File Structure

After the preparation work, the whole project should have the following structure:

code
├─ ckpt
│  ├─ few-shot
│  └─ zero-shot
├─ CLIP
│  ├─ bpe_simple_vocab_16e6.txt.gz
│  ├─ ckpt
│  │  └─ ViT-L-14-336px.pt
│  ├─ clip.py
│  ├─ model.py
│  ├─ models.py
│  ├─ model_configs
│  │  └─ ViT-L-14-336.json
│  ├─ modified_resnet.py
│  ├─ openai.py
│  ├─ tokenizer.py
│  └─ transformer.py
├─ data
│  ├─ Brain_AD
│  │  ├─ valid
│  │  └─ test
│  ├─ ...
│  └─ Retina_RESC_AD
│     ├─ valid
│     └─ test
├─ dataset
│  ├─ fewshot_seed
│  │  ├─ Brain
│  │  ├─ ...
│  │  └─ Retina_RESC
│  ├─ medical_few.py
│  └─ medical_zero.py
├─ loss.py
├─ prompt.py
├─ readme.md
├─ train_few.py
├─ train_zero.py
└─ utils.py

Quick Start

python test_few.py --obj $target-object --shot $few-shot-number

For example, to test on the Brain MRI with k=4, simply run:

python test_few.py --obj Brain --shot 4

Training

python train_few.py --obj $target-object --shot $few-shot-number

For example, to train on the Brain MRI with k=4, simply run:

python train_few.py --obj Brain --shot 4

Results

Results of zero-shot anomaly detection and localization:

| Zero-shot (AUC %) | Detection (Paper) | Detection (Implementation) | Localization (Paper) | Localization (Implementation) |
| --- | --- | --- | --- | --- |
| HIS | 77.90 | 76.90 | - | - |
| ChestXray | 71.11 | 71.11 | - | - |
| OCT17 | 95.40 | 95.40 | - | - |
| BrainMRI | 78.63 | 79.80 | 90.27 | 89.68 |
| LiverCT | 76.24 | 81.18 | 97.85 | 97.93 |
| RESC | 83.31 | 88.99 | 92.05 | 90.44 |
| Average | 80.43 | 82.23 | 93.39 | 92.68 |

Results of few-shot anomaly detection and localization with k=4:

| 4-shot (AUC %) | Detection (Paper) | Detection (Implementation) | Localization (Paper) | Localization (Implementation) |
| --- | --- | --- | --- | --- |
| HIS | 82.71 | 82.71 | - | - |
| ChestXray | 81.95 | 81.95 | - | - |
| OCT17 | 99.38 | 99.38 | - | - |
| BrainMRI | 92.44 | 92.31 | 97.30 | 97.30 |
| LiverCT | 81.18 | 81.18 | 99.73 | 99.69 |
| RESC | 96.18 | 96.18 | 98.97 | 98.97 |
| Average | 88.97 | 88.95 | 98.67 | 98.65 |
Visualization

Training: Multi-Level Feature Adaptation

Applying CLIP adapters at multiple feature levels of CLIP is the core design of the MVFA (Multi-level Visual Feature Adapter) framework proposed in the paper. In essence, lightweight learnable modules are inserted at different feature stages of the CLIP visual encoder, adapting the features without fine-tuning the backbone, so that a CLIP model pre-trained on natural images can be repurposed for medical anomaly detection.


Feature Levels of the Visual Encoder

CLIP's visual encoder (ViT-L/14 in the paper) is hierarchical: it progressively transforms an image from raw pixels into abstract features. The paper divides the encoder into four feature levels (stages) S₁–S₄, each corresponding to a different degree of abstraction:

| Stage | Location | Abstraction | Task value |
| --- | --- | --- | --- |
| S₁ | output of the first 6 encoder layers | low (close to pixels) | captures local detail (e.g., the edges of tiny lesions in medical images) |
| S₂ | output of the middle 6 encoder layers | medium | balances local detail and global structure (e.g., tumor contours) |
| S₃ | output of the later 6 encoder layers | high | captures global semantics (e.g., overall structural anomalies in brain MRI) |
| S₄ | final encoder output | highest (global feature) | image-level classification (is the image anomalous?) |

In short, the later the level (toward S₄), the more abstract the features and the closer they are to natural-image object semantics (e.g., "cat" / "car" categories); the earlier the level (toward S₁), the more concrete and pixel-oriented they are. Medical anomaly detection needs the detail features of S₁–S₃ to localize lesions (segmentation) as well as the global feature of S₄ to judge whether an image is anomalous (classification); this is the core motivation for multi-level adaptation.

Rather than applying the adapter at a single feature level (e.g., only S₄), the paper applies adapters across S₁–S₃ (the intermediate levels) plus S₄ (the final level).
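The multi-level residual adaptation can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: the adapter shapes, the bottleneck width `r`, the scaling factor `gamma`, and the use of one adapter per stage are assumptions for exposition (the real adapters operate on CLIP's actual stage outputs inside the training code).

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_adapter(feat, W_down, W_up, gamma=0.1):
    """Illustrative residual adapter: a small bottleneck MLP whose output is
    added back onto the frozen feature, so CLIP's pre-trained knowledge is
    preserved and only a lightweight correction is learned."""
    h = np.maximum(feat @ W_down, 0.0)    # down-project + ReLU
    return feat + gamma * (h @ W_up)      # residual connection

d, r = 768, 64                            # feature dim / bottleneck dim (assumed)
# Stand-ins for the frozen stage outputs S1..S4 (197 tokens of dim 768 each)
stages = [rng.standard_normal((197, d)) for _ in range(4)]
# One independent, learnable adapter per stage: the "multi-level" part
adapters = [(0.02 * rng.standard_normal((d, r)),
             0.02 * rng.standard_normal((r, d))) for _ in range(4)]

adapted = [residual_adapter(s, Wd, Wu) for s, (Wd, Wu) in zip(stages, adapters)]
print([a.shape for a in adapted])         # each stage keeps its original shape
```

Because the correction is residual and small at initialization, the adapted features start close to the frozen CLIP features, which is what makes the adaptation "lightweight" and backbone-preserving.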

Task Gap: From Object-Semantics Recognition to Anomaly Discrimination

CLIP is pre-trained to recognize object categories in natural images (e.g., distinguishing "cat" from "car"), so it focuses on the semantic features of normal objects. Medical anomaly detection instead must distinguish normal tissue from abnormal lesions (e.g., normal brain tissue vs. a tumor), which hinges on local detail deviations.

  • Applying an adapter only at S₄ (the global feature) optimizes image-level classification but cannot capture pixel-level anomaly detail (e.g., small lesions).
  • Applying adapters at all levels (S₁–S₄) lets S₁–S₃ adapt the local detail features (lesion localization) while S₄ adapts the global feature (image-level anomaly decision), satisfying both classification and segmentation.
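The division of labor between levels can be illustrated with a toy zero-shot scoring step: cosine similarity between visual features and text embeddings of "normal"/"abnormal" prompts, where S₁–S₃ patch features yield a pixel-level anomaly map and the S₄ global feature yields an image-level score. All arrays here are random stand-ins; the paper's actual prompt design and fusion differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

d = 768
# Stand-ins: adapted patch features from S1-S3 (for localization) and the
# adapted global S4 feature (for classification)
patch_feats = [l2norm(rng.standard_normal((196, d))) for _ in range(3)]
global_feat = l2norm(rng.standard_normal(d))
# Stand-in text embeddings for "normal" / "abnormal" prompts
t_normal = l2norm(rng.standard_normal(d))
t_abnormal = l2norm(rng.standard_normal(d))

def p_abnormal(feat):
    """Softmax over the two cosine similarities -> probability of 'abnormal'."""
    sims = np.stack([feat @ t_normal, feat @ t_abnormal], axis=-1)
    e = np.exp(sims - sims.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True))[..., 1]

pixel_map = np.mean([p_abnormal(p) for p in patch_feats], axis=0)  # (196,) map
image_score = p_abnormal(global_feat)                               # scalar
```

Averaging the per-level maps is one simple way to fuse S₁–S₃; the key point is that the same visual-language comparison serves both the segmentation map and the classification score.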
Domain Gap: From Natural Images to Medical Images

Natural images (e.g., landscapes, animals) and medical images (e.g., MRI, CT) differ drastically in style, texture, and semantics, so the domain shift is severe:

  • Single-level adaptation adjusts features at only one level of abstraction and cannot fully bridge the domain gap (adapting only S₄ fixes global semantics; adapting only S₁ fixes pixel detail).
  • Multi-level adaptation (S₁–S₄) aligns CLIP's features layer by layer with the normal/abnormal characteristics of medical images (e.g., S₁ adaptation brings pixel detail closer to CT gray-level distributions, while S₄ adaptation shifts global semantics from object categories to tissue normality).
Generalization: Zero-Shot Ability Across Modalities and Anatomical Regions

The core goal is for the model to remain effective on medical modalities unseen during training (e.g., train on MRI, test on CT) and on unseen anatomical regions (e.g., train on brain, test on liver):

  • Through layer-by-layer feature calibration, the multi-level adapters push every level to learn generic normal/abnormal feature patterns rather than the idiosyncrasies of a single modality.
  • The comparison experiments show that using only the multi-level adapters (rather than a global projector) improves zero-shot classification AUC by 13.57% on average (Table 4), confirming that multi-level adaptation is the key to generalization.
Comparison With Single-Level Adaptation: Multi-Level Wins

An ablation study (Table 5) verifies the necessity of multi-level application:

  • Single-level adaptation (the best single level, S₂): classification AUC 88.84%, segmentation AUC 98.62%.
  • Multi-level adaptation (fusing S₁–S₄): classification AUC rises to 88.97% and segmentation AUC to 98.67%, with more stable performance across all datasets (avoiding any single level's bias toward one type of medical image).

In essence, "applying CLIP adapters at multiple feature levels of CLIP" means using lightweight, residual, multi-level, dual-task adaptation modules to shift every layer of CLIP's features, step by step, from natural-image object semantics toward medical-image anomaly features, without destroying CLIP's pre-trained knowledge. This enables zero-/few-shot medical anomaly detection and localization across modalities and anatomical regions, and it is the core innovation that lets the MVFA framework outperform other CLIP-based methods such as WinCLIP and April-GAN.