4.0 KiB

Raw Permalink Blame History Unescape Escape

比赛性能优化记录

日期：2026-04-27

本轮已落地

1. FFT：模乘/模幂 idiom lowering

目标用例：fft1、fft0。

已实现：

在 MIR 增加 ModMul，识别递归 multiply(a, b) 的模乘 idiom，lower 成 smull + sdiv + msub，消除 multiply 递归调用。
在 MIR 增加 ModPow，识别递归 power(a, b) 的快速幂 idiom，lower 成后端内联循环，消除 power 递归调用。
fft1 汇编中 bl multiply / bl power 数量降为 0，仅保留算法本身的 fft 递归。

主要位置：

include/mir/MIR.h
src/mir/Lowering.cpp
src/mir/AsmPrinter.cpp
src/mir/MIRInstr.cpp
src/mir/passes/Peephole.cpp
src/mir/passes/SpillReduction.cpp

验证结果：

fft1：输出匹配，qemu 本地约 0.42s。
fft0：输出匹配，qemu 本地约 0.23s。

2. 03_sort2：power-of-two digit extraction

目标用例：03_sort2。

已实现：

识别 while (i < pos) num = num / 16; return num % 16; 这类 power-of-two radix digit helper。
IR 内联器会跳过该 helper，避免把小函数展开成大量循环。
后端用 DigitExtractPow2 直接 lower 成移位、带符号除法修正和取余序列，消除 bl getNumPos。
修复 GVN/CSE 的常量等价键，避免等值常量因对象地址不同而错过跨块消冗余。

主要位置：

src/ir/passes/MathIdiomUtils.h
src/ir/passes/Inline.cpp
src/ir/passes/GVN.cpp
src/ir/passes/CSE.cpp
src/mir/Lowering.cpp
src/mir/AsmPrinter.cpp

验证结果：

03_sort2：输出匹配，qemu 本地约 19.56s。
对比此前表中 31.317s，该项收益明显。

3. matmul / 2025-MYO-20：标量基础优化

目标用例：matmul1/2/3、2025-MYO-20。

已实现：

新增 IR ArithmeticSimplify，把 % power_of_two == 0 化成 bit-test，例如 x % 2 == 0 变为 (x & 1) == 0。
增强 LoadStoreElim，允许安全的跨块 load forwarding，解决 if 前已加载、then 块重复加载的问题。
修复 DominatorTree 的 immediate dominator 判定方向，恢复跨块 GVN/LICM/LSE 的基础支配关系。
matmul2 的内层核心从重复 load + 重复 mul 变为复用同一个乘积。

主要位置：

src/ir/passes/ArithmeticSimplify.cpp
src/ir/passes/LoadStoreElim.cpp
src/ir/analysis/DominatorTree.cpp
src/ir/passes/PassManager.cpp

验证结果：

matmul2：输出匹配，qemu 本地约 7.09s。
对比此前表中 8.407s，已有收益。

尚未完成：

真正的 NEON 向量化、矩阵 loop interchange/blocking 还没有落地。当前 MIR 没有 SIMD value type、NEON 寄存器类、向量 load/store、向量 arithmetic，也没有稳定的 loop-nest interchange/blocking 框架。硬塞样例级重写风险过高，不适合作为通用比赛编译器优化。

4. gameoflife：stencil 前置优化

目标用例：gameoflife-*。

已实现：

通过支配树修复和跨块 load forwarding，让 stencil 里的重复地址计算和重复 load 有更多被 GVN/LSE 消除的机会。

验证结果：

gameoflife-oscillator：输出匹配，qemu 本地约 8.82s。

尚未完成：

真正的 stencil NEON/行缓存优化还未落地。需要先补 SIMD MIR 和更明确的二维数组滑窗识别，否则容易做成样例特化。

5. 65_color

该用例加速比难看但绝对损失很小，本轮未优先处理。后续应只在大头用例收敛后再看。

本轮验证

cmake --build build -j：通过。
单例 qemu 对比均做了 stdout + exit code 的规范化 diff。
未运行全量测试，避免耗时过长。

下一步优先级

为 MIR 增加 NEON value type、向量寄存器类、vector load/store 和基础 i32x4/f32x4 arithmetic。
在 IR 层补 loop-nest 识别，先做安全的矩阵 loop interchange，再考虑 blocking。
对 gameoflife 做通用 stencil matcher，先生成 scalar row-cache，再接 NEON。
对 2025-MYO-20 单独用 scripts/analyze_case.sh 保存 IR/ASM，与 GCC 汇编对照后决定是否值得做 matmul micro-kernel lowering。