# Session Handoff
Date: 2026-05-28
## Repo State
- Current branch: `Shrink`
- Worktree is dirty; do not reset blindly.
- Modified tracked files:
  - `include/ir/IR.h`
  - `scripts/run_all_tests.sh`
  - `scripts/verify_asm.sh`
  - `scripts/verify_ir.sh`
  - `src/ir/analysis/DominatorTree.cpp`
  - `src/ir/analysis/LoopInfo.cpp`
  - `src/ir/passes/CMakeLists.txt`
  - `src/ir/passes/PassManager.cpp`
  - `src/main.cpp`
  - `src/mir/AsmPrinter.cpp`
  - `src/mir/Lowering.cpp`
  - `src/mir/MIRFunction.cpp`
  - `src/mir/passes/Peephole.cpp`
  - `sylib/sylib.c`
- New untracked files:
  - `src/ir/passes/LICM.cpp`
  - `src/ir/passes/LoopFission.cpp`
  - `src/ir/passes/LoopIdiom.cpp`
  - `src/ir/passes/LoopParallelize.cpp`
  - `src/ir/passes/LoopPassUtils.h`
  - `src/ir/passes/LoopUnroll.cpp`
  - `src/ir/passes/StrengthReduction.cpp`
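
Because the worktree is dirty, it may be worth capturing a snapshot before any operation that could discard changes. The following is a minimal sketch and not part of the repo: `snapshot_worktree` and its output file names are hypothetical, and it assumes the function is run from the repo root with an output directory outside the repo.

```bash
# Write tracked changes to a patch and untracked files to a tarball,
# leaving the worktree itself untouched.
# Hypothetical helper: the name and output file names are illustrative.
snapshot_worktree() {
  local out_dir="$1"
  mkdir -p "$out_dir"
  # Staged and unstaged changes to tracked files, relative to HEAD.
  git diff --binary HEAD > "$out_dir/tracked.patch"
  # Untracked, non-ignored files (for example the new loop passes).
  git ls-files --others --exclude-standard -z |
    tar -czf "$out_dir/untracked.tar.gz" --null -T -
}
```

For example, `snapshot_worktree /tmp/shrink_snapshot` would leave `tracked.patch` and `untracked.tar.gz` in that directory. Ignored files are skipped, so anything covered by `.gitignore` would need to be copied separately.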
## Toolchain On Current Machine
- `cmake 3.22.1`
- `g++ 11.4.0`
- `clang 14.0.0`
- `llc 14.0.0`
- `aarch64-linux-gnu-gcc 11.4.0`
- `qemu-aarch64 6.2.0`

Required packages on a fresh Ubuntu install:

```bash
sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  clang \
  llvm \
  gcc-aarch64-linux-gnu \
  qemu-user \
  libc6-arm64-cross
```
## Important Build Detail
- The repo vendors `antlr4-runtime-4.13.2` in `third_party`, so no system ANTLR runtime install is needed.
- The current frontend build consumes generated parser sources from `build/generated/antlr4` if that directory is present.
- Parser sources also exist in `src/antlr4/`, but the current CMake configuration does not wire that directory directly into the build.
- Safest migration path: copy the repo together with the current `build/generated/antlr4` directory, or later patch CMake to use `src/antlr4/*.cpp`.
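
Before building on a new machine, a quick check like the one below can confirm that the generated parser sources came along with the copy. This is only a sketch: `has_generated_parser` is a hypothetical helper, and it assumes the generated sources include at least one `.cpp` file.

```bash
# Return 0 when the directory holds at least one generated .cpp file.
# Hypothetical helper; the default path is the one named in this handoff.
has_generated_parser() {
  local dir="${1:-build/generated/antlr4}"
  [ -d "$dir" ] && [ -n "$(find "$dir" -name '*.cpp' -print -quit)" ]
}

if ! has_generated_parser; then
  echo "build/generated/antlr4 is missing or empty; copy it from the old machine" >&2
fi
```

The check only reports the problem. It does not regenerate the parser or change the CMake configuration.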
## Implemented IR / Loop Optimizations
Stable implemented items:
- `LICM`
- `StrengthReduction`
- `LoopFission`
- `LoopUnroll`
- conservative `LoopParallelization`
- `LoopIdiom` for constant-fill loops

Analysis infrastructure already added:
- `DominatorTree`
- `LoopInfo`

Runtime support added:
- pthread worker-pool-based `__par_runN` in `sylib/sylib.c`
- `__fill_i32` helper in `sylib/sylib.c`

User constraints already decided:
- Do not optimize the matrix multiply in `2025-MYO-20`, which carries a real dependence: `A[i][j]` is written and `A[k][j]` is read.
- Reduction parallelization is still disabled.
## Timing Scripts
Timing output was added to:
- `scripts/verify_ir.sh`
- `scripts/verify_asm.sh`
- `scripts/run_all_tests.sh`

User requirement:
- Every test round should always report:
  - `test/test_case/performance/2025-MYO-20.sy`
  - `./scripts/run_all_tests.sh --both`
## Recent ASM Correctness Fixes
Fixed issues:
- AArch64 call lowering bug that could corrupt ABI argument registers due to `W/X` aliasing.
- Duplicate local labels like `.par.exit` across worker functions, fixed by prefixing block labels with the function name.
- Duplicate callee-saved save/restore of alias registers like `w8/x8`.

Relevant files:
- `src/mir/Lowering.cpp`
- `src/mir/AsmPrinter.cpp`
- `src/mir/MIRFunction.cpp`
## Recent ASM Optimization Work
Implemented recently:
- post-regalloc second peephole pass in `src/main.cpp`
- selective safe load forwarding guard for ABI argument registers
- `cbz/cbnz` lowering for integer compare-against-zero in `Cmp + CondBr` fusion
- dead-overwrite elimination in the peephole pass for an adjacent load or compute whose result is overwritten before use

Relevant files:
- `src/main.cpp`
- `src/mir/Lowering.cpp`
- `src/mir/passes/Peephole.cpp`
## Most Recent Measured Performance
These are the latest numbers measured during this session.

IR:
- `2025-MYO-20`
  - stable reference before the latest ASM-only work: around `31.109s`
  - earlier stable reference before that: around `30.926s`

ASM:
- `02_mv3`
  - earlier problematic run after correctness-only fix: about `31.662s`
  - after later backend cleanup, best observed run in this session: about `31.505s`
  - another later run: about `31.529s`
- `01_mm2`
  - earlier reference in this session: about `38.010s`
  - later improved run: about `37.346s`

Interpretation:
- ASM backend improvements are real but modest so far: roughly 0.5% on `02_mv3` (`31.662s` to `31.505s`) and roughly 1.7% on `01_mm2` (`38.010s` to `37.346s`).
- The main remaining bottleneck is still heavy stack traffic in hot loops.
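
To keep speedup claims comparable across runs, the relative change can be computed the same way each time. The helper below is a small sketch (`pct_improvement` is a hypothetical name, not a script in the repo), applied to the ASM timings listed above.

```bash
# Print the percentage reduction in run time from `before` to `after`.
# Hypothetical helper; inputs are wall-clock seconds.
pct_improvement() {
  local before="$1" after="$2"
  LC_ALL=C awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%.2f\n", (b - a) / b * 100 }'
}

pct_improvement 31.662 31.505   # 02_mv3: prints 0.50
pct_improvement 38.010 37.346   # 01_mm2: prints 1.75
```

The two later `02_mv3` runs (`31.505s` and `31.529s`) differ by under 0.1%, which gives a rough, two-sample sense of run-to-run noise at this scale.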
## Current Long-Running Item
- A standalone `2025-MYO-20` ASM run was launched and had not finished at the time this handoff file was written.
- A full `./scripts/run_all_tests.sh --both` run had progressed to the final `2025-MYO-20` ASM item instead of failing early, but final completion time was still pending.
## Good Commands To Resume Work
Build:
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target compiler
```
Quick correctness:
```bash
./scripts/verify_ir.sh test/test_case/functional/simple_add.sy /tmp/ir_check --run
./scripts/verify_asm.sh test/test_case/functional/simple_add.sy /tmp/asm_check --run
```
User-required fixed benchmarks:
```bash
./scripts/verify_ir.sh test/test_case/performance/2025-MYO-20.sy /tmp/timed_2025 --run
./scripts/run_all_tests.sh --both
```
Useful ASM profiling targets:
```bash
./scripts/verify_asm.sh test/test_case/performance/01_mm2.sy /tmp/asm_mm2 --run
./scripts/verify_asm.sh test/test_case/performance/02_mv3.sy /tmp/asm_mv3 --run
./scripts/verify_asm.sh test/test_case/performance/2025-MYO-20.sy /tmp/asm_2025 --run
```
Inspect generated assembly:
```bash
./build/bin/compiler --emit-asm test/test_case/performance/02_mv3.sy > /tmp/02_mv3.s
./build/bin/compiler --emit-asm test/test_case/performance/01_mm2.sy > /tmp/01_mm2.s
./build/bin/compiler --emit-asm test/test_case/performance/2025-MYO-20.sy > /tmp/2025.s
```
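
Since heavy stack traffic is the main remaining bottleneck, a rough count of stack-relative loads and stores in the emitted assembly can serve as a cheap before/after signal. This is a sketch under assumptions: `count_stack_traffic` is a hypothetical helper, and it assumes the printer emits one instruction per line and addresses stack slots through `sp` or `x29`. It counts instructions in the whole file, not only in hot loops.

```bash
# Count load/store instructions whose base register is sp or x29.
# Assumption: one instruction per line, stack slots written as
# [sp, ...] or [x29, ...]. Matches ldr/str/ldur/stur/ldp/stp and
# their suffixed forms such as ldrb or ldrsw.
count_stack_traffic() {
  local asm_file="$1"
  grep -cE '^[[:space:]]*(ldr|str|ldur|stur|ldp|stp)[a-z]*[[:space:]].*\[(sp|x29)[],]' \
    "$asm_file" || true
}
```

For example, `count_stack_traffic /tmp/02_mv3.s` before and after a peephole change shows whether the change removed any stack accesses at all. Timing on the fixed benchmarks still decides whether it mattered.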
## Suggested Next Steps
Priority order:
1. Finish measuring `2025-MYO-20` ASM and a complete `--both` run on the faster Ubuntu machine.
2. Keep working on MIR/ASM backend, not IR parallelization.
3. Target hot-loop stack traffic:
   - reduce phi-related spill/reload churn
   - widen zero-compare branch simplification beyond the current fused path
   - add more dead store / dead load cleanup after frame lowering
4. Only claim speedups when confirmed with the fixed benchmark pair above.