# Session Handoff

Date: 2026-05-28

## Repo State

- Current branch: `Shrink`
- Worktree is dirty; do not reset blindly.
- Modified tracked files:
  - `include/ir/IR.h`
  - `scripts/run_all_tests.sh`
  - `scripts/verify_asm.sh`
  - `scripts/verify_ir.sh`
  - `src/ir/analysis/DominatorTree.cpp`
  - `src/ir/analysis/LoopInfo.cpp`
  - `src/ir/passes/CMakeLists.txt`
  - `src/ir/passes/PassManager.cpp`
  - `src/main.cpp`
  - `src/mir/AsmPrinter.cpp`
  - `src/mir/Lowering.cpp`
  - `src/mir/MIRFunction.cpp`
  - `src/mir/passes/Peephole.cpp`
  - `sylib/sylib.c`
- New untracked files:
  - `src/ir/passes/LICM.cpp`
  - `src/ir/passes/LoopFission.cpp`
  - `src/ir/passes/LoopIdiom.cpp`
  - `src/ir/passes/LoopParallelize.cpp`
  - `src/ir/passes/LoopPassUtils.h`
  - `src/ir/passes/LoopUnroll.cpp`
  - `src/ir/passes/StrengthReduction.cpp`

## Toolchain On Current Machine

- `cmake 3.22.1`
- `g++ 11.4.0`
- `clang 14.0.0`
- `llc 14.0.0`
- `aarch64-linux-gnu-gcc 11.4.0`
- `qemu-aarch64 6.2.0`

Required packages on a fresh Ubuntu:

```bash
sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  clang \
  llvm \
  gcc-aarch64-linux-gnu \
  qemu-user \
  libc6-arm64-cross
```

## Important Build Detail

- The repo vendors `antlr4-runtime-4.13.2` in `third_party`, so no system ANTLR runtime install is needed.
- Current frontend build consumes generated parser sources from `build/generated/antlr4` if present.
- There is also parser source in `src/antlr4/`, but current CMake does not wire that directory directly into the build.
- Safest migration path: copy the repo together with the current `build/generated/antlr4` directory, or later patch CMake to use `src/antlr4/*.cpp`.
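
Before configuring on a new machine, it may help to check which of the two parser source locations described above actually exists. The following is a minimal sketch, not a script from the repo: `check_parser_sources` is a hypothetical helper, and it assumes the generated parser sources are `.cpp` files.

```bash
# Hypothetical helper: report which parser source directory is populated.
# Assumption: generated parser sources are .cpp files in both locations.
check_parser_sources() {
    repo_root="${1:-.}"
    if ls "$repo_root"/build/generated/antlr4/*.cpp >/dev/null 2>&1; then
        # The current CMake setup consumes this directory when present.
        echo "generated"
    elif ls "$repo_root"/src/antlr4/*.cpp >/dev/null 2>&1; then
        # Sources exist, but current CMake does not wire this directory in.
        echo "src-only"
    else
        echo "missing"
    fi
}
```

Running `check_parser_sources .` from the repo root prints `generated`, `src-only`, or `missing`. A result of `src-only` suggests either copying `build/generated/antlr4` over from the old machine or patching CMake first.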
## Implemented IR / Loop Optimizations

Stable implemented items:

- `LICM`
- `StrengthReduction`
- `LoopFission`
- `LoopUnroll`
- conservative `LoopParallelization`
- `LoopIdiom` for constant-fill loops

Analysis infra already added:

- `DominatorTree`
- `LoopInfo`

Runtime support added:

- pthread worker-pool based `__par_runN` in `sylib/sylib.c`
- `__fill_i32` helper in `sylib/sylib.c`

User constraints already decided:

- Do not optimize the real-dependence matrix multiply in `2025-MYO-20` where `A[i][j]` is written and `A[k][j]` is read.
- Reduction parallelization is still disabled.

## Timing Scripts

Timing output was added to:

- `scripts/verify_ir.sh`
- `scripts/verify_asm.sh`
- `scripts/run_all_tests.sh`

User requirement:

- Every test round should always report:
  - `test/test_case/performance/2025-MYO-20.sy`
  - `./scripts/run_all_tests.sh --both`

## Recent ASM Correctness Fixes

Fixed issues:

- AArch64 call lowering bug that could corrupt ABI argument registers due to `W/X` aliasing.
- Duplicate local labels like `.par.exit` across worker functions by prefixing block labels with the function name.
- Duplicate callee-saved save/restore of alias registers like `w8/x8`.

Relevant files:

- `src/mir/Lowering.cpp`
- `src/mir/AsmPrinter.cpp`
- `src/mir/MIRFunction.cpp`

## Recent ASM Optimization Work

Implemented recently:

- post-regalloc second peephole pass in `src/main.cpp`
- selective safe load forwarding guard for ABI argument registers
- `cbz/cbnz` lowering for integer compare-against-zero in `Cmp + CondBr` fusion
- dead overwrite elimination in peephole for adjacent load/compute that gets overwritten before use

Relevant files:

- `src/main.cpp`
- `src/mir/Lowering.cpp`
- `src/mir/passes/Peephole.cpp`

## Most Recent Measured Performance

These are the latest measured numbers observed during this session.

IR:

- `2025-MYO-20` stable reference before latest ASM-only work:
  - around `31.109s`
  - earlier stable reference before that: around `30.926s`

ASM:

- `02_mv3`
  - earlier problematic run after correctness-only fix: about `31.662s`
  - after later backend cleanup, best observed run in this session: about `31.505s`
  - another later run: about `31.529s`
- `01_mm2`
  - earlier reference in this session: about `38.010s`
  - later improved run: about `37.346s`

Interpretation:

- ASM backend improvements are real but modest so far.
- Main remaining bottleneck is still heavy stack traffic in hot loops.

## Current Long-Running Item

- A standalone `2025-MYO-20` ASM run was launched and had not finished at the time this handoff file was written.
- A full `./scripts/run_all_tests.sh --both` run had progressed to the final `2025-MYO-20` ASM item instead of failing early, but final completion time was still pending.

## Good Commands To Resume Work

Build:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target compiler
```

Quick correctness:

```bash
./scripts/verify_ir.sh test/test_case/functional/simple_add.sy /tmp/ir_check --run
./scripts/verify_asm.sh test/test_case/functional/simple_add.sy /tmp/asm_check --run
```

User-required fixed benchmarks:

```bash
./scripts/verify_ir.sh test/test_case/performance/2025-MYO-20.sy /tmp/timed_2025 --run
./scripts/run_all_tests.sh --both
```

Useful ASM profiling targets:

```bash
./scripts/verify_asm.sh test/test_case/performance/01_mm2.sy /tmp/asm_mm2 --run
./scripts/verify_asm.sh test/test_case/performance/02_mv3.sy /tmp/asm_mv3 --run
./scripts/verify_asm.sh test/test_case/performance/2025-MYO-20.sy /tmp/asm_2025 --run
```

Inspect generated assembly:

```bash
./build/bin/compiler --emit-asm test/test_case/performance/02_mv3.sy > /tmp/02_mv3.s
./build/bin/compiler --emit-asm test/test_case/performance/01_mm2.sy > /tmp/01_mm2.s
./build/bin/compiler --emit-asm test/test_case/performance/2025-MYO-20.sy > /tmp/2025.s
```

## Suggested Next Steps

Priority order:

1. Finish measuring `2025-MYO-20` ASM and a complete `--both` run on the faster Ubuntu machine.
2. Keep working on MIR/ASM backend, not IR parallelization.
3. Target hot-loop stack traffic:
   - reduce phi-related spill/reload churn
   - widen zero-compare branch simplification beyond the current fused path
   - add more dead store / dead load cleanup after frame lowering
4. Only claim speedups when confirmed with the fixed benchmark pair above.
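
For step 3, a rough static signal of stack traffic can be taken from the emitted assembly before committing to a long benchmark run. This is only a sketch: `count_sp_traffic` is a hypothetical helper, and its regex assumes the backend prints sp-relative accesses in the usual AArch64 form, such as `ldr w8, [sp, #16]`.

```bash
# Hypothetical helper: count loads/stores that address memory through sp.
# Assumption: accesses look like "ldr w8, [sp, #16]" or "stp x29, x30, [sp, #-16]!".
count_sp_traffic() {
    # grep -c prints the number of matching lines; "|| true" keeps the exit
    # status zero when there are no matches (grep returns 1 in that case).
    grep -cE '^[[:space:]]*(ldr|str|ldp|stp|ldur|stur)[a-z]*[[:space:]].*\[sp' "$1" || true
}
```

For example, comparing `count_sp_traffic /tmp/02_mv3.s` before and after a backend change shows whether spill/reload instructions went down. The count is static and does not weight hot loops, so it complements the fixed benchmark pair rather than replacing it.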