Session Handoff

Date: 2026-05-28

Repo State

  • Current branch: Shrink
  • Worktree is dirty; do not reset blindly.
  • Modified tracked files:
    • include/ir/IR.h
    • scripts/run_all_tests.sh
    • scripts/verify_asm.sh
    • scripts/verify_ir.sh
    • src/ir/analysis/DominatorTree.cpp
    • src/ir/analysis/LoopInfo.cpp
    • src/ir/passes/CMakeLists.txt
    • src/ir/passes/PassManager.cpp
    • src/main.cpp
    • src/mir/AsmPrinter.cpp
    • src/mir/Lowering.cpp
    • src/mir/MIRFunction.cpp
    • src/mir/passes/Peephole.cpp
    • sylib/sylib.c
  • New untracked files:
    • src/ir/passes/LICM.cpp
    • src/ir/passes/LoopFission.cpp
    • src/ir/passes/LoopIdiom.cpp
    • src/ir/passes/LoopParallelize.cpp
    • src/ir/passes/LoopPassUtils.h
    • src/ir/passes/LoopUnroll.cpp
    • src/ir/passes/StrengthReduction.cpp

Toolchain On Current Machine

  • cmake 3.22.1
  • g++ 11.4.0
  • clang 14.0.0
  • llc 14.0.0
  • aarch64-linux-gnu-gcc 11.4.0
  • qemu-aarch64 6.2.0

Required packages on a fresh Ubuntu:

sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  clang \
  llvm \
  gcc-aarch64-linux-gnu \
  qemu-user \
  libc6-arm64-cross

Important Build Detail

  • The repo vendors antlr4-runtime-4.13.2 in third_party, so no system ANTLR runtime install is needed.
  • Current frontend build consumes generated parser sources from build/generated/antlr4 if present.
  • There is also parser source in src/antlr4/, but current CMake does not wire that directory directly into the build.
  • Safest migration path: copy the repo together with the current build/generated/antlr4 directory, or later patch CMake to use src/antlr4/*.cpp.

Implemented IR / Loop Optimizations

Stable implemented items:

  • LICM
  • StrengthReduction
  • LoopFission
  • LoopUnroll
  • conservative LoopParallelization
  • LoopIdiom for constant-fill loops

Analysis infra already added:

  • DominatorTree
  • LoopInfo

Runtime support added:

  • pthread worker-pool based __par_runN in sylib/sylib.c
  • __fill_i32 helper in sylib/sylib.c
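
To make the LoopIdiom rewrite concrete, here is a minimal sketch of a constant-fill loop and the helper call that could replace it. The name fill_i32_sketch and its (pointer, value, count) signature are assumptions made for illustration; the real __fill_i32 in sylib/sylib.c may take different arguments.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the runtime helper; the real __fill_i32
// signature in sylib/sylib.c may differ.
void fill_i32_sketch(std::int32_t* dst, std::int32_t value, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = value;
    }
}

// The source-level shape a constant-fill idiom pass looks for:
// every iteration stores the same loop-invariant value to a[i].
std::vector<std::int32_t> fill_with_loop(std::size_t n, std::int32_t value) {
    std::vector<std::int32_t> a(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = value;
    }
    return a;
}

// The same work after the rewrite: one helper call, no loop in the caller.
std::vector<std::int32_t> fill_with_helper(std::size_t n, std::int32_t value) {
    std::vector<std::int32_t> a(n);
    fill_i32_sketch(a.data(), value, n);
    return a;
}
```

Both functions produce the same array; the second form moves the loop into the runtime, where it can be tuned once instead of per call site.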

User constraints already decided:

  • Do not optimize the matrix multiply in 2025-MYO-20 that carries a real dependence, where A[i][j] is written and A[k][j] is read.
  • Reduction parallelization is still disabled.
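
The sketch below illustrates why that access pattern blocks parallelization over i. The kernel is hypothetical: it only mimics the "write A[i][j], read A[k][j]" shape and is not the actual 2025-MYO-20 code. Running the outer loop forward and backward gives different results, so the i iterations cannot be reordered or run concurrently.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// Hypothetical kernel with the same access pattern as the excluded loop:
// row i is written while row k is read. Assumes a square matrix.
Matrix sweep(Matrix a, bool forward) {
    const std::size_t n = a.size();
    for (std::size_t step = 0; step < n; ++step) {
        // Visit the rows in ascending or descending order.
        const std::size_t i = forward ? step : n - 1 - step;
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) {
                a[i][j] += a[k][j];  // reads rows that other i iterations write
            }
        }
    }
    return a;
}
```

For the input {{1, 2}, {3, 4}}, the forward sweep yields {{5, 8}, {16, 24}} and the backward sweep yields {{10, 16}, {8, 12}}. The result depends on iteration order, which is what a conservative parallelizer has to rule out before splitting the loop.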

Timing Scripts

Timing output was added to:

  • scripts/verify_ir.sh
  • scripts/verify_asm.sh
  • scripts/run_all_tests.sh

User requirement:

  • Every test round must report timing for both of these:
    • test/test_case/performance/2025-MYO-20.sy
    • ./scripts/run_all_tests.sh --both

Recent ASM Correctness Fixes

Fixed issues:

  • AArch64 call lowering bug that could corrupt ABI argument registers due to W/X aliasing.
  • Duplicate local labels like .par.exit across worker functions, fixed by prefixing block labels with the function name.
  • Duplicate callee-saved save/restore of alias registers like w8/x8.
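
The alias and label fixes above can be sketched as follows. This is an illustration under assumptions, not the code in src/mir: registers are modeled as plain strings, and physIndex, dedupSaves, qualifiedLabel, and the ".L" label scheme are all hypothetical names chosen for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Maps "w19" and "x19" to the same physical index 19.
// Returns nullopt for anything that is not w<N> or x<N> with N in 0..30.
std::optional<int> physIndex(const std::string& reg) {
    if (reg.size() < 2 || reg.size() > 3) return std::nullopt;
    if (reg[0] != 'w' && reg[0] != 'x') return std::nullopt;
    int index = 0;
    for (std::size_t i = 1; i < reg.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(reg[i]))) return std::nullopt;
        index = index * 10 + (reg[i] - '0');
    }
    if (index > 30) return std::nullopt;
    return index;
}

// Collapses aliases so each physical register is saved once, as its X form.
// It does not decide which registers are callee-saved; the caller does.
std::vector<std::string> dedupSaves(const std::vector<std::string>& used) {
    std::set<int> seen;
    std::vector<std::string> saves;
    for (const std::string& reg : used) {
        const std::optional<int> index = physIndex(reg);
        if (index && seen.insert(*index).second) {
            saves.push_back("x" + std::to_string(*index));
        }
    }
    return saves;
}

// Hypothetical label scheme: prefix the block label with the function name
// so two functions can both contain a ".par.exit" block.
std::string qualifiedLabel(const std::string& func, const std::string& block) {
    return ".L" + func + block;
}
```

Keying the save list on the physical index rather than the register name is what removes the duplicate save/restore: w19 and x19 become one entry. The same keying would apply when checking whether a W write clobbers an X argument register.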

Relevant files:

  • src/mir/Lowering.cpp
  • src/mir/AsmPrinter.cpp
  • src/mir/MIRFunction.cpp

Recent ASM Optimization Work

Implemented recently:

  • post-regalloc second peephole pass in src/main.cpp
  • guard that restricts load forwarding to cases that are safe for ABI argument registers
  • cbz/cbnz lowering for integer compare-against-zero in Cmp + CondBr fusion
  • dead overwrite elimination in peephole for an adjacent load/compute whose result is overwritten before use
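
As an illustration of the cbz/cbnz item, the sketch below fuses a compare-against-zero with the following conditional branch. The Inst struct and fuseZeroCompare are hypothetical stand-ins, not the MIR types in src/mir, and the sketch assumes the condition flags are not read after the branch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal stand-in for a machine instruction; the real MIR types differ.
struct Inst {
    std::string op;  // e.g. "cmp", "b.eq", "cbz"
    std::string a;   // first operand
    std::string b;   // second operand, empty if unused
};

// Rewrites `cmp r, #0` + `b.eq L` into `cbz r, L`, and `b.ne` into `cbnz`.
// Assumes NZCV flags are dead after the branch; a real pass must check that,
// because cbz/cbnz do not set flags.
std::vector<Inst> fuseZeroCompare(const std::vector<Inst>& in) {
    std::vector<Inst> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool hasNext = i + 1 < in.size();
        if (hasNext && in[i].op == "cmp" && in[i].b == "#0") {
            const Inst& br = in[i + 1];
            if (br.op == "b.eq" || br.op == "b.ne") {
                out.push_back(Inst{br.op == "b.eq" ? "cbz" : "cbnz", in[i].a, br.a});
                ++i;  // skip the branch that was folded in
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Only eq/ne branches are fused here, since those are the two conditions cbz and cbnz can express directly. Compares against a nonzero immediate pass through unchanged.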

Relevant files:

  • src/main.cpp
  • src/mir/Lowering.cpp
  • src/mir/passes/Peephole.cpp

Most Recent Measured Performance

These are the latest measured numbers observed during this session.

IR:

  • 2025-MYO-20 stable reference before latest ASM-only work:
    • around 31.109s
    • earlier stable reference before that: around 30.926s

ASM:

  • 02_mv3
    • earlier problematic run after correctness-only fix: about 31.662s
    • after later backend cleanup, best observed run in this session: about 31.505s
    • another later run: about 31.529s
  • 01_mm2
    • earlier reference in this session: about 38.010s
    • later improved run: about 37.346s

Interpretation:

  • ASM backend improvements are real but modest so far: about 0.5% on 02_mv3 (31.662s to 31.505s) and about 1.7% on 01_mm2 (38.010s to 37.346s).
  • Main remaining bottleneck is still heavy stack traffic in hot loops.

Current Long-Running Item

  • A standalone 2025-MYO-20 ASM run was launched and had not finished at the time this handoff file was written.
  • A full ./scripts/run_all_tests.sh --both run had progressed to the final 2025-MYO-20 ASM item instead of failing early, but final completion time was still pending.

Good Commands To Resume Work

Build:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target compiler

Quick correctness:

./scripts/verify_ir.sh test/test_case/functional/simple_add.sy /tmp/ir_check --run
./scripts/verify_asm.sh test/test_case/functional/simple_add.sy /tmp/asm_check --run

User-required fixed benchmarks:

./scripts/verify_ir.sh test/test_case/performance/2025-MYO-20.sy /tmp/timed_2025 --run
./scripts/run_all_tests.sh --both

Useful ASM profiling targets:

./scripts/verify_asm.sh test/test_case/performance/01_mm2.sy /tmp/asm_mm2 --run
./scripts/verify_asm.sh test/test_case/performance/02_mv3.sy /tmp/asm_mv3 --run
./scripts/verify_asm.sh test/test_case/performance/2025-MYO-20.sy /tmp/asm_2025 --run

Inspect generated assembly:

./build/bin/compiler --emit-asm test/test_case/performance/02_mv3.sy > /tmp/02_mv3.s
./build/bin/compiler --emit-asm test/test_case/performance/01_mm2.sy > /tmp/01_mm2.s
./build/bin/compiler --emit-asm test/test_case/performance/2025-MYO-20.sy > /tmp/2025.s

Suggested Next Steps

Priority order:

  1. Finish measuring 2025-MYO-20 ASM and a complete --both run on the faster Ubuntu machine.
  2. Keep working on MIR/ASM backend, not IR parallelization.
  3. Target hot-loop stack traffic:
    • reduce phi-related spill/reload churn
    • widen zero-compare branch simplification beyond the current fused path
    • add more dead store / dead load cleanup after frame lowering
  4. Only claim speedups when confirmed with the fixed benchmark pair above.
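
One possible shape for the dead-store cleanup in step 3 is a backward scan over a single basic block. The sketch below is an assumption-laden illustration: StackOp and dropDeadStores are hypothetical, only direct stack-slot stores and loads are modeled, and calls or pointer-based accesses that could alias a slot are assumed absent.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Minimal stand-in for post-frame-lowering stack accesses in one basic block.
struct StackOp {
    enum Kind { Store, Load } kind;
    int slot;  // hypothetical stack-slot id
};

// Backward scan: a store is dead if the same slot is stored again later
// in the block with no load of that slot in between.
// Slots are treated as live at block exit, so the last store always stays.
std::vector<StackOp> dropDeadStores(const std::vector<StackOp>& block) {
    std::set<int> overwrittenLater;
    std::vector<bool> keep(block.size(), true);
    for (std::size_t n = block.size(); n-- > 0;) {
        const StackOp& op = block[n];
        if (op.kind == StackOp::Load) {
            overwrittenLater.erase(op.slot);  // the stored value is needed here
        } else if (!overwrittenLater.insert(op.slot).second) {
            keep[n] = false;                  // overwritten later without a read
        }
    }
    std::vector<StackOp> out;
    for (std::size_t n = 0; n < block.size(); ++n) {
        if (keep[n]) out.push_back(block[n]);
    }
    return out;
}
```

Treating every slot as live at block exit keeps the sketch conservative; a version that used liveness across blocks could remove more stores, but it would need the extra analysis to stay correct.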