forked from NUDT-compiler/nudt-compiler-cpp
Session Handoff
Date: 2026-05-28
Repo State
- Current branch: Shrink
- Worktree is dirty; do not reset blindly.
- Modified tracked files:
  - include/ir/IR.h
  - scripts/run_all_tests.sh
  - scripts/verify_asm.sh
  - scripts/verify_ir.sh
  - src/ir/analysis/DominatorTree.cpp
  - src/ir/analysis/LoopInfo.cpp
  - src/ir/passes/CMakeLists.txt
  - src/ir/passes/PassManager.cpp
  - src/main.cpp
  - src/mir/AsmPrinter.cpp
  - src/mir/Lowering.cpp
  - src/mir/MIRFunction.cpp
  - src/mir/passes/Peephole.cpp
  - sylib/sylib.c
- New untracked files:
  - src/ir/passes/LICM.cpp
  - src/ir/passes/LoopFission.cpp
  - src/ir/passes/LoopIdiom.cpp
  - src/ir/passes/LoopParallelize.cpp
  - src/ir/passes/LoopPassUtils.h
  - src/ir/passes/LoopUnroll.cpp
  - src/ir/passes/StrengthReduction.cpp
Toolchain On Current Machine
- cmake 3.22.1
- g++ 11.4.0
- clang 14.0.0
- llc 14.0.0
- aarch64-linux-gnu-gcc 11.4.0
- qemu-aarch64 6.2.0
Required packages on a fresh Ubuntu:
sudo apt update
sudo apt install -y \
build-essential \
cmake \
clang \
llvm \
gcc-aarch64-linux-gnu \
qemu-user \
libc6-arm64-cross
Important Build Detail
- The repo vendors antlr4-runtime-4.13.2 in third_party, so no system ANTLR runtime install is needed.
- The current frontend build consumes generated parser sources from build/generated/antlr4 if present.
- There is also parser source in src/antlr4/, but the current CMake does not wire that directory directly into the build.
- Safest migration path: copy the repo together with the current build/generated/antlr4 directory, or later patch CMake to use src/antlr4/*.cpp.
Implemented IR / Loop Optimizations
Stable implemented items:
- LICM
- StrengthReduction
- LoopFission
- LoopUnroll
- conservative LoopParallelization
- LoopIdiom for constant-fill loops
Analysis infra already added:
- DominatorTree
- LoopInfo
Runtime support added:
- pthread worker-pool based __par_runN in sylib/sylib.c
- __fill_i32 helper in sylib/sylib.c
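To make the constant-fill idiom concrete, here is a minimal sketch of the loop shape such a pass looks for and the kind of helper call it can be replaced with. The names fill_loop and fill_i32_sketch and the (pointer, value, count) parameter order are assumptions for illustration only; the real __fill_i32 signature in sylib/sylib.c is not recorded in this handoff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Loop shape of a constant-fill idiom: every iteration stores the same
// loop-invariant value to consecutive elements.
void fill_loop(std::int32_t* dst, std::int32_t value, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = value;
  }
}

// Hypothetical stand-in for a runtime fill helper (NOT the real __fill_i32;
// its signature is assumed here). One call replaces the whole loop.
void fill_i32_sketch(std::int32_t* dst, std::int32_t value, std::size_t n) {
  std::fill_n(dst, n, value);
}

// Returns true when the loop and the helper leave identical memory contents.
bool fill_forms_agree(std::size_t n, std::int32_t value) {
  std::vector<std::int32_t> a(n, -1), b(n, -1);
  fill_loop(a.data(), value, n);
  fill_i32_sketch(b.data(), value, n);
  return a == b;
}
```

The rewrite is only valid when the stored value is loop-invariant and the stores cover a contiguous range, which is presumably why the pass is limited to constant-fill loops.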
User constraints already decided:
- Do not optimize the real-dependence matrix multiply in 2025-MYO-20 where A[i][j] is written and A[k][j] is read.
- Reduction parallelization is still disabled.
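The first constraint can be illustrated with a small sketch. The kernel below is hypothetical, not the actual code in 2025-MYO-20; it only reproduces the access pattern named above (row i of A is written while other rows k of the same matrix are read) and shows that the result depends on the order in which the i iterations run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<int>>;

// Hypothetical kernel with the access pattern from the constraint:
// A[i][j] is written, A[k][j] is read for every k, on the same matrix.
// reverse_i selects the order in which the outer iterations execute.
Mat run_kernel(Mat A, bool reverse_i) {
  const std::size_t n = A.size();
  for (std::size_t step = 0; step < n; ++step) {
    const std::size_t i = reverse_i ? n - 1 - step : step;
    for (std::size_t j = 0; j < n; ++j) {
      int sum = 0;
      for (std::size_t k = 0; k < n; ++k) {
        sum += A[k][j];  // may read a row that an earlier i already rewrote
      }
      A[i][j] = sum;
    }
  }
  return A;
}

// True when iteration order changes the result, i.e. the outer loop
// carries a real dependence and cannot be split across workers as-is.
bool order_matters(const Mat& A) {
  return run_kernel(A, false) != run_kernel(A, true);
}
```

For the 2x2 input {{1, 2}, {3, 4}}, forward order yields {{4, 6}, {7, 10}} and reverse order yields {{5, 8}, {4, 6}}, so distributing the i iterations across worker threads would change the program's output.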
Timing Scripts
Timing output was added to:
- scripts/verify_ir.sh
- scripts/verify_asm.sh
- scripts/run_all_tests.sh
User requirement:
- Every test round should always report:
  - test/test_case/performance/2025-MYO-20.sy
  - ./scripts/run_all_tests.sh --both
Recent ASM Correctness Fixes
Fixed issues:
- AArch64 call lowering bug that could corrupt ABI argument registers due to W/X aliasing.
- Duplicate local labels like .par.exit across worker functions, fixed by prefixing block labels with the function name.
- Duplicate callee-saved save/restore of alias registers like w8/x8.
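The last two fixes can be sketched as follows. This is a minimal illustration, not the code in src/mir/AsmPrinter.cpp or src/mir/MIRFunction.cpp: the helper names, the string-based register representation, and the exact label format are all assumptions. The underlying facts are that wN and xN name the same physical AArch64 register, and that a label emitted by two functions into one assembly file must be made unique.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Maps "wN" / "xN" (N in 0..30) to the physical register number N.
// Returns -1 for anything else. Hypothetical helper for illustration.
int phys_index(const std::string& reg) {
  if (reg.size() < 2 || (reg[0] != 'w' && reg[0] != 'x')) return -1;
  int n = 0;
  for (std::size_t i = 1; i < reg.size(); ++i) {
    if (!std::isdigit(static_cast<unsigned char>(reg[i]))) return -1;
    n = n * 10 + (reg[i] - '0');
  }
  return n <= 30 ? n : -1;
}

// Collapses W/X aliases so each physical register is saved once,
// always as the full 64-bit xN view. Input order is preserved.
std::vector<std::string> unique_saves(const std::vector<std::string>& used) {
  std::set<int> seen;
  std::vector<std::string> out;
  for (const std::string& r : used) {
    const int p = phys_index(r);
    if (p >= 0 && seen.insert(p).second) {
      out.push_back("x" + std::to_string(p));
    }
  }
  return out;
}

// Prefixes a block label with its function name so that the same block
// name in two functions no longer collides. The format is assumed.
std::string qualify_label(const std::string& fn, const std::string& label) {
  if (!label.empty() && label[0] == '.') return "." + fn + label;
  return "." + fn + "." + label;
}
```

With these helpers, unique_saves({"w8", "x8", "x19"}) produces a single save for x8 followed by x19, and qualify_label("worker0", ".par.exit") produces ".worker0.par.exit".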
Relevant files:
- src/mir/Lowering.cpp
- src/mir/AsmPrinter.cpp
- src/mir/MIRFunction.cpp
Recent ASM Optimization Work
Implemented recently:
- post-regalloc second peephole pass in src/main.cpp
- selective safe load forwarding guard for ABI argument registers
- cbz/cbnz lowering for integer compare-against-zero in Cmp + CondBr fusion
- dead overwrite elimination in peephole for adjacent load/compute that gets overwritten before use
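The cbz/cbnz item can be sketched as a selection rule. The function below is a hypothetical simplification, not the fusion code in src/mir/Lowering.cpp: it works on strings instead of MIR, handles only four condition codes, and assumes the immediate fits a cmp encoding. It relies only on the A64 fact that cbz and cbnz branch on a register being zero or nonzero without a separate compare.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Cond { EQ, NE, LT, GE };

// Condition-code suffix used by the generic b.<cond> form.
const char* cond_suffix(Cond c) {
  switch (c) {
    case Cond::EQ: return "eq";
    case Cond::NE: return "ne";
    case Cond::LT: return "lt";
    case Cond::GE: return "ge";
  }
  return "";
}

// Emits the branch sequence for "if (reg <cond> imm) goto label".
// Equality tests against zero fuse into a single cbz/cbnz; everything
// else falls back to cmp + b.<cond>. Assumes imm is encodable in cmp.
std::vector<std::string> lower_cmp_branch(const std::string& reg, long imm,
                                          Cond c, const std::string& label) {
  if (imm == 0 && c == Cond::EQ) return {"cbz " + reg + ", " + label};
  if (imm == 0 && c == Cond::NE) return {"cbnz " + reg + ", " + label};
  return {"cmp " + reg + ", #" + std::to_string(imm),
          std::string("b.") + cond_suffix(c) + " " + label};
}
```

For example, lower_cmp_branch("w9", 0, Cond::NE, ".L1") yields the single instruction "cbnz w9, .L1", while a signed comparison such as LT against zero still takes the two-instruction path in this sketch.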
Relevant files:
- src/main.cpp
- src/mir/Lowering.cpp
- src/mir/passes/Peephole.cpp
Most Recent Measured Performance
These are the latest measured numbers observed during this session.
IR:
- 2025-MYO-20 stable reference before latest ASM-only work:
  - around 31.109s
  - earlier stable reference before that: around 30.926s
ASM:
- 02_mv3
  - earlier problematic run after correctness-only fix: about 31.662s
  - after later backend cleanup, best observed run in this session: about 31.505s
  - another later run: about 31.529s
- 01_mm2
  - earlier reference in this session: about 38.010s
  - later improved run: about 37.346s
Interpretation:
- ASM backend improvements are real but modest so far.
- Main remaining bottleneck is still heavy stack traffic in hot loops.
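To put a number on "modest", the relative improvement can be computed from the ASM timings listed above. The helper name is hypothetical, and these are single observed runs, so run-to-run noise is not accounted for.

```cpp
#include <cassert>
#include <cmath>

// Relative wall-clock improvement in percent: positive means faster.
double improvement_percent(double before_sec, double after_sec) {
  return (before_sec - after_sec) / before_sec * 100.0;
}

// Timings copied from the ASM measurements in this handoff.
double mm2_improvement() { return improvement_percent(38.010, 37.346); }
double mv3_improvement() { return improvement_percent(31.662, 31.505); }
```

This works out to roughly 1.7% for 01_mm2 and roughly 0.5% for 02_mv3. The 02_mv3 gain is smaller than the spread between its two later runs (31.505s versus 31.529s) is far from, so it should be treated as within plausible noise until confirmed with the fixed benchmark pair.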
Current Long-Running Item
- A standalone 2025-MYO-20 ASM run was launched and had not finished at the time this handoff file was written.
- A full ./scripts/run_all_tests.sh --both run had progressed to the final 2025-MYO-20 ASM item instead of failing early, but final completion time was still pending.
Good Commands To Resume Work
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target compiler
Quick correctness:
./scripts/verify_ir.sh test/test_case/functional/simple_add.sy /tmp/ir_check --run
./scripts/verify_asm.sh test/test_case/functional/simple_add.sy /tmp/asm_check --run
User-required fixed benchmarks:
./scripts/verify_ir.sh test/test_case/performance/2025-MYO-20.sy /tmp/timed_2025 --run
./scripts/run_all_tests.sh --both
Useful ASM profiling targets:
./scripts/verify_asm.sh test/test_case/performance/01_mm2.sy /tmp/asm_mm2 --run
./scripts/verify_asm.sh test/test_case/performance/02_mv3.sy /tmp/asm_mv3 --run
./scripts/verify_asm.sh test/test_case/performance/2025-MYO-20.sy /tmp/asm_2025 --run
Inspect generated assembly:
./build/bin/compiler --emit-asm test/test_case/performance/02_mv3.sy > /tmp/02_mv3.s
./build/bin/compiler --emit-asm test/test_case/performance/01_mm2.sy > /tmp/01_mm2.s
./build/bin/compiler --emit-asm test/test_case/performance/2025-MYO-20.sy > /tmp/2025.s
Suggested Next Steps
Priority order:
- Finish measuring 2025-MYO-20 ASM and a complete --both run on the faster Ubuntu machine.
- Keep working on MIR/ASM backend, not IR parallelization.
- Target hot-loop stack traffic:
  - reduce phi-related spill/reload churn
  - widen zero-compare branch simplification beyond the current fused path
  - add more dead store / dead load cleanup after frame lowering
- Only claim speedups when confirmed with the fixed benchmark pair above.