forked from NUDT-compiler/nudt-compiler-cpp
# Session Handoff

Date: 2026-05-28

## Repo State

- Current branch: `Shrink`
- Worktree is dirty; do not reset blindly.
- Modified tracked files:
  - `include/ir/IR.h`
  - `scripts/run_all_tests.sh`
  - `scripts/verify_asm.sh`
  - `scripts/verify_ir.sh`
  - `src/ir/analysis/DominatorTree.cpp`
  - `src/ir/analysis/LoopInfo.cpp`
  - `src/ir/passes/CMakeLists.txt`
  - `src/ir/passes/PassManager.cpp`
  - `src/main.cpp`
  - `src/mir/AsmPrinter.cpp`
  - `src/mir/Lowering.cpp`
  - `src/mir/MIRFunction.cpp`
  - `src/mir/passes/Peephole.cpp`
  - `sylib/sylib.c`
- New untracked files:
  - `src/ir/passes/LICM.cpp`
  - `src/ir/passes/LoopFission.cpp`
  - `src/ir/passes/LoopIdiom.cpp`
  - `src/ir/passes/LoopParallelize.cpp`
  - `src/ir/passes/LoopPassUtils.h`
  - `src/ir/passes/LoopUnroll.cpp`
  - `src/ir/passes/StrengthReduction.cpp`

## Toolchain On Current Machine

- `cmake 3.22.1`
- `g++ 11.4.0`
- `clang 14.0.0`
- `llc 14.0.0`
- `aarch64-linux-gnu-gcc 11.4.0`
- `qemu-aarch64 6.2.0`

Required packages on a fresh Ubuntu:

```bash
sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  clang \
  llvm \
  gcc-aarch64-linux-gnu \
  qemu-user \
  libc6-arm64-cross
```

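After installing the packages on a new machine, it may help to confirm that the tools listed above are actually on `PATH` before building. A minimal sketch follows; `check_tools` is a hypothetical helper name, not something that exists in the repo scripts, and it only checks presence, not versions.

```bash
# check_tools TOOL...: print each tool that is missing from PATH, one per line.
# Returns non-zero if at least one tool is missing. (Hypothetical helper.)
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the tool names from the list above:
#   check_tools cmake g++ clang llc aarch64-linux-gnu-gcc qemu-aarch64
```

An empty output with a zero exit status means every listed tool was found.
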
## Important Build Detail

- The repo vendors `antlr4-runtime-4.13.2` in `third_party`, so no system ANTLR runtime install is needed.
- Current frontend build consumes generated parser sources from `build/generated/antlr4` if present.
- There is also parser source in `src/antlr4/`, but current CMake does not wire that directory directly into the build.
- Safest migration path: copy the repo together with the current `build/generated/antlr4` directory, or later patch CMake to use `src/antlr4/*.cpp`.

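Since the build depends on `build/generated/antlr4` being present, a quick pre-build check after copying the repo can catch a missing directory early. The sketch below is an assumption-laden helper: `has_generated_parser` is a hypothetical name, and it only tests that at least one `.cpp` file exists under the directory, not that the sources are complete or current. It relies on `find -quit`, which GNU find on Ubuntu supports.

```bash
# has_generated_parser [DIR]: succeed if DIR contains at least one .cpp file.
# DIR defaults to build/generated/antlr4. (Hypothetical helper.)
has_generated_parser() {
  local dir="${1:-build/generated/antlr4}"
  [ -d "$dir" ] && [ -n "$(find "$dir" -type f -name '*.cpp' -print -quit)" ]
}

# Usage:
#   has_generated_parser || echo "generated parser sources missing" >&2
```
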
## Implemented IR / Loop Optimizations

Stable implemented items:

- `LICM`
- `StrengthReduction`
- `LoopFission`
- `LoopUnroll`
- conservative `LoopParallelization`
- `LoopIdiom` for constant-fill loops

Analysis infra already added:

- `DominatorTree`
- `LoopInfo`

Runtime support added:

- pthread worker-pool based `__par_runN` in `sylib/sylib.c`
- `__fill_i32` helper in `sylib/sylib.c`

User constraints already decided:

- Do not optimize the real-dependence matrix multiply in `2025-MYO-20` where `A[i][j]` is written and `A[k][j]` is read.
- Reduction parallelization is still disabled.

## Timing Scripts

Timing output was added to:

- `scripts/verify_ir.sh`
- `scripts/verify_asm.sh`
- `scripts/run_all_tests.sh`

User requirement:

- Every test round should always report:
  - `test/test_case/performance/2025-MYO-20.sy`
  - `./scripts/run_all_tests.sh --both`

## Recent ASM Correctness Fixes

Fixed issues:

- AArch64 call lowering bug that could corrupt ABI argument registers due to `W/X` aliasing.
- Duplicate local labels like `.par.exit` across worker functions; fixed by prefixing block labels with the function name.
- Duplicate callee-saved save/restore of alias registers like `w8/x8`.

Relevant files:

- `src/mir/Lowering.cpp`
- `src/mir/AsmPrinter.cpp`
- `src/mir/MIRFunction.cpp`

## Recent ASM Optimization Work

Implemented recently:

- post-regalloc second peephole pass in `src/main.cpp`
- selective safe load forwarding guard for ABI argument registers
- `cbz/cbnz` lowering for integer compare-against-zero in `Cmp + CondBr` fusion
- dead overwrite elimination in peephole for adjacent load/compute that gets overwritten before use

Relevant files:

- `src/main.cpp`
- `src/mir/Lowering.cpp`
- `src/mir/passes/Peephole.cpp`

## Most Recent Measured Performance

These are the latest measured numbers observed during this session.

IR:

- `2025-MYO-20` stable reference before latest ASM-only work:
  - around `31.109s`
  - earlier stable reference before that: around `30.926s`

ASM:

- `02_mv3`
  - earlier problematic run after correctness-only fix: about `31.662s`
  - after later backend cleanup, best observed run in this session: about `31.505s`
  - another later run: about `31.529s`
- `01_mm2`
  - earlier reference in this session: about `38.010s`
  - later improved run: about `37.346s`

Interpretation:

- ASM backend improvements are real but modest so far.
- Main remaining bottleneck is still heavy stack traffic in hot loops.

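To put "real but modest" in numbers: the ASM timings above work out to roughly a 0.5% reduction for `02_mv3` (`31.662s` to `31.505s`) and roughly 1.7% for `01_mm2` (`38.010s` to `37.346s`). These come from single runs, so they are indicative only. A small helper for the arithmetic, where `pct_gain` is a hypothetical name and not part of the repo scripts:

```bash
# pct_gain BEFORE AFTER: percentage reduction in runtime, to two decimals.
# (Hypothetical helper; inputs are wall-clock seconds without the "s" suffix.)
pct_gain() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (before - after) / before * 100 }'
}

pct_gain 31.662 31.505   # 02_mv3: prints 0.50
pct_gain 38.010 37.346   # 01_mm2: prints 1.75
```
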
## Current Long-Running Item

- A standalone `2025-MYO-20` ASM run was launched and had not finished at the time this handoff file was written.
- A full `./scripts/run_all_tests.sh --both` run had progressed to the final `2025-MYO-20` ASM item instead of failing early, but final completion time was still pending.

## Good Commands To Resume Work

Build:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)" --target compiler
```

Quick correctness:

```bash
./scripts/verify_ir.sh test/test_case/functional/simple_add.sy /tmp/ir_check --run
./scripts/verify_asm.sh test/test_case/functional/simple_add.sy /tmp/asm_check --run
```

User-required fixed benchmarks:

```bash
./scripts/verify_ir.sh test/test_case/performance/2025-MYO-20.sy /tmp/timed_2025 --run
./scripts/run_all_tests.sh --both
```

Useful ASM profiling targets:

```bash
./scripts/verify_asm.sh test/test_case/performance/01_mm2.sy /tmp/asm_mm2 --run
./scripts/verify_asm.sh test/test_case/performance/02_mv3.sy /tmp/asm_mv3 --run
./scripts/verify_asm.sh test/test_case/performance/2025-MYO-20.sy /tmp/asm_2025 --run
```

Inspect generated assembly:

```bash
./build/bin/compiler --emit-asm test/test_case/performance/02_mv3.sy > /tmp/02_mv3.s
./build/bin/compiler --emit-asm test/test_case/performance/01_mm2.sy > /tmp/01_mm2.s
./build/bin/compiler --emit-asm test/test_case/performance/2025-MYO-20.sy > /tmp/2025.s
```

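Because stack traffic in hot loops is the main suspected bottleneck, a crude static count of stack-relative loads and stores in the emitted `.s` files can serve as a before/after signal when changing the backend. The sketch below rests on assumptions: that the printer emits one instruction per line and that stack slots are addressed through `sp` or `x29`. `count_stack_ops` is a hypothetical helper, and the count is not weighted by how often a loop executes.

```bash
# count_stack_ops FILE: count load/store instructions that address memory
# through sp or x29 (assumed to be the stack/frame base). Hypothetical helper.
count_stack_ops() {
  # grep -c exits non-zero when the count is 0, so mask that with "|| true".
  grep -cE '^[[:space:]]*(ldr|str|ldp|stp|ldur|stur)[a-z]*[[:space:]].*\[(sp|x29)[],]' "$1" || true
}

# Usage:
#   count_stack_ops /tmp/02_mv3.s
```

A falling count across backend changes is only a hint; the fixed benchmark timings remain the deciding measurement.
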
## Suggested Next Steps

Priority order:

1. Finish measuring `2025-MYO-20` ASM and a complete `--both` run on the faster Ubuntu machine.
2. Keep working on MIR/ASM backend, not IR parallelization.
3. Target hot-loop stack traffic:
   - reduce phi-related spill/reload churn
   - widen zero-compare branch simplification beyond the current fused path
   - add more dead store / dead load cleanup after frame lowering
4. Only claim speedups when confirmed with the fixed benchmark pair above.