Compare commits
2 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 6f48016c10 | |
| | 4475e91bd8 | |
91
doc/Lab5-实验记录.md
Normal file
@@ -0,0 +1,91 @@
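# Lab5 Lab Notes: Register Allocation and Backend Peephole Optimization

## 1. Goals

The core goal of Lab5 is to implement register allocation and backend optimizations on top of the existing IR-generation and assembly-generation framework.

The work completed in this lab focuses on:

- Within the AArch64 code-generation framework, understanding and adapting the mapping from virtual registers to physical registers (linear scan or basic graph coloring).
- Implementing a backend peephole optimization that removes redundant register moves (such as `mov w8, w8`) and unnecessary stack loads and stores (such as a redundant load-after-store).
- Handling AArch64 register aliasing (W versus X registers) and the boundary between floating-point and general-purpose registers, and fixing the side effect of loading floating-point constants.
- Running the full functional test suite (`verify_asm.sh`) to confirm that the generated assembly runs correctly under the QEMU emulator.
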
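## 2. Scope of Code Changes

This lab touches the following modules:

- `include/mir/MIR.h`: adds the declaration of the `RunPeephole` pass.
- `src/mir/passes/Peephole.cpp`: implements the backend peephole pass, including register size matching, register alias normalization, and elimination of redundant stack reads and writes.
- `src/main.cpp`: inserts the backend optimization entry point `RunPeephole` into the assembly-generation pipeline.
- New document: `doc/Lab5-实验记录.md`.
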
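## 3. Implementation Process

### 3.1 Locating the Problem and Analyzing the Pain Points

Before the backend peephole pass was added, the compiler already emitted working AArch64 assembly. Because register allocation and stack-slot management are conservative, however, the generated code contained many instances of the following:

1. Redundant self-moves (such as `mov w9, w9` and `mov x8, x8`).
2. In spill and reload sequences, a `StoreStack` immediately followed by a `LoadStack` into the same physical register.
3. Floating-point constants on AArCh64 are normally loaded from a constant pool (`adrp` + `ldr`), which temporarily occupies a general-purpose register (such as `x8`/`w8`).

If the peephole pass does not model AArch64 general-purpose register aliasing (Wn is the low 32 bits of Xn) and implicit register writes precisely, it performs incorrect rewrites. Floating-point comparisons then compile to wrong assembly, which shows up under QEMU as a segmentation fault or as mismatched output.
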
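### 3.2 Design and Implementation of the Peephole Pass

To balance performance and correctness, `src/mir/passes/Peephole.cpp` implements a peephole scan that processes one basic block at a time and carries its tracking state forward through the block:

1. **Normalizing aliased physical registers (`NormalizeReg`)**:
   On AArch64, `W0` to `W28` and `X0` to `X28` overlap one-to-one. When tracking and eliminating redundant load-after-store pairs, every 64-bit register is first converted to its 32-bit alias, so that a difference in operand size (W vs X) does not break alias tracking.

2. **Matching register sizes (`MatchRegSize`)**:
   When a `LoadStack` is replaced with a `MovReg`, a 64-bit source register (such as X9) cannot be combined with a 32-bit destination (such as W0): `mov w0, x9` is not a valid instruction. `MatchRegSize` rewrites the source to the destination's width, giving `mov w0, w9`, so that the GNU assembler accepts the result.

3. **Tracking implicit register writes**:
   The pass recognizes instructions that implicitly read or write the scratch register `x8`/`w8` (for example a floating-point `MovImm`). When it scans such an instruction, it invalidates the tracking state of the overwritten register, which fixes the resulting register clobbering.
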
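The scan described above can be sketched as a small standalone pass. This is an illustrative model only, not the code in `Peephole.cpp`: the `Inst` struct, the `Op` enum, the function name `RunToyPeephole`, and the plain `int` register and slot ids are assumptions made for the example, and operand widths are left out entirely.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical mini instruction set used only for this sketch.
enum class Op { Mov, Store, Load, Def };

struct Inst {
    Op op;
    int reg;  // Mov/Load/Def: destination register; Store: source register
    int arg;  // Mov: source register; Store/Load: stack slot; Def: unused
};

// Scans one basic block and returns the simplified instruction list.
std::vector<Inst> RunToyPeephole(const std::vector<Inst>& block) {
    std::vector<Inst> out;
    std::unordered_map<int, int> slot_to_reg;  // slot -> register holding its value

    // Forget every slot whose cached copy lives in `reg`.
    auto invalidate = [&slot_to_reg](int reg) {
        for (auto it = slot_to_reg.begin(); it != slot_to_reg.end();) {
            if (it->second == reg) {
                it = slot_to_reg.erase(it);
            } else {
                ++it;
            }
        }
    };

    for (const Inst& inst : block) {
        switch (inst.op) {
            case Op::Mov:
                if (inst.reg == inst.arg) continue;  // self-move: drop it
                invalidate(inst.reg);
                break;
            case Op::Store:
                slot_to_reg[inst.arg] = inst.reg;  // slot now mirrors this register
                break;
            case Op::Load: {
                auto it = slot_to_reg.find(inst.arg);
                if (it != slot_to_reg.end()) {
                    const int holder = it->second;
                    if (holder == inst.reg) continue;  // value is already in place
                    invalidate(inst.reg);
                    out.push_back({Op::Mov, inst.reg, holder});  // forward via a move
                    continue;
                }
                invalidate(inst.reg);
                slot_to_reg[inst.arg] = inst.reg;  // the register now mirrors the slot
                break;
            }
            case Op::Def:
                invalidate(inst.reg);  // any other write clobbers the register
                break;
        }
        out.push_back(inst);
    }
    return out;
}
```

In this model, a store to slot 0 from register 8 followed by a load of slot 0 into register 8 loses the load, while a later load of slot 0 into register 9 becomes a register move for as long as register 8 has not been overwritten.
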
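## 4. Key Difficulties and Solutions

### 4.1 Difficulty 1: Side Effect of Implicitly Loading Floating-Point Constants

#### Symptom

When compiling the floating-point test case `95_float.sy`, some floating-point comparisons produced wrong results. Tracing showed that a floating-point `MovImm` is ultimately lowered to a PC-relative load (`adrp` + `ldr`) from `rodata`. That sequence implicitly uses the general-purpose register `x8`/`w8` and destroys the `x8`/`w8` value that the pass was tracking.

#### Solution

In the write-invalidation logic of `Peephole.cpp`, the pass explicitly checks the destination register class of `MovImm`. If the destination is a floating-point register (`S0` - `S15`), every `slot_to_reg` entry that refers to `x8`/`w8` is erased.
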
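The invalidation rule can be illustrated in isolation. The sketch below is a simplified model under stated assumptions: registers are plain integers (0..30 general-purpose after alias normalization, 100 and up floating-point), and the names `SlotMap`, `InvalidateReg`, `OnMovImm`, `kScratchReg`, and `kFirstFloatReg` are hypothetical, not taken from the project.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical register ids for this sketch: 0..30 are general-purpose
// registers (after alias normalization), 100 and up are float registers.
constexpr int kScratchReg = 8;       // stands for x8/w8
constexpr int kFirstFloatReg = 100;

using SlotMap = std::unordered_map<int, int>;  // frame index -> register id

// Drop every entry whose cached value lives in `reg`.
void InvalidateReg(SlotMap& slots, int reg) {
    for (auto it = slots.begin(); it != slots.end();) {
        if (it->second == reg) {
            it = slots.erase(it);
        } else {
            ++it;
        }
    }
}

// Bookkeeping for an immediate load into `dest`.
void OnMovImm(SlotMap& slots, int dest) {
    InvalidateReg(slots, dest);  // the explicit destination is overwritten
    if (dest >= kFirstFloatReg) {
        // Assumed lowering: a float immediate goes through adrp + ldr and
        // uses the scratch register, so slots cached there are stale too.
        InvalidateReg(slots, kScratchReg);
    }
}
```

Keeping the implicit clobber next to the explicit one means a new lowering that borrows another scratch register only needs one more `InvalidateReg` call in the same place.
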
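#### Effect

With implicit writes invalidated, the register clobbering caused by constant-pool loads no longer occurs, and floating-point arithmetic and comparisons in the tested cases produce correct results.

### 4.2 Difficulty 2: Misjudging W and X Register Aliases

#### Symptom

During assembly generation the same physical register may be referenced first by its 32-bit name and later by its 64-bit name, for example `str w8, [sp]` followed by `ldr x8, [sp]`. A plain string comparison, or a comparison of physical-register enum values, treats these as two unrelated registers.

#### Solution

`NormalizeReg` maps every 64-bit general-purpose register `X0`-`X28` to its 32-bit alias `W0`-`W28`. All alias checks and self-move elimination operate on the normalized registers.
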
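A minimal sketch of the two helpers follows. The enum layout is an assumption chosen for the example (W0..W30 numbered 0..30, X0..X30 numbered 31..61); the project's `PhysReg` enum may be laid out differently, and `Reg`, `Is64Bit`, `Normalize`, and `MatchSize` are illustrative names.

```cpp
#include <cassert>

// Hypothetical register numbering for this sketch only: W0..W30 take the
// values 0..30 and X0..X30 take 31..61.
enum class Reg : int { W0 = 0, W8 = 8, W9 = 9, X0 = 31, X8 = 39, X9 = 40 };

constexpr int kBankSize = 31;

constexpr bool Is64Bit(Reg r) { return static_cast<int>(r) >= kBankSize; }

// Canonical name used for alias comparisons: always the 32-bit spelling.
constexpr Reg Normalize(Reg r) {
    return Is64Bit(r) ? static_cast<Reg>(static_cast<int>(r) - kBankSize) : r;
}

// Re-spell `src` with the same width as `target`, so that a generated
// `mov target, src` uses operands of one size.
constexpr Reg MatchSize(Reg target, Reg src) {
    return static_cast<Reg>(static_cast<int>(Normalize(src)) +
                            (Is64Bit(target) ? kBankSize : 0));
}

static_assert(Normalize(Reg::X8) == Normalize(Reg::W8), "x8 and w8 alias");
```

Normalization only answers whether two names denote the same physical register. Whether a value stored with a 32-bit name can stand in for a later 64-bit load is a separate question that this sketch does not address.
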
## 5. Verification

After adding the `lab5` optimization pipeline, run:

```bash
./scripts/verify_asm.sh test/test_case/functional/95_float.sy --run
```

Exit code: `0`; the output matches the expected output exactly.

In addition, a regression run over all functional samples:

```bash
for f in test/test_case/functional/*.sy; do
    ./scripts/verify_asm.sh "$f" --run
done
```

The results show: **with the peephole pass enabled, every functional sample compiles to assembly, links, and runs successfully, and its exit status and standard output match the expected values.**

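## 6. Summary and Future Work

The backend peephole pass substantially reduces the redundant stack loads and stores and the self-copies in the emitted assembly, making the generated code more compact and more efficient.

Future work in Lab6 can build on this with more advanced front-end and mid-end loop optimizations such as loop-invariant code motion (LICM).
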
105
doc/Lab6-实验记录.md
Normal file
@@ -0,0 +1,105 @@
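# Lab6 Lab Notes: Loop Optimization (Loop-Invariant Code Motion, LICM)

## 1. Goals

The core goal of Lab6 is to implement loop optimizations for the loop structures in the control-flow graph, within the existing mid-end optimization framework.

The work completed in this lab focuses on:

- Identifying and extracting natural loops based on the dominator tree and the control-flow graph (CFG).
- Implementing a loop-invariant code motion (LICM) pass.
- Carefully deciding which instructions are loop-invariant (pure arithmetic, comparisons, GEP, type conversions, and so on) and hoisting them into the loop preheader in correct dependency order.
- Fixing a hang in the dominance-frontier computation `ComputeDF`, triggered by unreachable predecessor blocks that CFG optimizations temporarily leave behind.
- Verifying the correctness of the end-to-end compiler pipeline with the functional test cases.
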
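## 2. Scope of Code Changes

This lab touches the following modules:

- `include/ir/PassManager.h`: adds the declaration of the `RunLICM` pass.
- `src/ir/analysis/DominatorTree.cpp`: fixes the infinite loop in the dominance-frontier computation (`ComputeDF`) and makes it robust on disconnected graphs and on CFGs that contain temporary dead blocks.
- `src/ir/passes/CMakeLists.txt`: adds the new `LICM.cpp` translation unit to the `ir_passes` library.
- `src/ir/passes/PassManager.cpp`: integrates `RunLICM` into the iterative per-function optimization loop.
- `src/ir/passes/LICM.cpp`: new file implementing natural-loop detection, loop-block collection (`GetLoopBlocks`), and dependency-ordered hoisting of loop-invariant instructions.
- New document: `doc/Lab6-实验记录.md`.
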
## 3. Implementation Process

### 3.1 Locating and Fixing the Infinite Loop (Compiler Freeze)

Before the fix, the compiler hung completely on `95_float.sy` during the first iteration of `RunLICM`.
Analyzing a core dump and tracing the data flow showed that earlier CFG simplification (CFGSimplify) or dead-code elimination (DCE) runs can leave behind predecessor blocks that are temporarily disconnected or unreachable from the entry block.
When the dominator tree computes the dominance frontier in `ComputeDF` for such blocks, it hangs in the following loop:

```cpp
while (runner != idom_b) {
    ...
    runner = idom_[runner];
}
```

An unreachable block has no valid `idom`, so `idom_[runner]` yields a null value or points back to the block itself, and `runner` never reaches `idom_b`.

**Solution**:
The `ComputeDF` traversal in `src/ir/analysis/DominatorTree.cpp` was rewritten as:

```cpp
while (runner && runner != idom_b) {
    auto idom_it = idom_.find(runner);
    if (idom_it == idom_.end()) {
        break;  // stop at an unreachable predecessor
    }
    auto* next_runner = idom_it->second;
    if (next_runner == runner) {
        break;  // stop at the root node / a self-loop
    }
    ...
    runner = next_runner;
}
```

**Effect**:
With these guards the `ComputeDF` walk stops as soon as it meets an unreachable predecessor or the root, which removes the hang observed here. After the fix, `95_float.sy` and all test cases with complex control flow compile within milliseconds without hanging.

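To see the guards at work outside the compiler, the sketch below runs the same runner walk on a toy CFG. It is a simplified model under stated assumptions: blocks are plain integers, `ToyCfg` and `ComputeFrontiers` are hypothetical names, the entry block is its own immediate dominator, and unreachable blocks simply have no `idom` entry.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

using Block = int;  // hypothetical block id

struct ToyCfg {
    std::unordered_map<Block, std::vector<Block>> preds;
    // Immediate dominators; the entry maps to itself and unreachable
    // blocks have no entry at all.
    std::unordered_map<Block, Block> idom;
};

std::unordered_map<Block, std::vector<Block>> ComputeFrontiers(const ToyCfg& cfg) {
    std::unordered_map<Block, std::vector<Block>> df;
    for (const auto& entry : cfg.preds) {
        const Block b = entry.first;
        auto b_it = cfg.idom.find(b);
        if (b_it == cfg.idom.end()) continue;  // b itself is unreachable
        const Block idom_b = b_it->second;
        for (Block runner : entry.second) {
            while (runner != idom_b) {
                auto it = cfg.idom.find(runner);
                if (it == cfg.idom.end()) break;  // unreachable predecessor
                const Block next = it->second;
                if (next == runner) break;        // reached the root
                auto& frontier = df[runner];
                if (std::find(frontier.begin(), frontier.end(), b) == frontier.end()) {
                    frontier.push_back(b);
                }
                runner = next;
            }
        }
    }
    return df;
}
```

On a diamond `0 -> {1, 2} -> 3` with an extra unreachable predecessor `9 -> 3`, blocks 1 and 2 each get block 3 in their frontier, and the walk from block 9 ends at the first guard instead of chasing a missing `idom` entry.
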
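### 3.2 Design and Implementation of LICM

The main steps of LICM are:

1. **Natural loop discovery**:
   Scan every basic block in the CFG together with its successors. If an edge $B \to H$ exists such that $H$ dominates $B$, it is a back-edge and $H$ is the loop header.

2. **Collecting all blocks of the loop body (`GetLoopBlocks`)**:
   Starting from $B$, search along predecessor edges (DFS or BFS) and stop at the loop header $H$. The blocks reached, together with $H$, form the natural loop.

3. **Safety check for the hoisting location (preheader)**:
   Look for the unique predecessor of $H$ outside the loop body and use it as the preheader. Hoisting is performed only when exactly one outside predecessor exists.

4. **Ordered detection and extraction of invariant instructions**:
   - Invariance criterion: every operand of the instruction is a constant, is defined outside the loop, or is another instruction already marked loop-invariant.
   - Ordering requirement: so that no hoisted instruction uses an operand before it has been computed, invariant instructions are inserted into the preheader in data-dependency order, just before its terminator.
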
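Steps 2 and 3 can be sketched on a toy CFG. The code below is an illustration under stated assumptions, not the project's `GetLoopBlocks`: blocks are integers, the predecessor map `Preds` and the functions `CollectLoop` and `FindPreheader` are hypothetical names, the back-edge `latch -> header` is assumed to have been found already by the dominance test from step 1, and `-1` is used as a "no preheader" marker.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Block = int;  // hypothetical block id
using Preds = std::unordered_map<Block, std::vector<Block>>;

// Collects the natural loop of the back-edge latch -> header by walking
// predecessor edges from the latch; the header stops the walk because it
// is inserted into the set up front.
std::unordered_set<Block> CollectLoop(const Preds& preds, Block latch, Block header) {
    std::unordered_set<Block> loop;
    loop.insert(header);
    std::vector<Block> worklist;
    if (loop.insert(latch).second) worklist.push_back(latch);
    while (!worklist.empty()) {
        const Block cur = worklist.back();
        worklist.pop_back();
        auto it = preds.find(cur);
        if (it == preds.end()) continue;
        for (Block p : it->second) {
            if (loop.insert(p).second) worklist.push_back(p);
        }
    }
    return loop;
}

// Returns the unique predecessor of `header` outside the loop, or -1 when
// there is none or more than one.
Block FindPreheader(const Preds& preds, const std::unordered_set<Block>& loop,
                    Block header) {
    auto it = preds.find(header);
    if (it == preds.end()) return -1;
    Block candidate = -1;
    int outside = 0;
    for (Block p : it->second) {
        if (loop.count(p) == 0) {
            candidate = p;
            ++outside;
        }
    }
    return outside == 1 ? candidate : -1;
}
```

For the CFG `0 -> 1 -> 2 -> 1` with exit `1 -> 3`, the loop of back-edge `2 -> 1` is `{1, 2}` and block 0 is the preheader. Adding a second entry edge into block 1 makes `FindPreheader` report that no unique preheader exists, which corresponds to the case where the pass skips the loop.
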
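## 4. Key Difficulties and Solutions

### 4.1 Difficulty 1: Legality of Hoisting GEP and Other Multi-Operand Instructions

#### Symptom
The initial, simple LICM handled only unary and ordinary binary operations (such as `Add` and `Sub`). Real loops, however, contain many multi-dimensional array index computations (`GetElementPtr`) and type conversions (such as `ZExt` and `SIToFP`); ignoring them limits what LICM can hoist.

#### Solution
The set recognized by `IsPureHoistingCandidate` was widened to:

- Integer and floating-point arithmetic: `Add` / `Sub` / `Mul` / `FAdd` / `FSub` / `FMul` / `FDiv`.
- Comparisons and condition tests: all forms of `ICmp` / `FCmp`.
- Type conversions: `ZExt`, `SIToFP`, `FPToSI`.
- Address computation: `GEP` (GetElementPtr).
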
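Once the candidate set is wider, chains of dependent instructions (for example an address computation that uses a hoisted conversion) are picked up by repeating the scan until nothing new is found. The sketch below models that fixed-point scan under stated assumptions: `ToyInst` and `FindInvariants` are hypothetical names, values are integer ids, the `pure` flag stands in for the opcode whitelist, and any operand id not defined in the loop body counts as a constant or an outside definition.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

struct ToyInst {
    int id;                     // value defined by this instruction
    bool pure;                  // stands in for the opcode whitelist
    std::vector<int> operands;  // ids of the values it uses
};

// Returns the ids of loop-invariant instructions in an order in which
// they can be hoisted (every operand is available before its user).
std::vector<int> FindInvariants(const std::vector<ToyInst>& loop_body) {
    std::unordered_set<int> defined_in_loop;
    for (const ToyInst& inst : loop_body) defined_in_loop.insert(inst.id);

    std::unordered_set<int> invariant;
    std::vector<int> order;
    bool changed = true;
    while (changed) {  // repeat until a full scan marks nothing new
        changed = false;
        for (const ToyInst& inst : loop_body) {
            if (!inst.pure || invariant.count(inst.id) != 0) continue;
            bool all_invariant = true;
            for (int op : inst.operands) {
                // An operand blocks hoisting only if it is defined in the
                // loop and has not been marked invariant yet.
                if (defined_in_loop.count(op) != 0 && invariant.count(op) == 0) {
                    all_invariant = false;
                    break;
                }
            }
            if (all_invariant) {
                invariant.insert(inst.id);
                order.push_back(inst.id);
                changed = true;
            }
        }
    }
    return order;
}
```

Because an instruction is appended to `order` only after all of its in-loop operands are already there, moving instructions in that order keeps definitions ahead of uses even when the user appears earlier in the loop body than its operand.
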
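#### Effect
Loop bodies now evaluate fewer instructions per iteration. With GEP and type-conversion instructions hoisted, fewer temporaries are computed inside the loop body, which eases register allocation there (hoisted values stay live across the loop, so the net effect on register pressure depends on the loop).

## 5. Verification

Rebuild and run all backend assembly-generation and emulated-execution tests:

```bash
cmake --build build -j4
for f in test/test_case/functional/*.sy; do
    ./scripts/verify_asm.sh "$f" --run
done
```

The results show: **with LICM enabled in the optimization pipeline, all test samples pass; program output and exit codes match the expected values, and no regressions were observed.**

## 6. Summary and Takeaways

This lab fixed the infinite loop in the dominance-frontier computation on edge-case CFGs and implemented loop-invariant code motion. It completes the path from the compiler front end through mid-end optimization to backend assembly generation and meets the requirements of the compiler course labs.
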
@@ -39,6 +39,7 @@ bool RunConstFold(Function* func, Context& ctx);
 bool RunDCE(Function* func);
 bool RunCFGSimplify(Function* func);
 bool RunCSE(Function* func);
+bool RunLICM(Function* func);

 // Run the optimization pipeline on a Function or Module
 void RunOptimizationPasses(Module& module);

@@ -153,6 +153,7 @@ class MachineFunction {
 std::vector<std::unique_ptr<MachineFunction>> LowerToMIR(const ir::Module& module);
 void RunRegAlloc(MachineFunction& function);
 void RunFrameLowering(MachineFunction& function);
+void RunPeephole(MachineFunction& function);
 void PrintAsm(const MachineFunction& function, std::ostream& os);
 void PrintGlobals(const ir::Module& module, std::ostream& os);

@@ -103,7 +103,16 @@ void DominatorTree::ComputeIdom() {
 // Intersect
 auto* finger1 = pred;
 auto* finger2 = new_idom;
+int finger_iter = 0;
 while (finger1 != finger2) {
+    finger_iter++;
+    if (finger_iter > 1000) {
+        std::cerr << "FATAL: DominatorTree finger loop stuck! b=" << b->GetName()
+                  << " pred=" << pred->GetName()
+                  << " finger1=" << finger1->GetName()
+                  << " finger2=" << finger2->GetName() << std::endl;
+        std::abort();
+    }
     while (rpo_index.at(finger1) > rpo_index.at(finger2)) {
         finger1 = idom_.at(finger1);
     }
@@ -147,13 +156,21 @@ void DominatorTree::ComputeDF() {
     for (auto* pred : b->GetPredecessors()) {
         auto* runner = pred;
         auto* idom_b = idom_[b];
-        while (runner != idom_b) {
-            // If runner's df doesn't contain b already, add it
+        while (runner && runner != idom_b) {
+            auto idom_it = idom_.find(runner);
+            if (idom_it == idom_.end()) {
+                break; // Unreachable predecessor
+            }
+            auto* next_runner = idom_it->second;
+            if (next_runner == runner) {
+                break; // Reached root / entry
+            }
+
             auto& runner_df = df_[runner];
             if (std::find(runner_df.begin(), runner_df.end(), b) == runner_df.end()) {
                 runner_df.push_back(b);
             }
-            runner = idom_[runner];
+            runner = next_runner;
         }
     }
 }

@@ -6,6 +6,7 @@ add_library(ir_passes STATIC
     CSE.cpp
     DCE.cpp
     CFGSimplify.cpp
+    LICM.cpp
 )

 target_link_libraries(ir_passes PUBLIC

198
src/ir/passes/LICM.cpp
Normal file
@@ -0,0 +1,198 @@
#include "ir/PassManager.h"
#include <unordered_set>
#include <unordered_map>
#include <vector>
#include <algorithm>
#include <iostream>

namespace ir {

namespace {

// Helper to perform DFS and gather all blocks in a natural loop
std::unordered_set<BasicBlock*> GetLoopBlocks(BasicBlock* B, BasicBlock* H) {
    std::unordered_set<BasicBlock*> loop;
    std::vector<BasicBlock*> worklist;

    loop.insert(H);
    if (B != H) {
        loop.insert(B);
        worklist.push_back(B);
    }

    while (!worklist.empty()) {
        auto* curr = worklist.back();
        worklist.pop_back();
        for (auto* pred : curr->GetPredecessors()) {
            if (loop.find(pred) == loop.end()) {
                loop.insert(pred);
                worklist.push_back(pred);
            }
        }
    }
    return loop;
}

// Check if an opcode is a pure hoisting candidate (pure arithmetic, comparisons, GEP, casts)
bool IsPureHoistingCandidate(Opcode op) {
    switch (op) {
        case Opcode::Add:
        case Opcode::Sub:
        case Opcode::Mul:
        case Opcode::ICmpEQ:
        case Opcode::ICmpNE:
        case Opcode::ICmpLT:
        case Opcode::ICmpGT:
        case Opcode::ICmpLE:
        case Opcode::ICmpGE:
        case Opcode::FAdd:
        case Opcode::FSub:
        case Opcode::FMul:
        case Opcode::FDiv:
        case Opcode::FCmpEQ:
        case Opcode::FCmpNE:
        case Opcode::FCmpLT:
        case Opcode::FCmpGT:
        case Opcode::FCmpLE:
        case Opcode::FCmpGE:
        case Opcode::ZExt:
        case Opcode::SIToFP:
        case Opcode::FPToSI:
        case Opcode::GEP:
            return true;
        default:
            return false;
    }
}

} // namespace

bool RunLICM(Function* func) {
    bool changed = false;

    // 1. Run DominatorTree Analysis
    DominatorTree dom_tree(func);
    dom_tree.Run();

    // 2. Identify natural loops by scanning for back-edges
    // Back-edge is B -> H where H dominates B.
    std::unordered_map<BasicBlock*, std::unordered_set<BasicBlock*>> loops;
    for (const auto& bbPtr : func->GetBlocks()) {
        auto* B = bbPtr.get();
        for (auto* H : B->GetSuccessors()) {
            if (dom_tree.Dominates(H, B)) {
                // Found back-edge B -> H, merge loop blocks
                auto loop_blocks = GetLoopBlocks(B, H);
                loops[H].insert(loop_blocks.begin(), loop_blocks.end());
            }
        }
    }

    // 3. Optimize each identified loop
    for (auto& pair : loops) {
        BasicBlock* H = pair.first;
        const auto& loop_blocks = pair.second;

        // A preheader is the single predecessor of H outside the loop
        BasicBlock* preheader = nullptr;
        int num_outside_preds = 0;
        for (auto* pred : H->GetPredecessors()) {
            if (loop_blocks.find(pred) == loop_blocks.end()) {
                preheader = pred;
                num_outside_preds++;
            }
        }

        // Hoist only if there is exactly one outside predecessor (which is the preheader)
        if (num_outside_preds != 1 || !preheader) {
            continue;
        }

        // Identify loop-invariant instructions
        std::unordered_set<Instruction*> invariant_insts;
        std::vector<Instruction*> invariant_order;
        bool local_changed = true;
        while (local_changed) {
            local_changed = false;

            for (auto* bb : loop_blocks) {
                for (const auto& instPtr : bb->GetInstructions()) {
                    auto* inst = instPtr.get();

                    if (invariant_insts.find(inst) != invariant_insts.end()) {
                        continue; // Already identified
                    }

                    if (!IsPureHoistingCandidate(inst->GetOpcode())) {
                        continue; // Cannot hoist impure instructions (load, store, call, branch)
                    }

                    // Check if all operands are loop-invariant
                    bool all_ops_invariant = true;
                    for (size_t i = 0; i < inst->GetNumOperands(); ++i) {
                        auto* op = inst->GetOperand(i);

                        // Constants are invariant
                        if (dynamic_cast<ConstantValue*>(op)) {
                            continue;
                        }

                        // Values defined outside the loop are invariant
                        if (auto* op_inst = dynamic_cast<Instruction*>(op)) {
                            if (loop_blocks.find(op_inst->GetParent()) == loop_blocks.end()) {
                                continue;
                            }
                            // If defined inside the loop, must be already marked invariant
                            if (invariant_insts.find(op_inst) != invariant_insts.end()) {
                                continue;
                            }
                        } else {
                            // Arguments and Globals are always defined outside the loop
                            continue;
                        }

                        all_ops_invariant = false;
                        break;
                    }

                    if (all_ops_invariant) {
                        invariant_insts.insert(inst);
                        invariant_order.push_back(inst);
                        local_changed = true;
                        changed = true;
                    }
                }
            }
        }

        // Hoist the loop-invariant instructions into the preheader (in dependency order)
        for (auto* inst : invariant_order) {
            auto& source_insts = const_cast<std::vector<std::unique_ptr<Instruction>>&>(inst->GetParent()->GetInstructions());
            auto& preheader_insts = const_cast<std::vector<std::unique_ptr<Instruction>>&>(preheader->GetInstructions());

            std::unique_ptr<Instruction> moved_inst;
            for (auto it = source_insts.begin(); it != source_insts.end(); ++it) {
                if (it->get() == inst) {
                    moved_inst = std::move(*it);
                    source_insts.erase(it);
                    break;
                }
            }

            if (moved_inst) {
                moved_inst->SetParent(preheader);
                // Insert right before the terminator branch instruction of the preheader block
                if (!preheader_insts.empty() && preheader->HasTerminator()) {
                    auto* term = preheader_insts.back().get();
                    preheader->InsertInstructionBefore(std::move(moved_inst), term);
                } else {
                    preheader_insts.push_back(std::move(moved_inst));
                }
            }
        }
    }

    return changed;
}

} // namespace ir
@@ -4,13 +4,11 @@
 namespace ir {

 void RunFunctionOptimizationPasses(Function* func, Context& ctx) {
-    // 1. Promote memory-based local variables to SSA form using Mem2Reg
     RunMem2Reg(func, ctx);

-    // 2. Run scalar optimizations iteratively until convergence (no changes observed)
     bool changed = true;
     int iterations = 0;
-    const int max_iterations = 16; // Safe limit to prevent compile-time infinite loops
+    const int max_iterations = 16;

     while (changed && iterations < max_iterations) {
         changed = false;
@@ -19,6 +17,7 @@ void RunFunctionOptimizationPasses(Function* func, Context& ctx) {
         changed |= RunConstProp(func, ctx);
         changed |= RunConstFold(func, ctx);
         changed |= RunCSE(func);
+        changed |= RunLICM(func);
         changed |= RunDCE(func);
         changed |= RunCFGSimplify(func);
     }

@@ -53,6 +53,7 @@ int main(int argc, char** argv) {
     for (auto& machine_func : machine_funcs) {
         mir::RunRegAlloc(*machine_func);
         mir::RunFrameLowering(*machine_func);
+        mir::RunPeephole(*machine_func);
         if (need_blank_line) {
             std::cout << "\n";
         }

@@ -1,4 +1,185 @@
// Peephole optimization:
// - Remove redundant moves and merge common instruction patterns
// - Improve the quality of the final assembly (scoped to what is implemented)
#include "mir/MIR.h"
#include <unordered_map>
#include <vector>

namespace mir {

namespace {

PhysReg NormalizeReg(PhysReg reg) {
    int r = static_cast<int>(reg);
    // Map 64-bit X0-X28 registers to 32-bit W0-W28 registers to handle aliasing
    if (r >= static_cast<int>(PhysReg::X0) && r <= static_cast<int>(PhysReg::X28)) {
        return static_cast<PhysReg>(r - static_cast<int>(PhysReg::X0) + static_cast<int>(PhysReg::W0));
    }
    return reg;
}

PhysReg MatchRegSize(PhysReg target, PhysReg src) {
    int t = static_cast<int>(target);
    int s = static_cast<int>(src);

    bool target_is_64 = (t >= static_cast<int>(PhysReg::X0) && t <= static_cast<int>(PhysReg::X28)) ||
                        t == static_cast<int>(PhysReg::X29) ||
                        t == static_cast<int>(PhysReg::X30) ||
                        t == static_cast<int>(PhysReg::SP);

    bool src_is_64 = (s >= static_cast<int>(PhysReg::X0) && s <= static_cast<int>(PhysReg::X28)) ||
                     s == static_cast<int>(PhysReg::X29) ||
                     s == static_cast<int>(PhysReg::X30) ||
                     s == static_cast<int>(PhysReg::SP);

    if (target_is_64 && !src_is_64) {
        if (s >= static_cast<int>(PhysReg::W0) && s <= static_cast<int>(PhysReg::W28)) {
            return static_cast<PhysReg>(s - static_cast<int>(PhysReg::W0) + static_cast<int>(PhysReg::X0));
        }
    } else if (!target_is_64 && src_is_64) {
        if (s >= static_cast<int>(PhysReg::X0) && s <= static_cast<int>(PhysReg::X28)) {
            return static_cast<PhysReg>(s - static_cast<int>(PhysReg::X0) + static_cast<int>(PhysReg::W0));
        }
    }
    return src;
}

bool IsFloatReg(PhysReg reg) {
    return reg >= PhysReg::S0 && reg <= PhysReg::S15;
}

} // namespace

void RunPeephole(MachineFunction& function) {
    for (auto& block : function.GetBlocks()) {
        auto& insts = block.GetInstructions();
        std::vector<MachineInstr> optimized;

        // Map from FrameIndex to the normalized physical register that currently holds its value
        std::unordered_map<int, PhysReg> slot_to_reg;

        for (const auto& inst : insts) {
            Opcode op = inst.GetOpcode();
            const auto& ops = inst.GetOperands();

            // 1. Handle register move elimination (e.g. mov w8, w8)
            if (op == Opcode::MovReg) {
                if (NormalizeReg(ops.at(0).GetReg()) == NormalizeReg(ops.at(1).GetReg())) {
                    continue; // Delete redundant self-moves
                }
            }

            // 2. Handle redundant Load after Store
            if (op == Opcode::LoadStack) {
                int fi = ops.at(1).GetFrameIndex();
                auto it = slot_to_reg.find(fi);
                if (it != slot_to_reg.end()) {
                    PhysReg source_reg = it->second;
                    PhysReg dest_reg = NormalizeReg(ops.at(0).GetReg());
                    if (source_reg == dest_reg) {
                        // Loading the same register that already has the value - completely redundant!
                        continue;
                    } else {
                        // Replace LoadStack dest_reg, fi with MovReg dest_reg, matched_source
                        PhysReg matched_source = MatchRegSize(ops.at(0).GetReg(), it->second);
                        optimized.push_back(MachineInstr(Opcode::MovReg, {Operand::Reg(ops.at(0).GetReg()), Operand::Reg(matched_source)}));

                        // Invalidate any other slots mapping to dest_reg because dest_reg is written
                        std::vector<int> to_remove;
                        for (const auto& pair : slot_to_reg) {
                            if (NormalizeReg(pair.second) == dest_reg) {
                                to_remove.push_back(pair.first);
                            }
                        }
                        for (int key : to_remove) {
                            slot_to_reg.erase(key);
                        }

                        // Add new mapping (normalized)
                        slot_to_reg[fi] = dest_reg;
                        continue;
                    }
                }
            }

            // 3. Track stores
            if (op == Opcode::StoreStack) {
                PhysReg src = NormalizeReg(ops.at(0).GetReg());
                int fi = ops.at(1).GetFrameIndex();
                slot_to_reg[fi] = src;
            }

            // 4. Invalidate register mappings on writes
            bool writes_reg = false;
            PhysReg written_reg = PhysReg::W0; // dummy

            switch (op) {
                case Opcode::MovImm:
                    if (!ops.empty() && ops.at(0).GetKind() == Operand::Kind::Reg) {
                        writes_reg = true;
                        written_reg = NormalizeReg(ops.at(0).GetReg());

                        // Under the hood, MovImm to a float register implicitly writes to x8/w8
                        if (IsFloatReg(ops.at(0).GetReg())) {
                            PhysReg implicitly_written = NormalizeReg(PhysReg::X8);
                            std::vector<int> to_remove;
                            for (const auto& pair : slot_to_reg) {
                                if (NormalizeReg(pair.second) == implicitly_written) {
                                    to_remove.push_back(pair.first);
                                }
                            }
                            for (int key : to_remove) {
                                slot_to_reg.erase(key);
                            }
                        }
                    }
                    break;
                case Opcode::LoadStack:
                case Opcode::AddRR:
                case Opcode::SubRR:
                case Opcode::MulRR:
                case Opcode::SDivRR:
                case Opcode::MSubRRRR:
                case Opcode::FAddRRR:
                case Opcode::FSubRRR:
                case Opcode::FMulRRR:
                case Opcode::FDivRRR:
                case Opcode::Cset:
                case Opcode::MovReg:
                case Opcode::Adrp:
                case Opcode::AddRegImm:
                case Opcode::LdrRegReg:
                case Opcode::SIToFP:
                case Opcode::FPToSI:
                case Opcode::ZExt:
                    if (!ops.empty() && ops.at(0).GetKind() == Operand::Kind::Reg) {
                        writes_reg = true;
                        written_reg = NormalizeReg(ops.at(0).GetReg());
                    }
                    break;
                case Opcode::Call:
                    // A function call destroys all temporary/scratch registers.
                    slot_to_reg.clear();
                    break;
                default:
                    break;
            }

            if (writes_reg) {
                // Remove any slot mapping to this register
                std::vector<int> to_remove;
                for (const auto& pair : slot_to_reg) {
                    if (NormalizeReg(pair.second) == written_reg) {
                        to_remove.push_back(pair.first);
                    }
                }
                for (int key : to_remove) {
                    slot_to_reg.erase(key);
                }
            }

            optimized.push_back(inst);
        }

        insts = std::move(optimized);
    }
}

} // namespace mir