SSA构建和销毁
phinode的创建和消除
1. SSA构建算法
依赖:
- dom tree
- DF
- llvm ir基础知识
对以下c代码
int
clang会生成类似这种IR
define i32 @
我们能看到llvm采用了alloc+load+store指令来实现c中局部变量的存放。
这里参数处理i32 %a; store %a, %pa
实际上是个比较复杂的话题,按下不表。
这段IR的有很多冗余的load/store, 有没有什么办法消除掉呢?
SROA(mem2reg)!
预期的IR
define i32 @
如果控制流比较复杂呢?
int b = a;
ifelse
return b;
=>
ir
bb0:
%pb = alloc i32
store %a, %pb
%c = CMP_GT %a, 10
br %c, %bb1, %bb2
bb1:
%b0 = load %pb
%b1 = add %b0, %a
store %b1, %pb
br %bb4
bb3:
%b2 = load %pb
%b3 = mul %b2, %a
store %b3, %pb
br %bb4
bb4:
%b4 = load %pb
ret %b4
这里我们发现,如果想要正确删除掉这些load/store,我们需要引入phinode
,而且必须拿到控制流信息。phinode是个逻辑节点,在实际的CPU上没对应的指令/寄存器表示。所以我们在生成实际的汇编前需要再删除掉phi。
接下来看如何实现SSA构建:
原理分为两步, insertPHI, Rename, 伪代码如下:
procedure for each variable v in variable_list
WorkList ← ∅
EverOnWorkList ← ∅
AlreadyHasPhiFunc ← ∅
for each node n containing an assignment to v
WorkList ← WorkList ∪
end
EverOnWorkList ← WorkList
while WorkList ≠ ∅
Remove some node n from for each d ∈
if d ∉ AlreadyHasPhiFunc
Insert a φ-function for v at d
AlreadyHasPhiFunc ← AlreadyHasPhiFunc ∪
if d ∉ EverOnWorkList
WorkList ← WorkList ∪
EverOnWorkList ← EverOnWorkList ∪
end
end
end
end procedure = new
vn Push vn onto Stacks
return v
end procedure if b previously visited return
for each φ-function p in b
v =
vn = and replace v with vn
end
for each statement s in for each variable v ∈
replace v by end
for each variable v ∈
vn = and replace v with vn
end
for each s ∈
j ← position in s’s φ-function corresponding to block b
for each φ-function p in s
replace the jth operand of by
end
end
for each s ∈
for each φ-function or statement t in b
for each vi ∈
end
end SSAConstruction:
end
我们看到这种伪代码的形式和LLVMIR区别很大。因为IR格式不一样,接下来看llvm ir下如何实现mem2reg
当然,以上代码不是llvm mem2reg实际实现代码。
llvm实现做了一些优化,domtree遍历优化,特殊情况处理等。
SSA destruction
如何销毁SSA形式,删除phinode。
llvm的phi-elimination是在mir阶段发生。
一般来说,SSA destruction是将phinode替换为copy指令。
- 暴力做法,对于
p = phi (%a1, %bb1), (%a2, %bb2) ....
,在bb1,bb2... 尾部插入p = copy %a...
然后删除掉phi节点。很不幸,改算法是错误的。 考虑到lost copy和swap problem:
ref *Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency
我们的暴力算法会生成
swap problem:
b0:
a1 = ...
b1 = ...
a2 = copy a1
b2 = copy b1
jump b1:
a2 = copy b2 <--- 没有swap
b2 = copy a2
if p jump b1
jump b2:
...
lost-copy-problem
b0:
x1 = ...
x2 = copy x1
jump b1:
x3 = x2 + 1
x2 = copy x3
if p jump b1
jump b2:
... = x2 <---- 这里错误
- split critical edges算法
图中红边就是critical edge:
swap problem结果看起来不太对哦。
看起来是插入copy出了问题, 再看一眼
a2=phi(a1, b2);b2=phi(b1,a2)
是个环。。。
我们发现swap problem中
- a2和b2的生命周期重叠了
- 并且插入copy方式也需要改进
- isolating phi + parallel copy
先介绍几个概念
- CSSA(Conventional SSA) form is defined as SSA form for which each phi-web is interference-free.
- TSSA(Transformed SSA) form is non-conventional SSA, may have phi-web is not interference-free.
swap problem和lost-copy problem中 都是TSSA,而不是CSSA。 如何将TSSA转换为CSSA?并且在CSSA中插入copy是否还需要注意类似swapproblem的情况?
所以我们接下来的算法:
- Insert parallel copies for all φ-functions (TSSA => CSSA)
- eliminate phis in CSSA
- Sequentialize parallel copies, possibly with one more variable and some additional copies
- some optimization
CSSA形式
这代码逻辑看起来是没问题了,就是代码质量堪忧。。。
消除phi节点后:
上一小节提到的swap中copy的问题仍然存在,所以接下来要介绍,parallel copies的概念。
我们将这些插入的copy指令视为parallel copies,然后采用算法求解copy插入顺序
ref Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency 不幸的是,这个算法有点小问题。 实际上是个图遍历算法:cc09.pdf
我们用python写个demo:
=
=
return %
=
=
=
=
=
= None
=
=
= None
= None
=
=
=
=
=
=
=
=
=
=
=
=
# <<<<< 这里
=
=
运行脚本,啥也不输出。 但是如果我们将
b==l
改成b!=l
, 就会输出 emit copy tmp<- a emit copy a <- b emit copy b <- tmp
至此,我们完成了phi-elimination的一半,合法化问题解决了。但是性能问题没解决。
优化:TODO
我们重新关注下parallel copies 。
对于
bb1:
a1 = ...
b1 = ...
c1 = ...
jump bb2:
a2 = ...
b2 = ...
c2 = ...
jump bb3:
a3 = ...
b3 = ...
c3 = ...
jump bb:
a =
b =
c =
注意
\\
可能会被识别为\
, 用\newline
更好
$\begin{bmatrix} a \newline b \newline c \end{bmatrix} = Φ \begin{bmatrix} a1 & a2 & a3 \newline b1 & b2 & b3 \newline c1 &c2 & c3 \end{bmatrix} $
那么对于bb1来说,就有parallel copies: $a \gets a1, b \gets b1, c \gets c1$。 其对应的Location Transfer Graph为$G = (V, E), V = \lbrace a,b,c, a1, b1, c1\rbrace, E = \lbrace a \gets a1, b \gets b1, c \gets c1 \rbrace$
再看下swap problem里面的
a2 =
b2 =
- 前驱1的parallel copies
flowchart TD
a1 --> a2
b1 --> b2
- 前驱2的parallel copies
flowchart TD
a2 --> b2
b2 --> a2
参考
- https://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf
- https://ics.uci.edu/~yeouln/course/ssa.pdf
- *Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency
- cc09.pdf