Discussion:
[Open64-devel] one more question about strength reduction and SSA PRE
Yiran Wang
2013-06-27 00:07:52 UTC
Permalink
Hi All,

This one looks somewhat similar to the last example, but is different.

int foo(int N, int j, int *x, int *z)
{
int y = N;
N += 7;
N >>= 3;
int i;
for(i = 0; i< j; i++)
{
x += N*N << 3;
z = x + N;
y = y + *x + *z;
}
return y;
}

Assembly of the loop at -O3.
.p2align 4,,15
.Lt_0_3586:
#<loop> Loop body line 7, nesting depth: 1, estimated iterations: 1000
.loc 1 9 0
# 8 {
# 9 x += N*N << 3;
movl %eax,%ebx # [0]
.loc 1 11 0
# 10 z = x + N;
# 11 y = y + *x + *z;
addl $1,%ebp # [0]
.loc 1 9 0
imull %eax,%ebx # [1]
shll $3,%ebx # [4]
shll $2,%ebx # [5]
addl %ebx,%edi # [6]
addl %ebx,%esi # [6]
.loc 1 11 0
movl 0(%edi),%ecx # [7] id:23
addl 0(%esi),%ecx # [10]
addl %ecx,%edx # [13]
cmpl 36(%esp),%ebp # [13] j
jl .Lt_0_3586 # [16]

As we see, the imul instruction remains in the loop.
(and two consequent shll instructions, my guess is that CG is thinking
there should not be such input from WOPT, so it is not optimized in CG,
though it is simple. )

It looks like SSA PRE omitted the rhs of Iv_update statement x+= N*N<<3,
and VNFRE is only doing one level of CSE, say, promoting the ASHR + LDC 3
out of the loop.

I am curious why SSA PRE is omitting the expression here. By disabling
this in opt_etable.cxx, the result looks good for this test case. I wonder
if there is any correctness issue for some other test case, or performance
issue?

It should be noted one strength reduction transformation is done for z for
this case. Also replacing "N>>=3;" with "N*=5;" results in similar
sub-optimal code.

Best Regards,
Yiran Wang
Jian-Xin Lai
2013-06-27 07:26:02 UTC
Permalink
Yiran Wang
2013-06-27 17:14:17 UTC
Permalink
Thanks for your comments.

No, the assembly looks the same.

Good or bad, the compiler is able to clean up the temporary completely,
say, copy propagation and DSE.

Regards,
Yiran



On Thu, Jun 27, 2013 at 12:26 AM, Jian-Xin Lai <***@gmail.com> wrote:
Sun Chan
2013-06-27 18:32:45 UTC
Permalink
is N global?
Post by Yiran Wang
Thanks for your comments.
No, the assembly looks the same.
Good or bad, the compiler is able to clean up the temporary completely,
say, copy propagation and DSE.
Regards,
Yiran
Sun Chan
2013-06-27 18:33:32 UTC
Permalink
forget the stupid question. Something else is wrong
Sun
Post by Sun Chan
is N global?
Post by Yiran Wang
Thanks for your comments.
No, the assembly looks the same.
Good or bad, the compiler is able to clean up the temporary completely,
say, copy propagation and DSE.
Regards,
Yiran
Jian-Xin Lai
2013-07-01 08:44:04 UTC
Permalink
Using option "-WOPT:icopy=off:copy=off" to disable all copy-propagation?
Post by Yiran Wang
Thanks for your comments.
No, the assembly looks the same.
Good or bad, the compiler is able to clean up the temporary completely,
say, copy propagation and DSE.
Regards,
Yiran
Fred Chow
2013-07-02 12:29:41 UTC
Permalink
Hi Yiran,

The reason is because PRE is not applied to the increment amount of an
IV update statement. In your example,

x += N*N << 3;

is an IV update statement. If PRE is applied, the rhs of the IV update
statement may be transformed so that the statement is no longer an IV
update statement, which may in turn disable other IV-related optimizations.

If you change the above statement to, say:

x = z + (N*N << 3);

(N*N << 3) will then be hoisted out of the loop because it is no longer
an IV update statement.

You can grep for "Set_omitted()" in opt_etable.cxx and see that
"occur->Stmt()->Iv_update()" is one reason an expression is set omitted.

Fred
Post by Yiran Wang
Hi All,
This one looks somewhat similar to the last example, but is different.
int foo(int N, int j, int *x, int *z)
{
int y = N;
N += 7;
N >>= 3;
int i;
for(i = 0; i< j; i++)
{
x += N*N << 3;
z = x + N;
y = y + *x + *z;
}
return y;
}
Assembly of the loop at -O3.
.p2align 4,,15
#<loop> Loop body line 7, nesting depth: 1, estimated iterations: 1000
.loc190
# 8 {
# 9 x += N*N << 3;
movl %eax,%ebx # [0]
.loc1110
# 10 z = x + N;
# 11 y = y + *x + *z;
addl $1,%ebp # [0]
.loc190
imull %eax,%ebx # [1]
shll $3,%ebx # [4]
shll $2,%ebx # [5]
addl %ebx,%edi # [6]
addl %ebx,%esi # [6]
.loc1110
movl 0(%edi),%ecx # [7] id:23
addl 0(%esi),%ecx # [10]
addl %ecx,%edx # [13]
cmpl 36(%esp),%ebp # [13] j
jl .Lt_0_3586 # [16]
As we see, the imul instruction remains in the loop.
(and two consequent shll instructions, my guess is that CG is thinking
there should not be such input from WOPT, so it is not optimized in
CG, though it is simple. )
It looks like SSA PRE omitted the rhs of Iv_update statement x+=
N*N<<3, and VNFRE is only doing one level of CSE, say, promoting the
ASHR + LDC 3 out of the loop.
I am curious why SSA PRE is omitting the expression here. By
disabling this in opt_etable.cxx, the result looks good for this test
case. I wonder if there is any correctness issue for some other test
case, or performance issue?
It should be noted one strength reduction transformation is done for z
for this case. Also replacing "N>>=3;" with "N*=5;" results in similar
sub-optimal code.
Best Regards,
Yiran Wang
------------------------------------------------------------------------------
Build for Windows Store.
http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Open64-devel mailing list
https://lists.sourceforge.net/lists/listinfo/open64-devel
Yiran Wang
2013-07-02 20:20:17 UTC
Permalink
Hi Fred,

Thanks for your reply, and one more question.
What kind of IV related optimizations are we talking about here?

Actually I have tried to disable that condition in opt_etable.cxx before,
and for this specific case, the output looks good. The update code of IV
"x" and "z" looks efficient.

.Lt_0_2050:
#<loop> Loop body line 3, nesting depth: 1, estimated iterations: 100
.loc 1 9 0
# 5 N *= 5;
# 6 int i;
# 7 for(i = 0; i< j; i++)
# 8 {
# 9 x += N * N << 3;
addl %eax,%edi # [0]
addl %eax,%ebp # [0]
.loc 1 7 0
addl $1,%esi # [0]
.loc 1 11 0
# 10 z = x + N;
# 11 y = y + *x + *z;
addl 0(%edi),%edx # [1]
addl 140(%ebp),%edx # [4]
.loc 1 7 0
movl 36(%esp),%ebx # [4] j
cmpl %esi,%ebx # [7]
jg .Lt_0_2050 # [8]

As we can see, the gain is noticeable.

It looks like IVR created the new IV, and not modified "x" or "z", and
strength reduction works fine.

Best Regards,
Yiran
Post by Fred Chow
Hi Yiran,
The reason is because PRE is not applied to the increment amount of an IV
update statement. In your example,
x += N*N << 3;
is an IV update statement. If PRE is applied, the rhs of the IV update
statement may be transformed so that the statement is no longer an IV
update statement, which may in turn disable other IV-related optimizations.
x = z + (N*N << 3);
(N*N << 3) will then be hoisted out of the loop because it is no longer an
IV update statement.
You can grep for "Set_omitted()" in opt_etable.cxx and see that
"occur->Stmt()->Iv_update()" is one reason an expression is set omitted.
Fred
Hi All,
This one looks somewhat similar to the last example, but is different.
int foo(int N, int j, int *x, int *z)
{
int y = N;
N += 7;
N >>= 3;
int i;
for(i = 0; i< j; i++)
{
x += N*N << 3;
z = x + N;
y = y + *x + *z;
}
return y;
}
Assembly of the loop at -O3.
.p2align 4,,15
#<loop> Loop body line 7, nesting depth: 1, estimated iterations: 1000
.loc 1 9 0
# 8 {
# 9 x += N*N << 3;
movl %eax,%ebx # [0]
.loc 1 11 0
# 10 z = x + N;
# 11 y = y + *x + *z;
addl $1,%ebp # [0]
.loc 1 9 0
imull %eax,%ebx # [1]
shll $3,%ebx # [4]
shll $2,%ebx # [5]
addl %ebx,%edi # [6]
addl %ebx,%esi # [6]
.loc 1 11 0
movl 0(%edi),%ecx # [7] id:23
addl 0(%esi),%ecx # [10]
addl %ecx,%edx # [13]
cmpl 36(%esp),%ebp # [13] j
jl .Lt_0_3586 # [16]
As we see, the imul instruction remains in the loop.
(and two consequent shll instructions, my guess is that CG is thinking
there should not be such input from WOPT, so it is not optimized in CG,
though it is simple. )
It looks like SSA PRE omitted the rhs of Iv_update statement x+= N*N<<3,
and VNFRE is only doing one level of CSE, say, promoting the ASHR + LDC 3
out of the loop.
I am curious why SSA PRE is omitting the expression here. By disabling
this in opt_etable.cxx, the result looks good for this test case. I wonder
if there is any correctness issue for some other test case, or performance
issue?
It should be noted one strength reduction transformation is done for z
for this case. Also replacing "N>>=3;" with "N*=5;" results in similar
sub-optimal code.
Best Regards,
Yiran Wang
------------------------------------------------------------------------------
Build for Windows Store.
http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Fred Chow
2013-07-03 11:25:42 UTC
Permalink
IV-related optimizations are strength reduction and linear function test
replacement.

For example, in the loop body:

. . i + 1 . .
i = i + 1;

If we do PRE on the IV update statement, it will change i = i + 1 to i =
t, and strength reduction will not work.

Fred
Post by Yiran Wang
Hi Fred,
Thanks for your reply, and one more question.
What kind of IV related optimizations are we talking about here?
Actually I have tried to disable that condition in opt_etable.cxx
before, and for this specific case, the output looks good. The update
code of IV "x" and "z" looks efficient.
#<loop> Loop body line 3, nesting depth: 1, estimated iterations: 100
.loc190
# 5 N *= 5;
# 6 int i;
# 7 for(i = 0; i< j; i++)
# 8 {
# 9 x += N * N << 3;
addl %eax,%edi # [0]
addl %eax,%ebp # [0]
.loc170
addl $1,%esi # [0]
.loc1110
# 10 z = x + N;
# 11 y = y + *x + *z;
addl 0(%edi),%edx # [1]
addl 140(%ebp),%edx # [4]
.loc170
movl 36(%esp),%ebx # [4] j
cmpl %esi,%ebx # [7]
jg .Lt_0_2050 # [8]
As we can see, the gain is noticeable.
It looks like IVR created the new IV, and not modified "x" or "z", and
strength reduction works fine.
Best Regards,
Yiran
Hi Yiran,
The reason is because PRE is not applied to the increment amount
of an IV update statement. In your example,
x += N*N << 3;
is an IV update statement. If PRE is applied, the rhs of the IV
update statement may be transformed so that the statement is no
longer an IV update statement, which may in turn disable other
IV-related optimizations.
x = z + (N*N << 3);
(N*N << 3) will then be hoisted out of the loop because it is no
longer an IV update statement.
You can grep for "Set_omitted()" in opt_etable.cxx and see that
"occur->Stmt()->Iv_update()" is one reason an expression is set omitted.
Fred
Post by Yiran Wang
Hi All,
This one looks somewhat similar to the last example, but is different.
int foo(int N, int j, int *x, int *z)
{
int y = N;
N += 7;
N >>= 3;
int i;
for(i = 0; i< j; i++)
{
x += N*N << 3;
z = x + N;
y = y + *x + *z;
}
return y;
}
Assembly of the loop at -O3.
.p2align 4,,15
#<loop> Loop body line 7, nesting depth: 1, estimated
iterations: 1000
.loc190
# 8 {
# 9 x += N*N << 3;
movl %eax,%ebx # [0]
.loc1110
# 10 z = x + N;
# 11 y = y + *x + *z;
addl $1,%ebp # [0]
.loc190
imull %eax,%ebx # [1]
shll $3,%ebx # [4]
shll $2,%ebx # [5]
addl %ebx,%edi # [6]
addl %ebx,%esi # [6]
.loc1110
movl 0(%edi),%ecx # [7] id:23
addl 0(%esi),%ecx # [10]
addl %ecx,%edx # [13]
cmpl 36(%esp),%ebp # [13] j
jl .Lt_0_3586 # [16]
As we see, the imul instruction remains in the loop.
(and two consequent shll instructions, my guess is that CG is
thinking there should not be such input from WOPT, so it is not
optimized in CG, though it is simple. )
It looks like SSA PRE omitted the rhs of Iv_update statement x+=
N*N<<3, and VNFRE is only doing one level of CSE, say, promoting
the ASHR + LDC 3 out of the loop.
I am curious why SSA PRE is omitting the expression here. By
disabling this in opt_etable.cxx, the result looks good for this
test case. I wonder if there is any correctness issue for some
other test case, or performance issue?
It should be noted one strength reduction transformation is done
for z for this case. Also replacing "N>>=3;" with "N*=5;" results
in similar sub-optimal code.
Best Regards,
Yiran Wang
------------------------------------------------------------------------------
Build for Windows Store.
http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Open64-devel mailing list
https://lists.sourceforge.net/lists/listinfo/open64-devel
Loading...