Subsections

Loop Optimizer

We started our work on new loop optimizer by replacing the part responsible for most structural changes - the unroller. We have postponed work on other optimizations (induction variable elimination, strength reduction) until we have mid-level RTL and SSA form that should make the work much easier.

We have implemented basic functions for CFG manipulation able to update loop structure (and also to keep dominator tree up-to-date, which is important for some of our optimizations).

Using them, we have created several useful functions for loop structure manipulation (copying of loop body to the edge coming into its header - operation common for both loop unrolling and peeling, removing the branch of if condition inside loop - for unswitching and clever loop unrolling, loopification of a part of the CFG, creating a canonical description of simple (in sense that we are able to determine number of their iterations) loops).

We are able to keep profiling information consistent in most of the times (the only place where we have problem is removing branches; it is hard to recalculate frequencies precisely afterwards, so we only use some approximation).

Loop Unrolling

By loop unrolling we mean copying the loop body several times. The result is that in a loop that iterates often enough we create longer run of instructions that are executed sequentially (without jumps). Such code is often executed faster.

while (1)
 {
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
 }


while (1)
 {
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
 }
\resizebox{!}{60mm}{\includegraphics{unroll_simple.ps}}

With the described machinery, unrolling is quite easy - it suffices to copy loop body on the edge from latch to header.

This stupid loop unrolling is not always a win; it makes code to grow, which may cause problems with caches, making the program to run slower. To avoid this, we only unroll small loops. With profile informations, we only unroll loops that are expected to iterate often enough and also disable the optimization on places that do not execute often enough.

We have also written some more clever variants of the basic loop unrolling scheme:

for (i = 0; i < 10; i++)
 printf ("%d\n", i);
$ \downarrow$
i = 0;
while (1)
 {
  printf ("%d\n", i++);
  if (i>=10)
   break;
  printf ("%d\n", i++);
  printf ("%d\n", i++);
 }




for (i = 0; i < k; i++)
 printf ("%d\n", i);
$ \downarrow$
i = 0;
m = (k - i) % 3;
if (!m--)
 goto beg;
if (i>=k)
 goto end;
printf ("%d\n", i++);
if (!m--)
 goto beg;
printf ("%d\n", i++);
beg:
while (i < k)
 {
  printf ("%d\n", i++);
  printf ("%d\n", i++);
  printf ("%d\n", i++);
 }
end:;
\resizebox{105mm}{!}{\includegraphics{unroll_clever.ps}}

Unrolling Loop With Known Number Of Iterations

If we are able to determine the number of loop iterations exactly, we can either peel the loop completely (removing the loop) if it is sufficiently small, or at least determine precisely the exit that will be taken out of the loop (by counting the number of iterations modulo the number of unrollings). We can then remove exit edges that we know that are not used.

Unrolling Loop With Number Of Iterations Known At Runtime

Often it happens that we are not able to count number of iterations in time of compilation, but we are able to count it in runtime. Then we can add code to ensure that number of iterations is divisible by number of unrollings (by peeling loop several times and adding conditions to check that correct number of iterations is remaining, then unroll the loop and remove exit tests from all but one copy of body.

This is not the whole story, as in this case we have to count with some degenerated instances. The easy one is that the loop may not roll at all; this we can check rather easily. More difficulties arise from the fact that we also have to check for overflows. Fortunately, if the number of times we unroll the loop is a power of $ 2$, this will not cause the problems (as overflows only occur in case the the final check is for equality and range of integer types is power of $ 2$, therefore modulo number of unrollings the result is the same).

In both of these cases we benefit not only from greater sequentiality of code, but also from removing the unnecessary checks and better possibilities for further optimizations.

Loop Peeling

while (1)
 {
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
 }
$ \rightarrow$
puts("Header");
if (should_end ())
 break;
puts("Latch");
puts("Header");
if (should_end ())
 break;
puts("Latch");
while (1)
 {
  puts("Header");
  if (should_end ())
   break;
  puts("Latch");
 }
\resizebox{60mm}{!}{\includegraphics{peel.ps}}

Peeling is similar to unrolling, but instead of copying loop to latch edge, we copy it to preheader edge, thus decreasing the number of loop iterations. Peeling is used if number of loop unrollings is small; the resulting code is again more sequential.

Loop Unswitching

i=10;
while (1)
 {
  if (flag)
   puts("True");
  else
   {
    puts("False");
    if (!i--)
     break;
   }



 i=10;
 if (flag)
  while (1)
   puts("True");
 else
  while (1)
   {
    puts("False");
    if (!i--)
     break;
   }
 }
\resizebox{!}{60mm}{\includegraphics{unswitch.ps}}

Loop unswitching is used in case when some condition tested inside loop is constant inside it. Then we can move the condition out of the loop and create the copy of the loop. Then in the original loop we can expect the condition to be true (and remove the corresponding branch) and in the copy to be false (and again remove the branch).

We must be a bit careful here, as if we had several opportunities for unswitching inside a single loop, the code size could grow exponentially in their number. Therefore, we must have some threshold over that we will not go.

Other interesting case is when we test for the same condition several times. We can then use our knowledge of which branch we are in to just eliminate the incorrect path; the only small problem associated with this is that we could create unreachable blocks, or even cancel the loop. Both of these cases are hard to handle, so we just revert to former case (it does not occur too often, and the following optimization passes will get rid of that much apparently unreachable code anyway).

Jan Hubicka 2003-05-04