Code Alignment

Modern CPUs usually suffer instruction decoder stalls when a branch target lies near a code cache-line boundary. To avoid the problem, it is recommended that the targets of common branches and the bodies of loops be aligned to start at a new cache-line boundary. Also, when a branch target follows a rarely executed code block, placing the target just before a cache-line boundary is wasteful because it pollutes the code cache.

Old GCC contained code to align all loops found in the code, all function bodies and all code following unconditional jumps to values specified by the target machine description. This strategy is, however, somewhat wasteful. For instance, the AMD Athlon recommends aligning to a 32-byte boundary, which wastes 16 bytes per alignment on average and up to 20% of the code size.
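The 16-byte figure can be checked with a little arithmetic. The sketch below is only an illustration and is not taken from GCC: padding_for and average_padding are hypothetical helper names, and it assumes that every byte offset within an aligned block is an equally likely starting point.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed to move `addr` up to the next multiple of
   `align`.  `align` must be a power of two.  (Hypothetical helper.) */
size_t padding_for(size_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (addr & (align - 1))) & (align - 1);
}

/* Average padding over all starting offsets 0 .. align-1, assuming
   each offset is equally likely.  (Hypothetical helper.) */
double average_padding(size_t align)
{
    size_t total = 0;
    for (size_t off = 0; off < align; off++)
        total += padding_for(off, align);
    return (double)total / (double)align;
}
```

Under that assumption average_padding(32) comes out just under 16 bytes, which matches the figure quoted for the Athlon.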

We have implemented a new pass that uses profile information to place alignments carefully. It uses the following set of conditions:

If the basic block is not reached via a fallthru edge, check that:
- the basic block is likely to be executed at least once
- the sum of branch frequencies is very high (1/10th of the maximum inside the function) or the previous basic block is unlikely to be executed
If both conditions hold, the basic block is aligned to the same alignment the old code used for an instruction following an unconditional jump.

If the basic block is reached by both a fallthru edge and a branch edge, check that:
- the basic block is in a hot spot of the program
- the sum of the frequencies of the branch edges is at least 5 times higher than the fallthru edge frequency
- the basic block is in a very hot spot of the function
If all the conditions are met, the basic block is aligned to the loop alignment.
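
The two sets of conditions can be sketched as a single decision function. This is a minimal illustration, not the code of the pass itself: struct block_profile, its fields, enum align_kind and choose_alignment are all hypothetical names, the hotness tests are reduced to precomputed flags, and "very high" is read as "at least 1/10th of the maximum frequency in the function".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which alignment to request for a block.  (Hypothetical names.) */
enum align_kind { ALIGN_NONE, ALIGN_JUMP, ALIGN_LOOP };

/* Profile summary for one basic block.  All fields are assumptions
   made for this sketch. */
struct block_profile {
    bool has_fallthru_pred;     /* reached via a fallthru edge */
    bool has_branch_pred;       /* reached via a branch edge */
    bool likely_executed;       /* likely to run at least once */
    bool prev_block_cold;       /* previous block unlikely to be executed */
    bool hot_in_program;        /* in a hot spot of the program */
    bool very_hot_in_function;  /* in a very hot spot of the function */
    long branch_freq;           /* sum of incoming branch edge frequencies */
    long fallthru_freq;         /* frequency of the fallthru edge */
    long max_freq;              /* maximum frequency inside the function */
};

enum align_kind choose_alignment(const struct block_profile *bb)
{
    assert(bb != NULL);

    if (!bb->has_fallthru_pred) {
        /* Case 1: block is only entered by branches. */
        bool high_freq = bb->branch_freq * 10 >= bb->max_freq;
        if (bb->likely_executed && (high_freq || bb->prev_block_cold))
            return ALIGN_JUMP;
        return ALIGN_NONE;
    }

    /* Case 2: block is entered by both a fallthru and a branch edge. */
    if (bb->has_branch_pred
        && bb->hot_in_program
        && bb->branch_freq >= 5 * bb->fallthru_freq
        && bb->very_hot_in_function)
        return ALIGN_LOOP;

    return ALIGN_NONE;
}
```

For example, a block entered only by branches whose summed frequency is 100 in a function with maximum frequency 1000 gets ALIGN_JUMP, while a loop header with branch frequency 499 and fallthru frequency 100 stays unaligned because it misses the 5x threshold.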

We have found that the new pass limits code growth to 5% while maintaining approximately the same performance in all benchmarks.

Jan Hubicka 2003-05-04