Tracer [2] performs the tail duplication needed for superblock formation. It is usually used to aid scheduling, but we have not tested that because GCC does not have a general superblock scheduler (it has one only for IA-64); we plan to generalize it in the future. We do not expect scheduling to be profitable on the Athlon (where we tested everything) because the Athlon does the scheduling itself. On the Athlon we use Tracer to improve the effect of other optimizations. We do not need to worry about code growth too much because Cross-jumping removes the unneeded tail duplicates on which no optimizations were performed. A simple example of how Tracer helps other optimizations is the following code, where Tracer makes it possible for CSE to remove the second condition.
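The effect can be illustrated with a minimal sketch; the functions below are hypothetical and merely mirror, in Python, the control flow before and after tail duplication of the hot path (this is not GCC code):

```python
# A condition is tested twice.  After Tracer duplicates the join block along
# the hot trace, each copy of the second test sits on a path where the
# outcome of the first test is known, so CSE/jump threading can delete it.

def before(a):
    if a > 0:          # first test
        x = 1
    else:
        x = 2
    # join point: the second test is redundant on either incoming path
    if a > 0:          # second test of the same condition
        return x + 10
    return x + 20

def after_duplication(a):
    # shape of the code after tail duplication and CSE: the second test
    # has been folded away on each path
    if a > 0:
        x = 1
        return x + 10  # 'a > 0' is known to be true here
    x = 2
    return x + 20      # 'a > 0' is known to be false here
```

Both versions compute the same values; the duplicated version simply has no second branch for the hardware to predict.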
Trace finding starts from the basic block that is not yet in any trace and has the highest execution frequency; a Fibonacci heap is used to find such a block quickly. The trace is then grown backwards and forwards using the mutual-most-likely heuristic: for a block A that is followed by a block B in the trace, A must be B's most likely predecessor and B must be A's most likely successor. Backward growing also stops if the probability of the most likely predecessor is lower than MIN_BRANCH_RATIO, and forward growing also stops if the probability of the most likely successor is lower than MIN_BRANCH_PROBABILITY.
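The growing step can be sketched as follows; the CFG encoding, the threshold values, and the function name are assumptions made for illustration, not GCC's actual data structures:

```python
# Assumed threshold values; GCC's real values are compiler parameters.
MIN_BRANCH_PROBABILITY = 0.5
MIN_BRANCH_RATIO = 0.1

def grow_trace(seed, succs, preds, in_trace=frozenset()):
    """succs/preds map a block to a list of (neighbor, probability) pairs.
    Grows a trace around seed using the mutual-most-likely heuristic."""
    trace = [seed]
    used = set(trace) | set(in_trace)
    # Grow forwards: stop when the best successor is too unlikely, is
    # already placed, or does not have the current block as its most
    # likely predecessor.
    a = seed
    while succs.get(a):
        b, p = max(succs[a], key=lambda e: e[1])
        if b in used or p < MIN_BRANCH_PROBABILITY:
            break
        if max(preds[b], key=lambda e: e[1])[0] != a:
            break
        trace.append(b)
        used.add(b)
        a = b
    # Grow backwards symmetrically, with its own threshold.
    b = seed
    while preds.get(b):
        a, p = max(preds[b], key=lambda e: e[1])
        if a in used or p < MIN_BRANCH_RATIO:
            break
        if max(succs[a], key=lambda e: e[1])[0] != b:
            break
        trace.insert(0, a)
        used.add(a)
        b = a
    return trace

# A tiny example CFG: A branches to B (90%) and C (10%); both rejoin at D.
succs = {'A': [('B', 0.9), ('C', 0.1)], 'B': [('D', 1.0)],
         'C': [('D', 1.0)], 'D': []}
preds = {'B': [('A', 0.9)], 'C': [('A', 0.1)],
         'D': [('B', 0.9), ('C', 0.1)]}
```

Growing from either A or D yields the hot trace A, B, D; C is excluded because it fails the probability threshold.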
When we have a trace, we perform tail duplication. We visit the blocks on the trace in trace order, starting with the second block. If a block has more than one predecessor, it is duplicated (with all its outgoing edges) and the edge from the previous block of the trace is redirected to the duplicate. This is repeated until every basic block on the trace has been visited.
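Tail duplication along a trace can be sketched like this; the successor-map encoding and the function name are assumptions for illustration, and the example CFG uses the block names from the walkthrough below:

```python
def tail_duplicate(trace, succs):
    """succs maps each block to its list of successors.  Returns a new
    successor map with duplicates added along the given trace."""
    succs = {b: list(s) for b, s in succs.items()}

    def n_preds(b):
        # count edges pointing at b
        return sum(s == b for ss in succs.values() for s in ss)

    prev = trace[0]
    for b in trace[1:]:
        if n_preds(b) > 1:
            dup = b + "'"                       # fresh name for the copy
            succs[dup] = list(succs[b])         # copy all outgoing edges
            # redirect the in-trace edge to the duplicate
            succs[prev] = [dup if s == b else s for s in succs[prev]]
            prev = dup
        else:
            prev = b
    return succs

# Flow graph shaped like the example: S, then condition C1 with arms T1/F1
# rejoining at C2, then condition C2 with arms T2/F2 rejoining at E.
cfg = {'S': ['C1'], 'C1': ['T1', 'F1'], 'T1': ['C2'], 'F1': ['C2'],
       'C2': ['T2', 'F2'], 'T2': ['E'], 'F2': ['E'], 'E': []}
dup = tail_duplicate(['S', 'C1', 'T1', 'C2', 'T2', 'E'], cfg)
```

Running this on the trace S, C1, T1, C2, T2, E duplicates C2, T2 and E in turn, matching the walkthrough: duplicating C2 leaves T2 with two predecessors, which in turn forces T2 and then E to be duplicated.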
Trace finding and tail duplication are repeated until the traces cover the DYNAMIC_COVERAGE ratio of the dynamic instructions (the number of executed instructions in the function). The process also stops when the number of instructions exceeds MAXIMAL_CODE_GROWTH multiplied by the original number of instructions, because we have to keep code growth and compiler resources under control, although the unneeded duplicates will be eliminated later by Cross-jumping. Each of the remaining basic blocks forms a trace of its own.
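The two stopping conditions can be sketched as follows. The parameter values are assumptions, Python's heapq stands in for the Fibonacci heap, and growing a full trace is abbreviated to picking a seed block, so this only illustrates the bookkeeping, not the real pass:

```python
import heapq

DYNAMIC_COVERAGE = 0.9      # assumed value: cover 90% of executed instructions
MAXIMAL_CODE_GROWTH = 2.0   # assumed value: allow at most 2x static growth

def pick_seeds(blocks, freq, ninsns):
    """Return trace seed blocks in the order they would be picked.
    freq[b] is the execution count of block b, ninsns[b] its static
    instruction count.  heapq (a binary heap) stands in for GCC's
    Fibonacci heap keyed on execution frequency."""
    total_dynamic = sum(freq[b] * ninsns[b] for b in blocks)
    original_size = sum(ninsns[b] for b in blocks)
    heap = [(-freq[b], b) for b in blocks]   # max-heap via negated keys
    heapq.heapify(heap)
    covered, size, seeds = 0, original_size, []
    while (heap and covered < DYNAMIC_COVERAGE * total_dynamic
           and size <= MAXIMAL_CODE_GROWTH * original_size):
        negf, b = heapq.heappop(heap)
        seeds.append(b)
        covered += -negf * ninsns[b]  # dynamic instructions this trace covers
        size += ninsns[b]             # stand-in for the trace's duplicates
    return seeds
```

With one very hot block and two cold ones, a single trace already reaches the coverage threshold and the loop stops without touching the cold blocks.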
Finally, the traces are connected into one instruction stream. When there is an edge from the last basic block of one trace to the beginning of another, the two traces are connected. The remaining chunks are connected in the order of appearance. We do not need to worry about how well the traces are connected because Basic Block Reordering will order the blocks so that the code will be (almost) as fast as possible.
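The connection step can be sketched as follows; the trace and successor encodings are assumptions for illustration:

```python
def connect_traces(traces, succs):
    """traces: list of traces (each a list of blocks); succs maps a block
    to its successors.  Chains a trace onto another when the first trace's
    last block has an edge to the second trace's head; leftover traces are
    appended in order of appearance."""
    heads = {t[0]: i for i, t in enumerate(traces)}
    used, order = set(), []
    for i, t in enumerate(traces):
        if i in used:
            continue
        chain = list(t)
        used.add(i)
        while True:
            # look for an unused trace whose head is a successor of the
            # current chain's last block
            nxt = next((heads[s] for s in succs.get(chain[-1], [])
                        if s in heads and heads[s] not in used), None)
            if nxt is None:
                break
            chain += traces[nxt]
            used.add(nxt)
        order += chain
    return order
```

For example, with traces [A, B], [C, D], [E] and an edge B→C, the first two traces are chained and E is appended afterwards.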
Figure .a shows the original flow graph for the example code on page . We start finding the trace in basic block S since it has the highest execution frequency. The trace is grown forwards by basic blocks C1, T1, C2, T2 and E. While visiting the blocks on the trace in trace order we duplicate block C2 because it has two predecessors. Basic block T2 now has two predecessors because of the duplication of C2 (see figure .b), so we duplicate it too (see figure .c). Finally we duplicate basic block E as well. Figure .d shows the result after tail duplication of the first trace.
We start finding the next trace in E. The trace grows backwards by basic block F2. It cannot grow by C2 because F2 is not the most likely successor of C2. Basic block E is then duplicated.
The finding of the next trace starts in basic block F1 and grows forwards by blocks C2, T2 and E. All basic blocks in this trace have one predecessor, so no basic blocks are duplicated now.
There are no more basic blocks left. The first trace is S, C1, T1, C2', T2', E',
the second one is F2, E'' and the last one is F1, C2, T2, E.
Jan Hubicka 2003-05-04