Tracer [2] performs the tail duplication needed for superblock formation. It is usually used to aid scheduling, but we have not tested that because GCC does not have a general superblock scheduler (it has one only for IA-64); we plan to generalize it in the future. We do not expect scheduling to be profitable on the Athlon (where we tested everything) because the Athlon does the scheduling itself. On the Athlon we use Tracer to improve the effect of other optimizations. We do not need to worry about code growth too much because Cross-jumping removes the unneeded tail duplicates on which no optimizations were performed. A simple example of how Tracer helps other optimizations is the following code, where Tracer makes it possible for CSE to remove the second condition.
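The effect can be illustrated with a minimal sketch; the functions below are hypothetical and merely mirror, in Python, the control flow before and after tail duplication of the hot path (this is not GCC code):

```python
# A condition is tested twice.  After Tracer duplicates the join block along
# the hot trace, each copy of the second test sits on a path where the
# outcome of the first test is known, so CSE/jump threading can delete it.

def before(a):
    if a > 0:          # first test
        x = 1
    else:
        x = 2
    # join point: the second test is redundant on either incoming path
    if a > 0:          # second test of the same condition
        return x + 10
    return x + 20

def after_duplication(a):
    # shape of the code after tail duplication and CSE: the second test
    # has been folded away on each path
    if a > 0:
        x = 1
        return x + 10  # 'a > 0' is known to be true here
    x = 2
    return x + 20      # 'a > 0' is known to be false here
```

Both versions compute the same values; the duplicated version simply has no second branch for the hardware to predict.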
Trace finding starts from the basic block that is not yet in any trace and has the highest execution frequency; a Fibonacci heap is used to find such a block quickly. The trace is then grown backwards and forwards using the mutual-most-likely heuristic: for a block A that is followed by a block B in the trace, A must be B's most likely predecessor and B must be A's most likely successor. Backward growing also stops if the probability of the most likely predecessor is lower than MIN_BRANCH_RATIO, and forward growing also stops if the probability of the most likely successor is lower than MIN_BRANCH_PROBABILITY.
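The growing step can be sketched as follows; the CFG encoding, the threshold values, and the function name are assumptions made for illustration, not GCC's actual data structures:

```python
# Assumed threshold values; GCC's real values are compiler parameters.
MIN_BRANCH_PROBABILITY = 0.5
MIN_BRANCH_RATIO = 0.1

def grow_trace(seed, succs, preds, in_trace=frozenset()):
    """succs/preds map a block to a list of (neighbor, probability) pairs.
    Grows a trace around seed using the mutual-most-likely heuristic."""
    trace = [seed]
    used = set(trace) | set(in_trace)
    # Grow forwards: stop when the best successor is too unlikely, is
    # already placed, or does not have the current block as its most
    # likely predecessor.
    a = seed
    while succs.get(a):
        b, p = max(succs[a], key=lambda e: e[1])
        if b in used or p < MIN_BRANCH_PROBABILITY:
            break
        if max(preds[b], key=lambda e: e[1])[0] != a:
            break
        trace.append(b)
        used.add(b)
        a = b
    # Grow backwards symmetrically, with its own threshold.
    b = seed
    while preds.get(b):
        a, p = max(preds[b], key=lambda e: e[1])
        if a in used or p < MIN_BRANCH_RATIO:
            break
        if max(succs[a], key=lambda e: e[1])[0] != b:
            break
        trace.insert(0, a)
        used.add(a)
        b = a
    return trace

# A tiny example CFG: A branches to B (90%) and C (10%); both rejoin at D.
succs = {'A': [('B', 0.9), ('C', 0.1)], 'B': [('D', 1.0)],
         'C': [('D', 1.0)], 'D': []}
preds = {'B': [('A', 0.9)], 'C': [('A', 0.1)],
         'D': [('B', 0.9), ('C', 0.1)]}
```

Growing from either A or D yields the hot trace A, B, D; C is excluded because it fails the probability threshold.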
When we have a trace, we perform tail duplication. We visit the blocks on the trace in trace order, starting with the second block. If a block has more than one predecessor, it is duplicated (with all its outgoing edges) and the edge from the previous block of the trace is redirected to the duplicate. This is repeated until every basic block on the trace has been visited.
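Tail duplication along a trace can be sketched like this; the successor-map encoding and the function name are assumptions for illustration, and the example CFG uses the block names from the walkthrough below:

```python
def tail_duplicate(trace, succs):
    """succs maps each block to its list of successors.  Returns a new
    successor map with duplicates added along the given trace."""
    succs = {b: list(s) for b, s in succs.items()}

    def n_preds(b):
        # count edges pointing at b
        return sum(s == b for ss in succs.values() for s in ss)

    prev = trace[0]
    for b in trace[1:]:
        if n_preds(b) > 1:
            dup = b + "'"                       # fresh name for the copy
            succs[dup] = list(succs[b])         # copy all outgoing edges
            # redirect the in-trace edge to the duplicate
            succs[prev] = [dup if s == b else s for s in succs[prev]]
            prev = dup
        else:
            prev = b
    return succs

# Flow graph shaped like the example: S, then condition C1 with arms T1/F1
# rejoining at C2, then condition C2 with arms T2/F2 rejoining at E.
cfg = {'S': ['C1'], 'C1': ['T1', 'F1'], 'T1': ['C2'], 'F1': ['C2'],
       'C2': ['T2', 'F2'], 'T2': ['E'], 'F2': ['E'], 'E': []}
dup = tail_duplicate(['S', 'C1', 'T1', 'C2', 'T2', 'E'], cfg)
```

Running this on the trace S, C1, T1, C2, T2, E duplicates C2, T2 and E in turn, matching the walkthrough: duplicating C2 leaves T2 with two predecessors, which in turn forces T2 and then E to be duplicated.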
Trace finding and tail duplication are repeated until the traces cover the DYNAMIC_COVERAGE ratio of the dynamic instructions (the number of executed instructions in the function). The process also stops when the number of instructions exceeds MAXIMAL_CODE_GROWTH multiplied by the original number of instructions, because we have to keep code growth and compiler resources under control, although the unneeded duplicates will be eliminated later by Cross-jumping. Each of the remaining basic blocks forms a trace of its own.
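The two stopping conditions can be sketched as follows. The parameter values are assumptions, Python's heapq stands in for the Fibonacci heap, and growing a full trace is abbreviated to picking a seed block, so this only illustrates the bookkeeping, not the real pass:

```python
import heapq

DYNAMIC_COVERAGE = 0.9      # assumed value: cover 90% of executed instructions
MAXIMAL_CODE_GROWTH = 2.0   # assumed value: allow at most 2x static growth

def pick_seeds(blocks, freq, ninsns):
    """Return trace seed blocks in the order they would be picked.
    freq[b] is the execution count of block b, ninsns[b] its static
    instruction count.  heapq (a binary heap) stands in for GCC's
    Fibonacci heap keyed on execution frequency."""
    total_dynamic = sum(freq[b] * ninsns[b] for b in blocks)
    original_size = sum(ninsns[b] for b in blocks)
    heap = [(-freq[b], b) for b in blocks]   # max-heap via negated keys
    heapq.heapify(heap)
    covered, size, seeds = 0, original_size, []
    while (heap and covered < DYNAMIC_COVERAGE * total_dynamic
           and size <= MAXIMAL_CODE_GROWTH * original_size):
        negf, b = heapq.heappop(heap)
        seeds.append(b)
        covered += -negf * ninsns[b]  # dynamic instructions this trace covers
        size += ninsns[b]             # stand-in for the trace's duplicates
    return seeds
```

With one very hot block and two cold ones, a single trace already reaches the coverage threshold and the loop stops without touching the cold blocks.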
Finally, the traces are connected into one instruction stream. When there is an edge from the last basic block of one trace to the beginning of another, the two traces are connected. The remaining chunks are connected in the order of appearance. We do not need to worry about how well the traces are connected because Basic Block Reordering will order the blocks so that the code will be (almost) as fast as possible.
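The connection step can be sketched as follows; the trace and successor encodings are assumptions for illustration:

```python
def connect_traces(traces, succs):
    """traces: list of traces (each a list of blocks); succs maps a block
    to its successors.  Chains a trace onto another when the first trace's
    last block has an edge to the second trace's head; leftover traces are
    appended in order of appearance."""
    heads = {t[0]: i for i, t in enumerate(traces)}
    used, order = set(), []
    for i, t in enumerate(traces):
        if i in used:
            continue
        chain = list(t)
        used.add(i)
        while True:
            # look for an unused trace whose head is a successor of the
            # current chain's last block
            nxt = next((heads[s] for s in succs.get(chain[-1], [])
                        if s in heads and heads[s] not in used), None)
            if nxt is None:
                break
            chain += traces[nxt]
            used.add(nxt)
        order += chain
    return order
```

For example, with traces [A, B], [C, D], [E] and an edge B→C, the first two traces are chained and E is appended afterwards.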
Figure .a shows the original flow graph for the example code on page . We start finding the trace in basic block S since it has the highest execution frequency. The trace is grown forwards by basic blocks C1, T1, C2, T2 and E. While visiting the blocks on the trace in trace order we duplicate block C2 because it has two predecessors. Basic block T2 now has two predecessors because of the duplication of C2 (see figure .b), so we duplicate it too (see figure .c). Finally we duplicate basic block E as well. Figure .d shows the result after tail duplication of the first trace.
We start finding the next trace in E. The trace grows backwards by basic block F2. It cannot grow by C2 because F2 is not the most likely successor of C2. Basic block E is then duplicated.
The finding of the next trace starts in basic block F1 and grows forwards by blocks C2, T2 and E. All basic blocks in this trace have one predecessor, so no basic blocks are duplicated now.
There are no more basic blocks left. The first trace is S, C1, T1, C2', T2', E',
the second one is F2, E'' and the last one is F1, C2, T2, E.
Jan Hubicka 2003-05-04