Subsections

Software Trace Cache

Theory

We use a greedy algorithm similar to Software Trace Cache [7] for basic block reordering. It reorders basic blocks to minimize the number of branches and instruction cache misses and does also a limited amount of code duplication during the process to avoid jumps. The effect is the increase of code sequentiality and thus also the increase of the speed of code execution.

The algorithm is based on profile information. It means that the results depend on the representativity of the training inputs. Thus the code can be worse for some input sets. If you have not profiled the program the algorithm uses the probabilities of branches estimated by a set of heuristics instead of profile information.

We minimize the number of (executed) branches by constructing the most popular sequences of basic blocks (the traces). To do that we are reordering the most probable successor after its predecessor. If the most probable successor has been already laid out we duplicate it to avoid jump if we are not duplicating a large chunk of code. A trivial example of duplication is copying the return instruction instead of jumping to it.

Implementation

The implementation does only intra-procedural optimizations since inter-procedural optimizations are out of reach of current GCC.

The algorithm constructs traces in several rounds. The construction of traces starts from the selected starting basic blocks (seeds). The seeds for the first round are the entry points of the function. When there are at least two seeds the seed with the highest execution frequency is selected first (this places the most frequent traces near each other to avoid instruction cache misses). Then the algorithm constructs a trace from the seed by following the most probable successor. Finally it connects the traces.

As we mentioned before the most frequently executed seed is selected. But there are some exceptions: The seeds whose predecessors already are in any trace or the predecessor edge is a DFS back edge are selected before any other seeds. This heuristics tries to create longer traces than the algorithm would create without this heuristics. In figure [*], the blocks H, A, B and E are already in a trace, the seeds are C and D. The seed C will be selected first because all its predecessors are already in some trace, even if the execution frequency of D is higher.

Figure: Selecting seed `C' before `D'
\scalebox{.5}{
\includegraphics{stc_seed.ps}}

Given a basic block, the algorithm follows the most probable successor from the set of all unvisited (ie. not in any trace) successors and visited successors that can be copied. The decision whether the visited block can be copied is done by simulation of the trace building and counting the number of bytes that would be copied. The block can be copied only if the size that would be copied is lower than a threshold. The threshold depends on the size of the jump instruction and on the frequency of the block.

While constructing the trace the algorithm uses two parameters: Branch Threshold and Exec Threshold. These parameters are used while scanning the successors of the given block. If an edge to a successor of the actual basic block is lower than Branch Threshold or the frequency of the successor is lower than Exec Threshold the successor will be the seed in one of the next rounds. Each round has the Branch Threshold and Exec Threshold less restrictive than the previous one. The last round has to have these parameters set to zero so that the last round could "pick up" remaining blocks. The successors that has not been selected as a continuation of the trace and has not been "sent to" next rounds will be added to seeds of current round and the secondary traces will start in them.

There are several cases of the selected successor's state (the action for the first matching state will be done):

Figure [*] shows an example of the flow graph. We use an Exec Threshold of 50 and Branch Th. of 30% in the first round. Given a seed A we follow the edge to H1, and H2 is sent to next round because of Branch Th. We continue to L1 from H1, B1 is sent to next round. The best (and the only) successor from L1 is H1. We have found a loop that has enough iterations so we rotate it and terminate the trace. There are no more seeds in this round so the next round starts. We use Branch Th. of 0% and Exec Th. of 0 so this is the last round. The seed B1 has higher exec count than H2 so the next trace starts in B1. We follow the edge to R and terminate the trace because the block R is the exit block of the function (in GCC representation, there is an edge from R to EXIT_BLOCK). Next trace starts in H2 and continues to L2, B2 will be a seed for secondary trace. In L2, we find a loop that does not have enough iterations so we just terminate the trace. The last trace starts in B2. The successor R is already in some trace so we check whether we can duplicate it and realize that we can. So we duplicate it. Since it is the exit block of function, we terminate the trace. There are no more unvisited blocks so the trace building is done.

Figure: Trace building
\scalebox{.5}{
\includegraphics{stc_traces.ps}}

Finally we connect the traces. We start with the first trace (that contains the entry block of function). Given a chunk of connected traces we prolong it. If there is an edge from the last block of the chunk to the first block of some trace that trace will be added to the end of the chunk. If there is no such trace the next unconnected trace is added to the chunk.

For the graph in figure [*], the traces happen to be connected in the same order as they were found.

Jan Hubicka 2003-05-04