Although profile feedback provides very accurate information about program behavior, it has one major drawback: it is difficult to use. The majority of users are not willing to construct train runs for their programs and compile twice, and in some cases (such as interactive programs, programs whose behavior depends on configuration, or embedded systems) this is even impossible to do.
An attractive alternative to feedback-based profiles is static branch estimation. Ball and Larus [3] describe a set of simple heuristics that predict the direction of conditional branches (taken or not taken) correctly for a large majority of executed branches, a remarkable result even when compared to the upper bound (a perfect predictor magically knowing the actual program behavior in advance).
Wu and Larus [4] extended this method to estimate conditional branch outcomes (the probability that a branch is taken) and to propagate this information over the intra-procedural control flow graph, computing the expected number of executions of each basic block. They obtained satisfactory results, correctly predicting the majority of hot spots in a program.
The original branch prediction code based on the Ball and Larus heuristics was integrated into GCC as part of the IA-64 porting project by Stan Cox and Jason Eckhart, but it later proved to be seriously broken in our experiments: it predicted loop branches in the wrong direction, markedly reducing the rate of successful predictions, and it implemented just a subset of the predictions described in the paper.
The goal of our implementation was to allow the front end and early optimization passes to participate in the branch prediction process. This differs from the Ball and Larus approach, which used the final assembly exclusively for its heuristics, most probably because the sources of the actual compiler were not available to the researchers. Because more information is available inside the compiler than one can easily derive from the resulting code, we can predict better. We can also reuse the analyzers implemented in existing optimizer passes.
Our approach has three major passes. In the first pass, a number of heuristics are tried on each conditional jump, and if one of them predicts the direction of the jump (taken or not taken), the information is stored in the conditional jump instruction itself.
In the second pass, we combine all the gathered information into a single value, the outcome: the estimated probability that the given conditional jump will be taken.
In the last pass, we propagate the information into block coverage (see chapter ), constructing data equivalent to those measured by the profiler. Optimization passes may then handle estimated and feedback-based profiles identically.
We added the possibility to attach a REG_BR_PRED note to each conditional branch in the RTL instruction stream. The note contains two values: an identifier of the predictor algorithm, for later analysis, and the outcome the branch is supposed to have.
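The shape of such a note can be sketched as follows. This is a hypothetical illustration, not GCC's actual data structures; the predictor names and the fixed-point probability scale are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical sketch (not GCC's real internals): a prediction note
   pairs a predictor identifier with the predicted outcome, expressed
   here as the taken-probability in units of 1/10000.  */
enum predictor_id { PRED_LOOP_BRANCH, PRED_POINTER, PRED_OPCODE, PRED_CALL };

struct br_pred_note {
  enum predictor_id predictor;  /* which heuristic fired */
  int probability;              /* predicted taken-probability, 0..10000 */
};

/* A conditional jump may accumulate several notes, one per heuristic
   that matched; a later pass combines them into a single outcome.  */
struct cond_jump {
  struct br_pred_note notes[8];
  int n_notes;
};

static void predict_insn(struct cond_jump *insn,
                         enum predictor_id pred, int prob)
{
  insn->notes[insn->n_notes].predictor = pred;
  insn->notes[insn->n_notes].probability = prob;
  insn->n_notes++;
}
```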
Each compilation pass run before the profile estimation pass may freely predict any conditional jump in the code. For instance, Ball and Larus suggest predicting a conditional jump around a loop as not taken. This heuristic comes from the observation that most compilers copy the test of while(cond)-style loops in front of the loop construct, and loops tend to iterate. In our implementation, the loop test duplication pass simply predicts each copied test, improving the characteristics of this particular heuristic.
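The transformation in question can be illustrated at the source level; the two functions below are equivalent, but in the rotated form the copied guard (the branch around the loop) and the back-edge test become separate branches that can be predicted independently. This is a sketch of the shape of the transformation, not GCC's actual pass.

```c
#include <assert.h>

/* Original form: one test serves both as the guard and the back edge.  */
static int sum_while(const int *a, int n)
{
  int i = 0, sum = 0;
  while (i < n) { sum += a[i]; i++; }
  return sum;
}

/* After loop test duplication: the guard is copied in front of a
   do-while body.  The copied guard is the "jump around the loop"
   (predicted not taken); the back-edge test tends to be taken.  */
static int sum_rotated(const int *a, int n)
{
  int i = 0, sum = 0;
  if (i < n)                                  /* copied guard */
    do { sum += a[i]; i++; } while (i < n);   /* back edge */
  return sum;
}
```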
We found it difficult to judge whether a given heuristic is good or not. To allow easy evaluation of new heuristics, we implemented the tool analyze_branches, which summarizes the behavior of all GCC heuristics based on real program behavior read from profile feedback. It is run as

analyze_branches *.bp

on the produced files to summarize information about the heuristics.
Information is gathered about each of the implemented prediction heuristics, including its dynamic hitrate and coverage. Obviously, for a heuristic to be successful, both its dynamic hitrate and its coverage should be as high as possible.
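The two metrics can be defined as follows. This is an illustrative sketch under assumed definitions (weighting by execution counts from the profile), not the exact output of analyze_branches:

```c
#include <assert.h>

/* Assumed definitions, dynamic (execution-count weighted):
     coverage = executions of branches the heuristic predicted
                / executions of all branches;
     hitrate  = executions where the predicted direction matched the
                actual direction / executions the heuristic predicted.  */
static double dyn_coverage(long predicted_execs, long total_execs)
{
  return total_execs ? (double) predicted_execs / total_execs : 0.0;
}

static double dyn_hitrate(long correct_execs, long predicted_execs)
{
  return predicted_execs ? (double) correct_execs / predicted_execs : 0.0;
}
```

A heuristic with high hitrate but tiny coverage rarely fires; one with high coverage but a hitrate near one half carries little information.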
Many run-time properties of a program can be guessed from its source code, but some of these properties are either completely lost or at least heavily mangled by translation into the low-level RTL representation. For example, use of the goto statement usually indicates some exceptional situation, such as error handling; however, a jump created by a goto statement cannot be distinguished from any other unconditional jump in the program. We can easily detect the statement during RTL generation, but using this information immediately would be hard and cumbersome.
To enable the use of these properties, we instead pass the information to later stages by emitting a special note (NOTE_INSN_PREDICTION) into the insn stream. The note also carries additional information determining the kind of prediction and, optionally, some other attributes. In an early stage of the compilation (to avoid the carried information being influenced by optimizations), we transform these notes into REG_PREDICTION notes placed directly on the conditional jumps responsible for reaching the predicted code (this property can be expressed as being at the end of the nearest basic block that dominates the block containing the note but is not postdominated by it).
Once all predictions are done, the prediction notes on each instruction must be combined into the overall outcome of the branch. When no heuristic applies to a branch, we give it a neutral outcome and, for purposes of debug output, refer to this as the ``no prediction heuristic''.
When a single prediction applies, the situation is simple as well. The situation is complicated when multiple heuristics apply to a single instruction. The method used in the Ball and Larus paper is to take the first matching heuristic in a fixed priority order. While this method has proved to work quite well, it is difficult to set up the order of heuristics, making development somewhat difficult. The Wu and Larus paper suggests the use of the Dempster-Shafer theory of evidence:
We may treat each prediction as evidence for the hypothesis that a given branch is taken. When there are two predictions with two different hitrates p1 and p2, the combined probability can be computed as

  p = p1*p2 / (p1*p2 + (1 - p1)*(1 - p2)).
Our method uses a combination of both approaches. For strong heuristics, such as the loop heuristics, we use the first-match method, and for weak heuristics we combine the outcomes. When at least one strong heuristic applies to the branch, we use the first one; when none applies, we use the Dempster-Shafer method to combine the values. This appears to bring the best of both worlds.
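The combination step above can be sketched as follows. Probabilities are plain doubles in (0, 1) here for clarity (GCC itself uses fixed-point units), and the function names and array-based interface are inventions of this example:

```c
#include <assert.h>

/* Dempster-Shafer combination of two independent predictions with
   hitrates p1 and p2:
     p = p1*p2 / (p1*p2 + (1-p1)*(1-p2))                              */
static double ds_combine(double p1, double p2)
{
  return p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2));
}

/* First match for strong heuristics; Dempster-Shafer for weak ones.
   Starting from the neutral prior 0.5 is safe because
   ds_combine(0.5, w) == w for any w.                                 */
static double combine_outcome(const double *strong, int n_strong,
                              const double *weak, int n_weak)
{
  if (n_strong > 0)
    return strong[0];          /* first matching strong heuristic wins */
  double p = 0.5;              /* neutral: no information yet */
  for (int i = 0; i < n_weak; i++)
    p = ds_combine(p, weak[i]);
  return p;
}
```

Note how two agreeing weak predictions reinforce each other: combining two predictions of 0.8 yields roughly 0.94, more confident than either alone.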
A perfect solution is not possible, as there are too many heuristics to measure outcomes for all possible combinations. [10] suggests using a neural-network-based method to combine the values, but we decided not to use it, as the large complexity of that solution brings relatively little extra value. GCC needs to be understood by its developers, and we guess most of them are not experts in artificial intelligence.
Andreas Jaeger has set up a tester (http://www.sue.de/~aj/SPEC) that runs the SPEC2000 benchmark suite weekly on the current GCC tree and stores the results of analyze_branches in a log file. We use those results as hitrate values for the prediction-combining algorithm. predict.def describes all heuristics, their names, hitrates, and prediction combination methods.
The exact implementation is described in algorithm .
Table  lists the heuristics we have implemented and the parameters for the branch-combining algorithm.
High-level heuristics are the ones supplied by the front end and based on the source language. Currently we have extended only the C and C++ front ends to do so, but in the future we plan to add heuristics to the Java and Fortran front ends as well, for instance to handle array bounds checking code or exception handling.
The __builtin_expect construct allows the programmer to explicitly state the expected outcome of a given conditional.
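A typical use of this GCC extension is to mark an error path as unlikely, so the compiler lays the hot path out as the fall-through; the xmalloc wrapper below is a common idiom, shown here as an example rather than any particular project's code:

```c
#include <stdio.h>
#include <stdlib.h>

/* malloc wrapper: the programmer asserts that allocation failure is
   rare, so __builtin_expect(cond, 0) predicts the branch not taken.  */
void *xmalloc(size_t size)
{
  void *p = malloc(size);
  if (__builtin_expect(p == NULL, 0))   /* unlikely error path */
    {
      fprintf(stderr, "out of memory\n");
      exit(1);
    }
  return p;
}
```

The second argument is the expected value of the first; the semantics of the program are unchanged, only the prediction (and hence block layout) is affected.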
The edges belonging to natural loops are among the most predictable edges in the program.
The Ball and Larus paper suggests predicting all edges leaving a loop as not taken, using a single loop heuristic based on the simple observation that a program must loop in order to do useful work and consume CPU time.
We have decided to split this single heuristic into multiple special cases in order to increase the hitrate on common loop branches, which in turn increases the predicted number of iterations of each loop, allowing us to optimize more aggressively.
We bypass this heuristic for edges marked by the continue heuristic, as discussed earlier.
Unlike the Ball and Larus heuristics, we do not mark branches around the loop, as the explicit conditionals are often limit checks; this increases the hitrate of the predictor at the expense of reducing its coverage.
For pointers a and b, we predict a==b and a==NULL to be false. The negations of these conditions are predicted to be true.
Most values in a program are positive, so for an integer a, we predict a>0, a>=0, a>=1, and a>=-1 to be true. The negations of these conditions are predicted to be false.
Once the probabilities of all edges in the CFG are known, we produce the estimated profile. In an acyclic graph this is easily done by giving the entry node frequency 1 and walking the graph downward, computing the frequency of each node as the sum, over its predecessors, of the predecessor's frequency (once known) multiplied by the probability of the corresponding edge.
In a cyclic graph, exact propagation is difficult and time consuming. We adopt the simple algorithm described in [4]: walk the loop tree from the innermost to the outermost loop, and at each level compute the estimated number of iterations by determining the probability that the loop-back edge is reached when the header block is executed. This can be done by a simple modification of the algorithm above: when the header of an inner loop is reached, its estimated frequency is multiplied by the already predicted number of iterations.
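The acyclic part of the propagation can be sketched on a toy representation (assumed for this example, not GCC's CFG data structures): blocks are numbered in topological order, and edge[i][j] holds the probability of edge i -> j, or 0 when the edge is absent.

```c
#include <assert.h>

#define NBLOCKS 4

/* Propagate frequencies over an acyclic CFG whose blocks are already
   in topological order: the entry block gets frequency 1, and every
   other block's frequency is the sum over its predecessors of
   freq(pred) * prob(pred -> block).  */
static void estimate_freq(const double edge[NBLOCKS][NBLOCKS],
                          double freq[NBLOCKS])
{
  freq[0] = 1.0;                      /* entry block */
  for (int b = 1; b < NBLOCKS; b++)
    {
      freq[b] = 0.0;
      for (int p = 0; p < b; p++)     /* all predecessors precede b */
        freq[b] += freq[p] * edge[p][b];
    }
}
```

On a diamond (block 0 branches to 1 with probability 0.7 and to 2 with probability 0.3, both rejoining in 3) this yields frequencies 1, 0.7, 0.3, and 1 again; the loop-tree extension described above multiplies an inner loop header's frequency by its predicted iteration count before continuing outward.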
We found this algorithm fast and its results satisfactory, as the majority of functions in practice have reducible flow graphs (i.e., non-natural loops do not exist).
Jan Hubicka 2003-05-04