The effect of many optimizations can be maximized by identifying typical execution paths and hot spots. A typical case is branch prediction, where we need to decide which instruction is most likely to follow. If we predict correctly, the program runs faster, because the required instructions are prefetched and decoded in advance; a wrong prediction forces the prefetched instructions to be discarded and therefore costs performance. But this is not the only case -- almost any optimization benefits from such information.
One way of achieving this is to run the program with some input data and record the execution path, or, more usually, the block or branch coverage.
Block coverage measures the number of times each basic block is entered, while branch coverage records the number of times each branch is taken. Block coverage can obviously be computed from branch coverage, simply by summing the execution counts of the incoming branches of a given basic block.
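As an illustration, here is a minimal sketch of that computation; the edge structure and function name are made up for this example and are not GCC's internal data structures.

#include <stdio.h>

/* Hypothetical edge record: an edge from block `src' to block `dst'
   that was taken `count' times.  */
struct edge { int src, dst; long long count; };

/* Block coverage derived from branch coverage: the execution count of
   a basic block is the sum of the counts of its incoming edges.  */
static void
block_counts_from_edges (const struct edge *edges, int n_edges,
                         long long *block_count, int n_blocks)
{
  for (int b = 0; b < n_blocks; b++)
    block_count[b] = 0;
  for (int e = 0; e < n_edges; e++)
    block_count[edges[e].dst] += edges[e].count;
}

int
main (void)
{
  /* Two edges entering block 2: its count is 60 + 40 = 100.  */
  struct edge edges[] = { { 0, 2, 60 }, { 1, 2, 40 } };
  long long block_count[3];
  block_counts_from_edges (edges, 2, block_count, 3);
  printf ("block 2 executed %lld times\n", block_count[2]);
  return 0;
}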
The best results would be obtained if we could run the program on the actual input data, record the coverage, and then recompile the program with feedback from the measured data. However, this is impossible in most cases, and it makes no sense to execute the same program on the same input data twice. Therefore a different approach is used: a set of typical input data (called trial data) is created, coverage is measured for each run, and all runs are merged into a hybrid profile, which should represent the coverage of a typical run. The final compilation takes this profile into account, and if the trial data were well chosen, the compiled code should run faster.
Previously, gcc contained support for measuring block coverage, but it was broken. More recently, support for measuring branch coverage was introduced, but several fixes were needed -- mainly thread-safe profiling and some kind of source code identification, so that the measured data cannot be misinterpreted.
To support profiling, special code is inserted on the edges of the CFG that counts the number of times each edge was taken. This is called instrumentation. Additionally, some initialization and finalization code is added to record the results. This slows down program execution a bit (by about 5%).
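The following runnable sketch shows what instrumented code conceptually does; the counter array, the edge indices and the dump function are invented for this illustration and differ from what gcc actually emits.

#include <stdio.h>
#include <stdlib.h>

/* One counter per instrumented CFG edge.  */
static long long edge_counter[2];

static void
dump_counters (void)
{
  /* The real finalization code appends the counters to the module's
     data file; here we just print them.  */
  for (int i = 0; i < 2; i++)
    printf ("edge %d: %lld\n", i, edge_counter[i]);
}

int
main (void)
{
  atexit (dump_counters);
  for (int i = 0; i < 100; i++)
    {
      if (i % 3 == 0)
        edge_counter[0]++;      /* edge taken on the `then' branch */
      else
        edge_counter[1]++;      /* edge taken on the `else' branch */
    }
  return 0;
}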
Each module is associated with a file containing the collected data. A simple utility named gcov can be used to display the measured data, i.e., the coverage (hence the name).
In order to minimize the profiling overhead, not all edges are instrumented. A spanning tree of the CFG is found, and only the edges outside this spanning tree are instrumented. The execution counts of the edges in the spanning tree can then be computed by a simple inductive process, because the executions entering a basic block must equal the executions leaving it.
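A sketch of that inductive process (the example CFG, the solver and its data structures are illustrative assumptions, not GCC's actual implementation): whenever all but one edge incident to a block has a known count, flow conservation determines the remaining one.

#include <stdio.h>

#define N_BLOCKS 4            /* block 0 = entry, block 3 = exit */
#define N_EDGES  4
#define UNKNOWN  -1LL

/* Edges on the spanning tree were not instrumented, so their count
   starts out as UNKNOWN.  */
struct edge { int src, dst; long long count; };

static struct edge e[N_EDGES] = {
  { 0, 1, UNKNOWN },          /* tree edge */
  { 1, 2, 60 },               /* instrumented */
  { 1, 3, 40 },               /* instrumented */
  { 2, 3, UNKNOWN },          /* tree edge */
};

int
main (void)
{
  int progress = 1;
  while (progress)
    {
      progress = 0;
      /* Skip the entry and exit blocks; conservation holds for them
         only once a fake exit->entry edge is added.  */
      for (int b = 1; b < N_BLOCKS - 1; b++)
        {
          long long in = 0, out = 0;
          int unknown = -1, unknown_in = 0, n_unknown = 0;
          for (int i = 0; i < N_EDGES; i++)
            {
              if (e[i].dst == b)
                {
                  if (e[i].count == UNKNOWN)
                    { n_unknown++; unknown = i; unknown_in = 1; }
                  else
                    in += e[i].count;
                }
              else if (e[i].src == b)
                {
                  if (e[i].count == UNKNOWN)
                    { n_unknown++; unknown = i; unknown_in = 0; }
                  else
                    out += e[i].count;
                }
            }
          /* Executions entering the block equal executions leaving it,
             so a single unknown incident edge is determined.  */
          if (n_unknown == 1)
            {
              e[unknown].count = unknown_in ? out - in : in - out;
              progress = 1;
            }
        }
    }
  for (int i = 0; i < N_EDGES; i++)
    printf ("edge %d->%d: %lld\n", e[i].src, e[i].dst, e[i].count);
  return 0;
}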
During development some code may be changed, and the profiling data involving this code must therefore be invalidated. Obviously there is no need to drop the entire profile, only the data for the affected functions. The most important task is detecting the changes.
In order to achieve this, we need some more specific information about the instrumented functions, which must be stored both in the object file and in the file containing the data. This must include the offset and number of instrumented edges and the function name, just to make sure the data are not garbled. To identify the particular code of a function, we add a CFG checksum. Because the code generated with and without instrumentation differs, we cannot calculate the checksum from the RTL (otherwise all profiling data would become invalid when the code is finally compiled without instrumentation); therefore the checksum is calculated only from the CFG structure, which is the same in both cases.
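The idea can be sketched as follows; the hash function, traversal order and data structures here are illustrative assumptions and not the checksum gcc actually computes. Only structural properties of the CFG -- the number of blocks and each block's successor list -- enter the hash, so the result is the same with and without instrumentation.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical, simplified CFG representation.  */
struct basic_block { int n_succs; const int *succs; };
struct cfg { int n_blocks; const struct basic_block *blocks; };

/* Checksum computed purely from the CFG structure; the generated RTL
   differs between instrumented and non-instrumented builds, but the
   shape of the CFG does not, so this value stays comparable.  */
static uint32_t
cfg_checksum (const struct cfg *cfg)
{
  uint32_t h = 2166136261u;            /* FNV-1a style hash, as an example */
  for (int b = 0; b < cfg->n_blocks; b++)
    {
      h = (h ^ (uint32_t) cfg->blocks[b].n_succs) * 16777619u;
      for (int s = 0; s < cfg->blocks[b].n_succs; s++)
        h = (h ^ (uint32_t) cfg->blocks[b].succs[s]) * 16777619u;
    }
  return h;
}

int
main (void)
{
  static const int succs0[] = { 1 };
  static const struct basic_block blocks[] = { { 1, succs0 }, { 0, 0 } };
  const struct cfg f = { 2, blocks };
  printf ("cfg checksum: %u\n", (unsigned) cfg_checksum (&f));
  return 0;
}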
The data file consists of several blocks, each representing a single run. Each block consists of a list of function names together with their checksums and counter values.
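One way to picture one such block; the field names and types are illustrative only, and the actual on-disk format used by gcc differs in detail.

/* Illustrative layout of one record inside a run block.  */
struct function_record
{
  const char *name;          /* function name, to catch mismatches   */
  unsigned    checksum;      /* CFG checksum of the function         */
  unsigned    n_counters;    /* number of instrumented edges         */
  long long  *counters;      /* one value per instrumented edge      */
};

/* One block per run of the program.  */
struct run_block
{
  unsigned                n_functions;
  struct function_record *functions;
};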
This approach has a few advantages -- most importantly, each run can simply be appended and later merged into the hybrid profile, and the data for individual changed functions can be invalidated without dropping the rest. The biggest disadvantage is that the data file grows after every run.
Problems may occur if the profiled program is multi-threaded and the target machine cannot atomically add constants to memory -- the results may get corrupted in this case. Additionally, even on machines supporting this operation we might run into problems if we applied some optimizations to the profiling code (it seems reasonable, for example, to keep some of these counters in registers while inside a loop to reduce the time overhead; if several threads did this simultaneously, the results would be unpredictable).
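The core of the problem is that a plain counter increment compiles to a load, an add and a store, and interleaved threads lose updates. The following pthread-based demonstration of the race is a made-up sketch, not gcc's generated code.

#include <pthread.h>
#include <stdio.h>

/* A shared edge counter updated without any synchronization, as the
   straightforward instrumentation would do.  */
static long long counter;

static void *
worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < 1000000; i++)
    counter++;               /* load, add, store: not atomic */
  return 0;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, 0, worker, 0);
  pthread_create (&t2, 0, worker, 0);
  pthread_join (t1, 0);
  pthread_join (t2, 0);
  /* Frequently prints less than 2000000 because increments are lost.  */
  printf ("counter = %lld\n", counter);
  return 0;
}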
The classical solution to this problem -- using locks -- is not feasible here, as locks are relatively time consuming (on the order of tens of instructions; this by itself could slow short loops down several times, and there are also unpleasant effects of the necessary function calls on other optimizations).
We solve this by using per-thread counters. Each thread has its own set of counters, created and initialized at its startup, and their values are propagated into a central copy at its exit (using a standard locking mechanism).
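A sketch of this scheme using POSIX thread-specific data follows; the function and key names are invented for illustration and the code gcc emits is different. Each thread allocates its own counter array on first use, and a destructor merges it into the central copy under a lock when the thread exits.

#include <pthread.h>
#include <stdlib.h>

#define N_COUNTERS 16        /* number of instrumented edges (example) */

static long long central_counters[N_COUNTERS];
static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t counters_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Called when a thread exits: add its private counters to the
   central copy under the lock, then free them.  */
static void
merge_counters (void *p)
{
  long long *local = p;
  pthread_mutex_lock (&central_lock);
  for (int i = 0; i < N_COUNTERS; i++)
    central_counters[i] += local[i];
  pthread_mutex_unlock (&central_lock);
  free (local);
}

static void
make_key (void)
{
  pthread_key_create (&counters_key, merge_counters);
}

/* Return the calling thread's private counter array, creating and
   zero-initializing it on first use.  */
static long long *
get_thread_counters (void)
{
  long long *local;
  pthread_once (&key_once, make_key);
  local = pthread_getspecific (counters_key);
  if (!local)
    {
      local = calloc (N_COUNTERS, sizeof *local);
      pthread_setspecific (counters_key, local);
    }
  return local;
}

/* The instrumentation on an edge then becomes something like:
     get_thread_counters ()[edge_index]++;  */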
This solution brings several technical complications.
As a result, the overhead of profiling has increased a bit -- in programs that do not use threads there is an additional per-function check for this, and the counters are no longer stored in a static table, so there is some overhead for computing their address; in threaded programs we additionally call a function at the beginning of each function to obtain the address of the thread-specific data. The total overhead increased about ... times without threads and about ... times with them. Of course it is still possible to use the old profiling code if your program does not use threads (or if you are willing to take the risk).
Some problems remain unsolved (for example, if the program ends while some threads are still running, the information from their counters is lost), and it would certainly be useful to decrease the overheads further (moving the reworked part of the loop optimizer after the profiling code generation helped a bit; moving the remaining part, strength reduction, could help a lot).
Jan Hubicka 2003-05-04