As shown earlier, our results are satisfactory overall, but some
optimizations do not perform as well as we hoped. Most important is the
tracer, which brings much smaller speedups () than those reported by the DEC
Alpha compiler team (
). Partly this can be attributed to differences between the
Athlon and Alpha architectures, as discussed already, but partly it may be due
to the architecture of GCC itself. Many optimization passes are basic-block
based and thus do not benefit from the extended superblock size. We want to
update as many optimizations as possible to be either superblock based or global.
We also plan to test the superblock scheduler, as it is one of the passes that derives the most benefit from trace formation. We have not yet integrated a working prototype into the source tree, and we want to see whether some speedup can be gained from it. The IMPACT compiler uses exactly such a scheme with great success on the Itanium architecture. We will attempt to merge the superblock scheduler into the mainline in case the DFA-branch, which targets a more powerful global scheduler, is not able to deliver one for the next release.
On the AMD Athlon, GCC's current scheduler implementation brings about
speedup. This is much lower than on other chips, because the Athlon has
an on-chip scheduler that largely replaces the compiler pass. We
expect that more global scheduling can bring further benefits, as the
on-chip scheduler is not able to perform it.
Important extensions are also desirable in our loop optimizer pass. We should re-implement the strength reduction pass and add more sophisticated unrolling heuristics and induction variable discovery. We also plan to add dependency analysis and array prefetch code generation. This work needs to be closely synchronized with the effort on the AST-branch, as the loop optimizing framework should be distributed between the high-level optimizers available there and the low-level RTL optimizers we are working on. For instance, we do not plan to re-implement the induction variable discovery code in the full strength present in the old loop optimizer; instead we will use the high-level optimizer to canonicalize loops and make our job easier. Some important information available at the high level (such as whether the counter overflows) cannot be derived from the RTL representation, so low-level optimizers cannot do as good a job as high-level ones in this case.
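To make the transformation concrete, the following C sketch illustrates what a strength reduction pass does to a typical array-summing loop. This is our own illustration, not GCC source code: the per-iteration multiplication in the address computation is replaced by a derived induction variable stepped by the element size.

```c
#include <assert.h>

/* Before strength reduction: the address of a[i] is recomputed on every
   iteration as base + i * sizeof(int), which involves a multiply. */
long sum_before(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += *(const int *)((const char *)a + (long)i * sizeof(int));
    return sum;
}

/* After strength reduction: the multiplication is gone; p is a derived
   induction variable incremented by the stride on every iteration. */
long sum_after(const int *a, int n)
{
    long sum = 0;
    const int *p = a;
    for (int i = 0; i < n; i++)
        sum += *p++;
    return sum;
}
```

Both functions compute the same result; the second form is what the optimizer would emit internally, trading a multiply per iteration for an add.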
An important step will be extending our profile implementation to the tree representation; this is necessary to implement function inlining heuristics without requiring the user to profile the program twice. Profile-based inlining has been reported as the most successful optimization among the profile-based changes made by the DEC Alpha compiler team. Again, we need to wait for the AST-branch to mature enough to make this task possible.
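The shape of such a heuristic can be sketched as follows. This is a hypothetical example of our own, not GCC's actual inlining logic; the size limits and hotness threshold are made-up illustration values.

```c
#include <assert.h>

/* Hypothetical profile-driven inlining decision (illustration only):
   inline tiny callees unconditionally, and small callees only at call
   sites whose profiled execution count marks them as hot. */

struct call_site {
    unsigned long count;   /* profiled execution count of this call */
    unsigned callee_insns; /* size estimate of the callee in insns */
};

static int want_inline(const struct call_site *cs, unsigned long max_count)
{
    const unsigned tiny_insns = 8, small_insns = 64; /* assumed limits */
    if (cs->callee_insns <= tiny_insns)
        return 1; /* always profitable regardless of profile */
    /* hot: call count is at least 80% of the hottest site observed */
    return cs->callee_insns <= small_insns
           && cs->count * 100 >= max_count * 80;
}
```

With a profile attached to the tree representation, such a decision can be made during inlining itself, rather than forcing the user to profile once for inlining and again for the low-level optimizers.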
Similarly, once the inter-procedural framework is available, we plan to extend our profile estimation to the inter-procedural level, allowing us to predict which functions belong to the hot spots of the program and which do not. This has been reported as reliable by Wu and Larus [4].
A lot of work needs to be done on the midlevel RTL representation. While our prototype works reliably on the i386 machine, the other machine descriptions need to be updated to handle it. We also need to implement some corner cases, such as string representation and function calls, which we do not represent in midlevel RTL yet. We plan to continue making midlevel RTL higher level to simplify the RTL generation pass, and to add multiple passes that gradually lower it, similarly to the SGI MIPSpro compiler design. For instance, we would like to have array subscripts represented directly in the midlevel RTL in the early stages to simplify dependency analysis.
An interesting direction for future research would be to implement path profiling in GCC. Path profiling is a scheme able to measure whether the outcome of a given branch depends on the path by which it has been reached. In case it does, the code can be duplicated (specialized) to assist the hardware branch predictor and to improve optimizations. Path schedules are also easier to update after other code specializing optimizations, such as loop unrolling and loop peeling. Similarly, we plan to add code to measure histograms of the number of iterations of each loop, so we can better decide which loops to peel or unroll and by how much.
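The loop iteration histograms can be sketched in C as follows. This is our own minimal illustration, not the planned GCC implementation: instrumentation records the trip count of each completed loop execution into a small bucketed histogram, and the optimizer later consults it when deciding whether peeling pays off.

```c
#include <assert.h>

/* One histogram per instrumented loop; trip counts above the last
   bucket saturate into it. */
#define HIST_BUCKETS 16
static unsigned long trip_hist[HIST_BUCKETS + 1];

/* Called by instrumentation once per completed execution of the loop. */
static void record_trip_count(unsigned n)
{
    trip_hist[n < HIST_BUCKETS ? n : HIST_BUCKETS]++;
}

/* Peeling k iterations pays off when at least `threshold` (0..1) of
   all recorded executions iterate at most k times. */
static int worth_peeling(unsigned k, double threshold)
{
    unsigned long total = 0, covered = 0;
    for (unsigned i = 0; i <= HIST_BUCKETS; i++) {
        total += trip_hist[i];
        if (i <= k)
            covered += trip_hist[i];
    }
    return total != 0 && (double)covered >= threshold * (double)total;
}
```

A simple edge profile only gives the average trip count; the histogram additionally reveals whether the distribution is bimodal, which is exactly the case where peeling a few iterations helps while full unrolling would not.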
We also plan to replace the GCC register move pass by extending our register coalescing, and to implement more optimizers. Our existing optimizers will also definitely benefit from some extra tuning and benchmarking, especially on non-i386 architectures.
Important cleanups to the control flow graph representation are possible as well. For instance, once all passes are updated to use control flow graph information, we can unlink independent basic blocks from the instruction chain. We should also avoid having notes and jump tables present between basic blocks. This requires high-volume changes, so we plan to synchronize them with mainline development to avoid the need to maintain a large patch for too long.
Jan Hubicka 2003-05-04