Software pipelining is a technique to improve the performance of a loop by overlapping the execution of several iterations.
The execution of a software-pipelined loop goes through three phases: prolog, kernel, and epilog. Software pipelining works
best if most of the time is spent in the kernel phase rather than in the prolog or epilog phases. This can happen only if
the trip count of a pipelined loop is large enough to amortize the overhead of the prolog and epilog phases. When a software-pipelined
loop is part of a loop nest, the overhead of filling and draining the pipeline is incurred for every iteration of the outer
loop. This paper introduces two novel methods to minimize the overhead of software-pipeline fill/drain in nested loops. In
effect, these methods overlap the draining of the software pipeline corresponding to one outer loop iteration with the filling
of the software pipeline corresponding to one or more subsequent outer loop iterations. This results in better instruction-level
parallelism (ILP) for the loop nest, particularly for loop nests in which the trip counts of inner loops are small. These
methods exploit Itanium™ architecture software pipelining features such as predication, register rotation, and explicit epilog
stage control to minimize the code-size overhead associated with such a transformation. However, the key idea behind these
methods is applicable to other architectures as well. These methods have been prototyped in the Intel optimizing compiler
for the Itanium™ processor. Experimental results on SPEC2000 benchmark programs are presented.
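
The amortization argument above can be made concrete with a small, idealized cycle model. The sketch below is illustrative only and is not taken from the paper's measurements: it assumes a pipelined inner loop with a fixed number of stages (`stages`) and a fixed initiation interval (`ii`), ignores any outer-loop code between inner-loop invocations, and treats the overlapped case as a best-case bound in which the drain of one outer iteration is completely hidden behind the fill of the next. The function and parameter names are hypothetical.

```python
def cycles_without_overlap(outer_trips: int, inner_trips: int,
                           stages: int, ii: int = 1) -> int:
    """Every outer iteration fills, runs, and drains the inner pipeline.

    A pipelined loop with `stages` stages needs (inner_trips + stages - 1)
    kernel-length steps of `ii` cycles each to complete all iterations.
    """
    return outer_trips * (inner_trips + stages - 1) * ii


def cycles_with_overlap(outer_trips: int, inner_trips: int,
                        stages: int, ii: int = 1) -> int:
    """Assumed best case: each drain overlaps the next fill completely,
    so the whole nest pays for one fill and one drain in total."""
    return (outer_trips * inner_trips + stages - 1) * ii


if __name__ == "__main__":
    # Hypothetical nest: 100 outer iterations, a 5-stage inner pipeline.
    for inner_trips in (2, 4, 16, 256):
        base = cycles_without_overlap(100, inner_trips, stages=5)
        best = cycles_with_overlap(100, inner_trips, stages=5)
        print(f"inner trips {inner_trips:4d}: {base:6d} -> {best:6d} cycles "
              f"(bound on speedup {base / best:.2f}x)")
```

Under these assumptions, a 5-stage pipeline with an inner trip count of 4 takes 800 cycles without overlap and 404 with it, whereas at an inner trip count of 256 the two figures differ by under 2%. This is consistent with the claim that the benefit is largest when inner trip counts are small.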