Computations from many scientific and engineering domains use irregularmeshes and/or sparse matrices. The codes expressing
these computations involve
irregular reductions. The main characteristics of irregular reduction loops are 1) elements of left-hand-side arrays may be incremented in multiple
iterations of the loop, but only using associative and commutative operations (these arrays are called
reduction arrays), 2) there are no loop carried dependencies, except on elements of reduction arrays, and 3) one or more arrays are accessed
using indirection arrays.
It is very challenging to efficiently parallelize codes involving irregular reductions, especially on large parallel machines.
Because of accesses through indirection arrays, communication and locality are hard to manage. Not only is the total communication
volume large, but the communication requirements typically cannot be determined at compile time. It is also hard to efficiently
allocate space for non-local elements. Particularly, there are no effective solutions for parallelization of adaptive irregular reductions. In an adaptive irregular reduction, the elements of the indirection arrays are modified after every
few iterations. This significantly increases the overhead associated with partitioning and runtime preprocessing routines
[7,11], which have been critical for achieving locality, communication efficiency, and effective buffer management.
Recently, there has been much interest in multithreaded architectures. A multiprocessor based upon a multithreaded architecture
supports multiple threads of execution on each processor. These architectures also support low-cost thread initiation, low-overhead
communication, and efficient communication and synchronization between threads on different processors. Multithreaded architectures
are considered a promising medium for scalable parallelization of irregular applications, where the frequent communication
and synchronization make parallelization hard on conventional parallel machines.
We have developed an execution strategy for irregular reductions on a multithreaded architecture. The key idea in our execution
model is that the frequency and volume of communication is independent of the contents of the indirection arrays. Thus, unlike
other approaches to scalable parallelization of irregular reductions, our approach does not require mesh partitioning [1],
array renumbering
Though the frequency and volume of communication is independent of the data distribution, other factors are not.