We explore several methods that use system-wide shared memory to improve the performance of MPI-IO, particularly for non-contiguous
file access. We introduce an abstraction called the datatype iterator that permits efficient, dynamic generation of (offset, length) pairs for a given MPI derived datatype. Combining datatype
iterators with overlapped I/O and computation, we demonstrate how a shared memory MPI implementation can utilize more than
90% of the available disk bandwidth (in some cases representing a 5× performance improvement over existing methods) even for
extreme cases of non-contiguous datatypes. We generalize our results to suggest possible parallel I/O performance improvements
on systems without global shared memory.
Keywords: Parallel I/O, shared memory, datatype iterator, non-contiguous access, MPI-IO
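
To illustrate what dynamic generation of (offset, length) pairs might look like, the following is a minimal sketch in Python. It is not the implementation described in this paper: the function name `vector_regions` and its parameters are hypothetical, and it covers only a strided layout loosely modeled on the arguments of `MPI_Type_vector` (count, block length, and stride, all in elements). The point is that pairs are produced lazily, one at a time, rather than being flattened into a full list up front.

```python
from typing import Iterator, Tuple


def vector_regions(count: int, blocklength: int, stride: int,
                   elem_size: int, base: int = 0) -> Iterator[Tuple[int, int]]:
    """Lazily yield (offset, length) byte pairs for a strided layout.

    Hypothetical helper for illustration only. `count` blocks of
    `blocklength` elements each, with block starts `stride` elements
    apart; `elem_size` is the element size in bytes and `base` is the
    starting byte offset.
    """
    length = blocklength * elem_size
    if count <= 0 or length <= 0:
        return  # empty layout: nothing to yield
    if stride == blocklength:
        # Blocks are adjacent, so the whole layout is one contiguous region.
        yield (base, count * length)
        return
    for i in range(count):
        # Each block starts `stride` elements after the previous one.
        yield (base + i * stride * elem_size, length)


if __name__ == "__main__":
    # 3 blocks of 2 eight-byte elements, block starts 4 elements apart.
    for offset, length in vector_regions(3, 2, 4, 8):
        print(offset, length)
```

With the arguments shown, the generator yields (0, 16), (32, 16), and (64, 16). Because the pairs are computed on demand, a consumer can issue I/O for one region while the next is still being generated, and memory use stays constant regardless of how many blocks the layout contains.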