The Extensible Markup Language (XML) has become the de facto standard for information representation and interchange on the
Internet. XML parsing is a core operation performed on an XML document for it to be accessed and manipulated. This operation
is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that
parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution because
future multicore processors will support hundreds of cores and offer a high degree of parallelism in hardware. We propose
a data parallel algorithm called ParDOM for XML DOM parsing that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks, and the chunks are parsed in parallel. In the second phase,
the partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM
node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme – each chunk can contain an arbitrary number of start
and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23] – a recently proposed parallel DOM parsing algorithm – on commodity multicore processors.
Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
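
The two-phase idea can be illustrated with a minimal sketch. This is not the ParDOM implementation: the names `Node`, `split_chunks`, `parse_chunk`, `link`, and `parse_document` are hypothetical, the input is assumed to be well formed and free of comments, CDATA sections, and `<` characters inside attribute values, and text content and attributes are ignored. For brevity the linking phase here runs sequentially, whereas the abstract describes it as parallel.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches start tags, end tags, and self-closing tags (illustrative, not a full XML lexer).
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


class Node:
    """Hypothetical minimal DOM node: a tag name and an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def split_chunks(xml, n):
    """Cut the document into roughly n pieces; every cut falls on a '<'.

    A chunk may hold any number of start and end tags, matched or not.
    """
    size = max(1, len(xml) // n)
    cuts, pos = [0], size
    while pos < len(xml):
        cut = xml.find("<", pos)
        if cut == -1:
            break
        cuts.append(cut)
        pos = cut + size
    cuts.append(len(xml))
    return [xml[a:b] for a, b in zip(cuts, cuts[1:])]


def parse_chunk(chunk):
    """Phase 1: build partial trees for one chunk, independently of the others."""
    events = []  # chunk-level items in order: ("node", Node) or ("end", name)
    stack = []   # elements opened in this chunk and not yet closed
    for closing, name, self_closing in TAG.findall(chunk):
        if closing:
            if stack:
                stack.pop()
            else:
                events.append(("end", name))  # closes an element from an earlier chunk
            continue
        node = Node(name)
        if stack:
            stack[-1].children.append(node)
        else:
            events.append(("node", node))     # parent lives in an earlier chunk
        if not self_closing:
            stack.append(node)
    return events, stack


def link(results):
    """Phase 2: stitch the partial trees together (sequentially in this sketch)."""
    root, stack = None, []
    for events, still_open in results:
        for kind, item in events:
            if kind == "end":
                if not stack or stack[-1].name != item:
                    raise ValueError(f"unexpected </{item}>")
                stack.pop()
            elif stack:
                stack[-1].children.append(item)
            elif root is None:
                root = item
            else:
                raise ValueError("multiple root elements")
        stack.extend(still_open)
    if stack:
        raise ValueError("unclosed element")
    return root


def parse_document(xml, n_chunks=4):
    chunks = split_chunks(xml, n_chunks)
    # Threads keep the sketch short; in CPython a process pool would be
    # needed for real CPU parallelism.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(parse_chunk, chunks))
    return link(results)


def to_tuple(node):
    """Helper for inspecting the result as nested (name, children) tuples."""
    return (node.name, [to_tuple(child) for child in node.children])
```

Calling `to_tuple(parse_document(xml, n_chunks=k))` should yield the same tree for any `k`, since chunk boundaries only change where unmatched tags are recorded, not how they are eventually paired.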