We describe an approach and tools for optimizing the performance of the spanning trees used by collective operations. We analyze
the allreduce operation using performance data collected at a lower level than in traditional monitoring systems. We calculate latencies
and wait times to detect load balance problems, find subtrees with similar behavior, break down costs, and compare the performance
of two spanning tree configurations. We evaluate different configurations and mappings of allreduce run
on clusters of different sizes and with different numbers of CPUs per host. We achieve a speedup of up to 1.49 for allreduce.
Monitoring overhead is low, and the analysis is simplified because many subtrees behave similarly. However, the calculated
values vary widely, and reconfiguring one part of a spanning tree may affect the performance of parts that were not changed.
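
As a rough illustration of how wait times can expose a load balance problem, the sketch below is a minimal Python example and not the tools described here. It assumes a hypothetical record, NodeSample, that holds the arrival timestamp of each contribution at one spanning-tree node, and it assumes that a contributor's wait time is the gap between its own arrival and the arrival of the last contributor. The names, the definition, and the threshold are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hypothetical record for one allreduce call at one spanning-tree node."""
    node: str
    arrivals: dict  # contributor name -> arrival timestamp in seconds


def wait_times(sample):
    """Assumed definition: how long each contributor waits for the last arrival."""
    last = max(sample.arrivals.values())
    return {name: last - t for name, t in sample.arrivals.items()}


def is_imbalanced(sample, threshold):
    """Flag the node if any contributor waits longer than the (assumed) threshold."""
    return max(wait_times(sample).values()) > threshold


# Made-up timestamps for illustration only, not measured data.
sample = NodeSample("n0", {"child_a": 1.0, "child_b": 1.5, "child_c": 4.0})
waits = wait_times(sample)
```

Under these assumptions, the contributor with zero wait time is the one that arrived last, so a large spread in wait times at a node points to the subtree that delays the operation.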