The DiamondCandy Algorithm for Maximum Performance Vectorized Cross-Stencil Computation
We present an advance in the search for a 4D time-space decomposition that yields an efficient vectorized implementation of cross-stencil computations. The new algorithm, called DiamondCandy, is built from the dependency and influence conoids of the scheme stencil. It offers high locality in terms of operational intensity, supports SIMD parallelism, and is easy to implement. Implementation details are given to illustrate how both instruction-level and data-level parallelism are exploited on many-core CPUs. Test runs show that the algorithm performs an order of magnitude better than the traditional approach and that its performance does not decline as the data size increases.
Stencil, LRnLA, wave equation, time skewing, many-core