Article Preview
TopIntroduction
Tridiagonal system solvers have been accredited as vital building block in many of the scientific and engineering applications, especially when dealing with problems in Dynamics (Zhou, 2017; Sakharnykh, 2009). Although there are many parallel and sequential algorithms available for solving tridiagonal system, a high performance, architecture independent performance optimal tridiagonal solver library is still challenging task. The sparseness with low operational intensity and parallel computational complexity of tridiagonal matrix demands custom hardware like field programmable gate array (FPGA) or graphical processing unit (GPU) for developing high performance solver.
Many efforts were carried out to develop performance optimal tridiagonal solver via software optimization, custom hardware platforms and hybrid parallel algorithms (Chang, 2014). For solving discretized Poisson equation on a rectangular domain, Hockney et al. developed Cyclic Reduction (CR) algorithm. The CR method performs odd-even reduction in two phase namely forward reduction & backward substitution that can be done in parallel. The authors also proposed a more efficient parallel version of CR where it performs only forward reduction called Parallel Cyclic Reduction (Hockney, 1981). Zhang et al did a comprehensive analysis about the three major tridiagonal solving algorithms namely CR, PCR and Recursive Doubling (RD) and proposed hybrid models of those algorithms. The authors made a comparative performance study with GPU implementation (Zhang et al., 2010). For fluid simulation, Sakharnykh has presented PCR-pThomas algorithm on GPU platform succeeding thread level Thomas algorithm (Sakharnykh, 2009). Wang et al. optimized PCR-pThomas algorithm on Intel MIC architecture by efficient exploitation of registers and in-core vectorization feature (Wang et al., 2016). The authors then introduced an improved variant namely register-PCR-half-pTthomas algorithm that boasts minimal computational cost and memory utilization (Wang et al., 2016). Tang et al proposed hybrid parallel tridiagonal solver that combines CR and partition method, an approach based on divide and conquer technique (Tang et al., 2014). While Quesada-Barriuso et al. compares and contrasts the parallel efficiency of fine-grained CR algorithm towards coarse grained Bondelli algorithm on both multicore CPU architectures and GPU platforms (Barriuso et al., 2011). It is explicit that the huge computational complexity of parallel high-performance tridiagonal solvers, demands for immense computational power which is justified by coprocessors like GPU, FPGA, MIC, etc.