Cell Processing for Two Scientific Computing Kernels

Meilian Xu, Parimala Thulasiraman, Ruppa K. Thulasiram
Copyright: © 2010 | Pages: 25
DOI: 10.4018/978-1-60566-661-7.ch014

Abstract

This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for a heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitations of current parallel systems that use single-core processors as building blocks. These limitations degrade the performance of applications with data-intensive and computation-intensive kernels, such as the Finite Difference Time Domain (FDTD) method and the Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under synchronization constraints. The FFT based on the indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, thereby reducing communication and synchronization overhead. The authors aim to unleash the Cell/B.E. by designing a parallel FDTD and a parallel ISN-based FFT that take into account its unique features, such as the eight SIMD processing units on a single chip and the high-speed on-chip bus.
Chapter Preview

Introduction

High performance computing (HPC) clusters provide increased performance by splitting computational tasks among the nodes of the cluster and have been widely used to study scientific computing applications. These clusters are cost-effective, scalable, and run standard software libraries such as MPI that are specifically designed for developing scientific application programs on HPC systems. They are also comparable to supercomputers in performance and availability. A typical example is the Beowulf cluster, which uses commercial off-the-shelf computers to produce a cost-effective alternative to a traditional supercomputer. Many of the machines in the top500.org list of the 500 fastest computers are pure clusters. One of the crucial issues in clusters is communication bandwidth; high-speed interconnection networks such as InfiniBand have paved the way for increased performance gains in clusters.

However, the development trend in clusters has been greatly influenced by hardware constraints, leading to three walls collectively called the brick wall (Asanovic et al., 2006). According to Moore's law, the number of transistors on a chip doubles every 18 to 24 months. However, processor clock speeds have not kept up with the growth in transistor counts, because of the physical constraints imposed on clock speed increases. For example, excessive heat dissipation demands complicated cooling techniques to prevent the hardware from deteriorating, and excessive power consumption deters customers from adopting new hardware, increasing the cost of commodity applications. Power consumption doubles with the doubling of operating frequency, leading to the first of the three walls, the power wall.

On the other hand, even with the processor frequency increases achieved so far, system performance has not improved significantly in comparison to the increased clock speeds. In many applications, the data size operated on by each processor changes dynamically, which in turn affects the computational requirements of the problem, leading to communication/synchronization latencies and load imbalance. Multithreading is one way of tolerating latencies. However, previous research (Thulasiram & Thulasiraman, 2003; Thulasiraman, Khokhar, Heber, & Gao, 2004) has indicated that although multithreading solves the latency problem to some extent by keeping all processors busy exploiting parallelism in an application, it has not been enough. Accessing data in such applications greatly reduces memory access efficiency because of non-uniform memory access patterns that are unknown until runtime. In addition, the gap between processor speed and memory speed is widening, as processor speed increases more rapidly than memory speed, leading to the second wall, the memory wall. To mitigate this problem, multiple memory levels are incorporated, which require exotic management strategies. However, the time and effort needed to extract the full benefit of these features detracts from the effort spent on real coding and optimization.

Furthermore, it has become very difficult for algorithm designers to fully exploit instruction level parallelism (ILP) and thereby utilize processor resources effectively to keep the processors busy. Solutions to this problem have relied on deep pipelines with out-of-order execution. However, this approach hurts the performance of the algorithm because of the high penalty paid on wrong branch predictions. This leads to the third wall, the ILP wall. These three walls force architecture designers to develop solutions that can sustain the requirements imposed by applications and address some of the problems posed by hardware in traditional multiprocessors.

Key Terms in this Chapter

Cooley-Tukey FFT Algorithm: The Cooley-Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common FFT algorithm. It re-expresses the DFT of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively, in order to reduce the computation time to O(N log N).
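
As a point of reference, the standard decomposition can be written out explicitly (this is the textbook identity, added here for illustration; it is not material from the chapter). With N = N1N2, index the input as n = N2n1 + n2 and the output as k = k1 + N1k2; then

    X_{k_1 + N_1 k_2} = \sum_{n_2=0}^{N_2-1} e^{-2\pi i\, n_2 k_1 / N} \left( \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2}\, e^{-2\pi i\, n_1 k_1 / N_1} \right) e^{-2\pi i\, n_2 k_2 / N_2},

so one size-N DFT becomes N2 inner DFTs of size N1, a multiplication by twiddle factors, and N1 outer DFTs of size N2.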

Finite Difference Time Domain: Finite Difference Time Domain (FDTD) is a numerical technique proposed by Yee in 1966 to solve Maxwell's equations in electromagnetic fields. It discretizes a 3D field into a mesh of cubic cells, or a 2D field into a grid of rectangular cells, using central-difference approximations. It is a time-stepping algorithm: each cell has electric field vector components and magnetic field vector components, which are updated at alternate half time steps in a leapfrog scheme in time.
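
To make the leapfrog scheme concrete, the following is a minimal 1D FDTD sketch in C, in normalized units with a Courant factor of 0.5; the grid size, time-step count, and hard source are illustrative assumptions, not the chapter's implementation:

    #define NX 200      /* number of spatial cells (assumed for illustration) */
    #define NSTEPS 500  /* number of time steps (assumed for illustration) */

    static double ez[NX];      /* electric field component, at integer grid points */
    static double hy[NX - 1];  /* magnetic field component, staggered half a cell */

    void fdtd_1d(void)
    {
        for (int t = 0; t < NSTEPS; t++) {
            /* Half step 1: update H from the spatial difference of E. */
            for (int i = 0; i < NX - 1; i++)
                hy[i] += 0.5 * (ez[i + 1] - ez[i]);

            /* Half step 2: update E from the spatial difference of H. */
            for (int i = 1; i < NX - 1; i++)
                ez[i] += 0.5 * (hy[i] - hy[i - 1]);

            /* Hard source: inject a short pulse at the left edge (assumed). */
            if (t < 30)
                ez[1] += 1.0;
        }
    }

Each cell's update touches only its immediate neighbours, which is the nearest-neighbour communication pattern referred to in the abstract.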

Discrete Fourier Transform: Discrete Fourier transform (DFT) is one of the specific forms of Fourier analysis which transforms one function in the time domain into another function in the frequency domain. A DFT decomposes a sequence of values into components of different frequencies. A direct way to compute a DFT of N points takes O(N²) arithmetical operations. The inverse DFT (IDFT) transforms one function in the frequency domain into another function in the time domain.
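
For reference, the standard definitions for a length-N sequence x_0, ..., x_{N-1} are (textbook notation, added for illustration):

    X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i\, nk/N}, \qquad x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{2\pi i\, nk/N}.

Evaluating the first sum directly for every k gives the O(N²) cost mentioned above.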

Fast Fourier Transform: Fast Fourier Transform (FFT) is an efficient algorithm to compute the DFT and its inverse. Instead of the O(N²) arithmetical operations required to compute a DFT of N points directly, an FFT computes the same result in only O(N log N) operations.
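
A minimal recursive radix-2 sketch in C illustrates the idea; it assumes N is a power of two and uses C99 complex arithmetic, and it is a teaching illustration rather than the chapter's ISN-based algorithm:

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    /* Recursive radix-2 Cooley-Tukey FFT; n must be a power of two. */
    void fft(double complex *x, size_t n)
    {
        if (n < 2)
            return;

        /* Decimation in time: split into even- and odd-indexed halves
         * (stack-allocated scratch; adequate for small n in a sketch). */
        double complex even[n / 2], odd[n / 2];
        for (size_t i = 0; i < n / 2; i++) {
            even[i] = x[2 * i];
            odd[i]  = x[2 * i + 1];
        }
        fft(even, n / 2);
        fft(odd, n / 2);

        /* Butterfly: combine the half-size transforms with twiddle factors. */
        const double pi = acos(-1.0);
        for (size_t k = 0; k < n / 2; k++) {
            double complex w = cexp(-2.0 * I * pi * (double)k / (double)n);
            x[k]         = even[k] + w * odd[k];
            x[k + n / 2] = even[k] - w * odd[k];
        }
    }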

Single Instruction Multiple Data: Single Instruction Multiple Data (SIMD) is a technique used to achieve data-level parallelism: the same instruction is applied to multiple data elements. It is one category of processor architecture in Flynn's taxonomy. SIMD was first popular in large-scale supercomputers with vector processors. Now, smaller-scale SIMD operations have become widespread in general-purpose computers, such as the SSE instructions in regular Intel processors and the SIMD instruction set of the Cell/B.E. processors.
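
For example, one SSE instruction adds four single-precision floats at once. A tiny C sketch using the x86 intrinsics from <xmmintrin.h> (the function name and the assumption that n is a multiple of 4 are illustrative simplifications):

    #include <xmmintrin.h>

    /* Add two float arrays four lanes at a time; n assumed a multiple of 4. */
    void add_f32(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats from a */
            __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats from b */
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* out[i..i+3] = a + b  */
        }
    }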

Multi-Core Processor: A multi-core processor combines two or more independent cores (normally CPUs) into a single package composed of a single die, or of multiple dies packaged together. It is also called a chip-level multiprocessor (CMP) and implements multiprocessing in a single physical package. If the cores are identical, the processor is called a homogeneous multi-core processor, such as the AMD Opteron dual-core processor. If the cores are not the same, the processor is called a heterogeneous multi-core processor, such as the Cell/B.E. processor.

Indirect Swap Network: The Indirect Swap Network (ISN) is an improvement on the Cooley-Tukey butterfly network. It aims to halve the data communication of the traditional Cooley-Tukey network by introducing inter-processor permutations when implemented on parallel systems under the message passing model.

Cell Broadband Engine Architecture: The Cell Broadband Engine Architecture, abbreviated CBEA or Cell/B.E., or called Cell for short, is a microprocessor architecture jointly developed by Sony, Toshiba, and IBM. It is a heterogeneous multi-core architecture that combines one general-purpose Power architecture core, the PPE (Power Processor Element), with eight streamlined coprocessing elements, the SPEs (Synergistic Processor Elements). The PPE supports the operating system and is mainly used for control tasks. The SPEs support SIMD (Single Instruction Multiple Data) processing. Each SPE has a 128-entry file of 128-bit registers and 256KB of local memory (called the local store) for both instructions and data. The PPE, the SPEs, and the memory subsystem are connected by an on-chip bus, the Element Interconnect Bus (EIB). Cell is designed as a general-purpose high-performance processor to bridge the gap between conventional desktop processors and more specialized high-performance processors. It has been installed in the Sony PlayStation 3.
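
On the SPE side, data must be staged into the local store explicitly by DMA before it can be used. The following is a hedged sketch of that pattern, assuming the Cell SDK's <spu_mfcio.h> interface; the chunk size, buffer name, and effective address parameter are illustrative assumptions:

    #include <spu_mfcio.h>

    #define CHUNK 4096  /* bytes per DMA transfer (assumed; must be a multiple of 16) */

    /* Buffer in the SPE's 256KB local store, aligned for efficient DMA. */
    static char ls_buf[CHUNK] __attribute__((aligned(128)));

    /* Pull one chunk from main memory (effective address `ea`) into local store. */
    void fetch_chunk(unsigned long long ea)
    {
        unsigned int tag = 1;                  /* DMA tag group (0..31) */

        mfc_get(ls_buf, ea, CHUNK, tag, 0, 0); /* issue DMA get: local store <- ea */
        mfc_write_tag_mask(1 << tag);          /* select which tag group to wait on */
        mfc_read_tag_status_all();             /* block until the transfer completes */
    }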
