Optimal Architectures and Algorithms for Mesh-Connected Parallel Computers with Separable Row/Column Buses

Mauricio J. Serrano and Behrooz Parhami

IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 10, pp. 1073-1080, Oct. 1993.

Abstract

For meshes with separable row/column buses, the authors show how semigroup and prefix computations can be performed with the same asymptotic time complexity, O(N^(1/8)), without providing buses for every row and every column, and they discuss the VLSI implications of this new architecture. They find that square meshes are not optimal for these algorithms and that the time complexity can be reduced by using rectangular meshes.

Introduction

Fig. 1. Different interconnection schemes for a mesh: (a) mesh with row and column broadcast buses, (b) hypermesh, (c) mesh with multiple global buses, and (d) mesh with separable row and column buses.

The Proposed Scheme

- An N^(5/8) x N^(3/8) mesh composed of N^(1/8) x N^(1/8) blocks of PEs (processing elements).
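As a worked example of these dimensions, the sketch below evaluates them for N = 2^(8m), so that every exponent yields an integer. The helper name `mesh_layout` and the parameter `m` are illustrative assumptions, not notation from the paper.

```python
def mesh_layout(m: int) -> dict:
    """Dimensions of the N^(5/8) x N^(3/8) mesh for N = 2**(8*m) PEs (assumed form of N)."""
    side = 2 ** m                       # N^(1/8): side length of one block
    n = side ** 8                       # N: total number of PEs
    rows, cols = side ** 5, side ** 3   # N^(5/8) rows, N^(3/8) columns
    blocks = (rows // side) * (cols // side)  # number of blocks, N^(3/4)
    return {"N": n, "rows": rows, "cols": cols,
            "block_side": side, "blocks": blocks}

layout = mesh_layout(2)
print(layout)
```

For m = 2 this gives 65536 PEs arranged as 1024 rows by 64 columns, split into 4096 blocks of 4 x 4 PEs.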

Semigroup Computation Algorithm

A semigroup computation can be described by a pair (⊗, S), where ⊗ is an associative binary operator and S is a set of data. The problem is to compute a_0 ⊗ a_1 ⊗ … ⊗ a_{N-1}.
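A minimal sequential illustration of this definition, using Python's standard `functools.reduce`. The function name `semigroup` and the sample data are assumptions made for the example. Addition and maximum are both associative, so either can play the role of the operator.

```python
from functools import reduce
import operator

def semigroup(op, items):
    """Fold items left to right with an associative binary operator op."""
    return reduce(op, items)

data = [3, 1, 4, 1, 5, 9, 2, 6]
total = semigroup(operator.add, data)   # 31
largest = semigroup(max, data)          # 9
print(total, largest)
```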

Step 1 (Block Reduction): Perform the semigroup computation within each block. O(N^(1/8)). The problem size is reduced from N to N^(3/4).

Step 2a (Row-Group Reduction): O(N^(1/8)). The problem size is reduced to N^(5/8).

Step 2b (Row-Band Reduction): Copy the partial results from the row-group leaders to the row-band leaders and perform the semigroup computation. O(N^(1/8)). The problem size is reduced to N^(1/2).

Step 3a (Column-Group Reduction): O(N^(1/8)). The problem size is reduced to N^(3/8).

Step 3b (Replication of Values): Broadcast the partial result of each leftmost column leader to all PE's connected to a row bus. Constant time.

Step 3c (Column-to-Row Transposition): Partition the N^(3/8) partial results into N^(1/4) groups of N^(1/8) elements each, so that each group can use a column. O(N^(1/8)). The problem size is reduced to N^(1/4).

Step 4a/b (0th-Row Reduction): Apply Steps 2a and 2b to Row 0. O(N^(1/8)).
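The shrinking problem size in the steps above can be traced with a small sequential model. This is only a sketch under simplifying assumptions: the values are held in a flat list ordered so that each group is a run of consecutive entries, and buses, leaders, and the physical mesh layout are not modeled. The names `reduce_chunks` and `hierarchical_semigroup` are hypothetical.

```python
from functools import reduce

def reduce_chunks(op, values, size):
    """Combine each run of `size` consecutive values into one partial result."""
    return [reduce(op, values[i:i + size]) for i in range(0, len(values), size)]

def hierarchical_semigroup(op, values, g):
    """Sequential model of the step schedule for N = g**8 values, g = N^(1/8)."""
    assert len(values) == g ** 8
    sizes = [len(values)]
    values = reduce_chunks(op, values, g * g)   # Step 1: one result per g x g block
    sizes.append(len(values))
    while len(values) > 1:                      # Steps 2a, 2b, 3a, 3c, 4a, 4b
        values = reduce_chunks(op, values, g)   # each shrinks the problem by g
        sizes.append(len(values))
    return values[0], sizes

g = 2                                           # N = 2**8 = 256
data = [str(i % 10) for i in range(g ** 8)]
concat = lambda a, b: a + b                     # associative, not commutative
result, sizes = hierarchical_semigroup(concat, data, g)
print(sizes)
```

With g = 2 the recorded sizes are 256, 64, 32, 16, 8, 4, 2, 1, which correspond to N, N^(3/4), N^(5/8), N^(1/2), N^(3/8), N^(1/4), N^(1/8), and 1. Step 3b only replicates values, so it does not appear as a reduction.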

Prefix Computation

Prefix computation is defined as S_i = a_0 ⊗ a_1 ⊗ … ⊗ a_i for 0 ≤ i ≤ N-1, where ⊗ is the associative operator of the semigroup computation. The input data items are stored in the mesh in increasing order of block number and in increasing (row-major) order of PE number.
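For reference, the same definition in sequential form, using the standard `itertools.accumulate`. The sample data is an assumption chosen for the example.

```python
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]
prefix_sums = list(accumulate(data, operator.add))  # [3, 4, 8, 9, 14]
prefix_max = list(accumulate(data, max))            # [3, 3, 4, 4, 5]
print(prefix_sums, prefix_max)
```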

Step 1 (Local Prefixes): Compute the local prefixes within each block.

Step 2 (Row Reduction): Generate row-group prefixes by broadcasting data using row bus sections. The rightmost column contains the row-band prefixes.

Step 3 (Column Reduction): Generate the column-group prefixes by broadcasting data on column bus sections (the rightmost column bus). Broadcast the resulting N^(3/8) prefixes within rows and perform the column-to-row transposition.

Step 4 (0th-Row Reduction): Do a prefix computation for the items in Row 0.

Steps 5-8 (Backward Phase): These constitute Phase 2 of the algorithm, which goes from global to local. Steps 1 through 4 are performed backwards to obtain the prefix in each PE.

Since every step of this algorithm takes O(N^(1/8)) time, the overall time complexity is O(N^(1/8)).
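The forward (local to global) and backward (global to local) phases can be sketched sequentially as a recursive grouped scan. This is an illustration of the two-phase idea only, not the paper's mesh algorithm: every level uses the same group size `g`, and buses and PE placement are not modeled. The name `hierarchical_prefix` is hypothetical.

```python
from itertools import accumulate

def hierarchical_prefix(op, values, g):
    """Prefixes of `values`: local scans, recurse on group totals, push offsets back."""
    assert g >= 2 and len(values) >= 1
    if len(values) <= g:
        return list(accumulate(values, op))
    groups = [values[i:i + g] for i in range(0, len(values), g)]
    local = [list(accumulate(grp, op)) for grp in groups]   # forward: local prefixes
    totals = [scan[-1] for scan in local]                   # one partial result per group
    group_prefix = hierarchical_prefix(op, totals, g)       # forward: next level up
    out = list(local[0])                                    # first group needs no offset
    for k in range(1, len(local)):                          # backward: global to local
        offset = group_prefix[k - 1]                        # prefix of all earlier groups
        out.extend(op(offset, x) for x in local[k])
    return out

concat = lambda a, b: a + b                                 # associative, not commutative
letters = list("abcdefghijklmnop")
prefixes = hierarchical_prefix(concat, letters, 2)
print(prefixes[-1])
```

Because the operator is only assumed to be associative, the offset is always applied on the left, which preserves the order of the operands.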

Extension to an Arbitrary Size Mesh for Semigroup Computation

Definition:

N^r x N^c: Rectangular mesh with N^r rows and N^c columns (r + c = 1, r ≥ c).

N^k x N^k: Size of a block of PEs connected only by local links (k < 1/2).

R: Number of hierarchical sectioning levels for row buses.

C: Number of hierarchical sectioning levels for column buses.

The optimal time complexity is O(N^(1/(2R+C+2))).

The optimal mesh is N^r x N^c = N^((R+C+1)/(2R+C+2)) x N^((R+1)/(2R+C+2)).

They conclude that the optimal mesh is always rectangular, because R+C+1 > R+1. They also check that R = C = 2 reproduces the preceding algorithm.

For R = C = L: O(N^(2L/(3L+2))). Compare this with the result reported by Carlson, who used a hierarchy of L global buses connecting all PE's and achieved a running time of O(N^(1/(L+2))).
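The closed forms for the optimal mesh and time complexity stated in this section can be evaluated exactly with rational arithmetic. The sketch below simply substitutes values of R and C into those expressions. The function name `optimal_exponents` is an assumption, and the (R, C) pairs other than (2, 2) are arbitrary samples.

```python
from fractions import Fraction

def optimal_exponents(R: int, C: int):
    """Return (r, c, t): mesh is N^r x N^c and time is O(N^t), per the formulas above."""
    d = 2 * R + C + 2
    r = Fraction(R + C + 1, d)   # row exponent
    c = Fraction(R + 1, d)       # column exponent
    t = Fraction(1, d)           # time exponent
    return r, c, t

for R, C in [(2, 2), (1, 2), (3, 1)]:
    r, c, t = optimal_exponents(R, C)
    print(f"R={R}, C={C}: mesh N^({r}) x N^({c}), time O(N^({t}))")
```

For R = C = 2 this gives r = 5/8, c = 3/8, and a time exponent of 1/8, matching the N^(5/8) x N^(3/8) mesh and the O(N^(1/8)) bound used earlier. In every case r + c = 1.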

[Created by: Lee-Chuan Fan

Date: Apr. 23, 1997]