Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems

Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems

Amber Roy-Chowdhury , Member, IEEE

and

Prithviraj Banerjee, Fellow,IEEE

IEEE Transactions On Computers, Vol 45, No. 11,November 1996

Abstract

Algorithm-based fault-tolerance(ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is a application of system-level diagnosis theory to the ABFT framework ,while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures.

Introduction

We devise a general method for error location and correction for an algorithm executing on a multiprocessor system where up to t processors can be faulty . The ideas for error location are borrowed from system level diagnosis theory . The idea for error correction is based on weighted checksums .However, since we decouple the error location and correction problems , we are able to give a general procedure for recovering from errors introduced by t faulty processors.

Location

We demonstrate it on three parallel numerical algorithms—matrix multiplication , QR factorization , and Gaussian elimination .The location method is directly derived from the theory of system level diagnosis and is well known that a one-step t-diagnosable system must satisfy the following two constraints:

There must be at least 2t+1 nodes in the system.
Each node must be diagnosed by at least t other nodes.

Definition 1. A system is a directed graph G=(V,E) where an edge E exists from a vertex V to a vertex V if and only if j=(I+m) mod n where n is the number of vertices in G, is an integer and m = 1,2,…,t.

where n=2t+1 and and n are relatively prime.

An example of a system with n =5 ,i.e. , an optimal configuration for one-step 2-fault diagnosability , is shown in Fig.1.

An ABFT system may be modified to perform t-fault location as follows.

The set of p processors is grouped into p/(2t+1) disjoint sets consisting of 2t+1 processors each.

Proc’ =======================è Proc

send message to Proc

Processor proc proceeds to use this data in its updates only if all t checks pass.

u t faults affect the checking processors & all the checks pass

è proc’ is faulty-free.

è the data is uncorrupted and still be used by proc .

u a fault affects proc’ & causes data to be corrupted & at most t faults occur & at least one check of the data is performed by a fault-free processor and is guaranteed to fail .

è computations stopped and a checking phase initiated .

Since the syndrome corresponding to any t or fewer faulty processors is guaranteed to be unique in a system, the t faulty processors may be identified. The normal ABFT checks are carried out as usual at appropriate points in the algorithm .

Recovery

The extension to perform recovery from t faults uses ideas from the theory of error correcting codes and is primarily applicable to algorithms which perform linear operations on data .

LEMMA 1. Vectors which are linearly independent over GF(q) (q prime)

,where GF(q) denotes the finite field with q elements , are also linearly independent over the field of real number .

A system with a total p processors :

p è P : 0 ~ (p-t-1) : Computation processors

: (p-t) ~ (p-1) : Check processors

: data owned by the ( m x n matrix)

: data owned by

Assume that all steps during the computation , the following invariant is maintained

= , 0jt-1 (1)

A suitable choice of weights to maximize chances of recovery following processor faults .

= (, ……,)

= (, ….., )

W= (2)

Created by : Duan-Yu Chen

Date : June 2, 1997

We have the following result

LEMMA 2. Let be a primitive element over GF(q) , where q is any prime greater than p-t . Let the entries of W be chosen so that = mod q . Consider any submatrix of W consisting of any c consecutive rows of W. Then every c columns of are linearly independent over the field of real numbers.

The following two corollaries follow immediately from Lemma 2.

Corollary 1. Every t columns of W are linearly independent over the field real numbers.

Corollary 2. C x C submatrix of if of full rank.

F : the set of faulty processors

: FP ,,…,

: F ,,…,

P-: ,,…,

-: ,,..,

| | =

|| =

= - (3)

0jt--1

(4)

Dim()=(t-) x

Both sides of (4) are premultiplied by to get the new system

è ()=(-) (5)

Created by : Duan-Yu Chen

Date : June 2,1997

An LU decomposition of the matrix () is then performed!

If Rank() is

è the matrix is of full rank .

è all unknown elements can be recovery by backsolves.

() is full rank ß à Rank()=

è In these cases , recovery from the pattern of faulty processors is possible!

n We now examine in what cases the matrix () is of full rank since in these cases , corrupted strips of C maybe be recovered.

Theorem 1. If only processors in P fail, then t faults can be tolerated by the algorithm for multiple fault recovery .

Theorem 2. The algorithm for recovery from multiple faults can tolerate any fault pattern involving 2(-1) or fewer processor.

Ü In the event of faults affecting the set P,and the total number of faults being less than 2(-1) , we are guaranteed to find consecutive nonfaulty peocessors in the set .

n Once the corrupted data has been recovered

è the computations of faulty processors à nonfaulty check processors.

n Expense of the recovery procedure :

Surviving processors à Computation processors

æ Check processors

Data à Computation processors

Weighted sums of the data à Check processors

Created by : Duan-Yu Chen

Date : June 2 , 1997

Example

No. of processors : 14

Required : 3 fault location & 2 fault recovery

Computation processors : ,….,

Check processors : ,,

Data distributed to computation processors:,..,

Weighted checksums :,,

= ++……..+

=+2+4+8+3+6+12+11+9+5+10

=+4+3+12+9+10++4+3+12+9

14 processors grouped into two sets : processors 0~6 & 7~13

See Fig. 2.

Processor 11 and 12 form a pair of consecutive nonfaulty check processors.Considering only the checksums on processors 11 and 12

è +=-(++++++++)

2+4=-(+8+3+6+12+11+9+5+10)

Ü Right hand side of the above equation consists of matrices computed by nonfaulty processors è and may be recovered .

Experiment Results

See Fig.3 Fig.4. Fig. 5. Fig. 6.

Summary

Ü Develop algorithm-based schemes for fault location and recovery multiprocessor systems.

Created by : Duan-Yu Chen

Date : June 2, 1997

Ü The weights chosen to compute the redundant data strips for the check processors are chosen from the entries in the parity check matrix of a Reed-Solomon code, and it is shown that this choice of weights guarantees recovery from 2(-1) or fewer faults.

Created by : Duan-Yu Chen

Date : June 2,1997