Algorithm-Based Fault Location and Recovery for Matrix Computations on Multiprocessor Systems
Amber Roy-Chowdhury , Member, IEEE
and
Prithviraj Banerjee, Fellow,IEEE
IEEE Transactions On Computers, Vol 45, No. 11,November 1996
Abstract
Algorithm-based fault-tolerance(ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is a application of system-level diagnosis theory to the ABFT framework ,while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures.
We devise a general method for error location and correction for an algorithm executing on a multiprocessor system where up to t processors can be faulty . The ideas for error location are borrowed from system level diagnosis theory . The idea for error correction is based on weighted checksums .However, since we decouple the error location and correction problems , we are able to give a general procedure for recovering from errors introduced by t faulty processors.
We demonstrate it on three parallel numerical algorithms—matrix multiplication , QR factorization , and Gaussian elimination .The location method is directly derived from the theory of system level diagnosis and is well known that a one-step t-diagnosable system must satisfy the following two constraints:
Definition 1. A system is a directed graph G=(V,E) where an edge
E exists from a vertex
V to a vertex
V if and only if j=(I+
m) mod n where n is the number of vertices in G,
is an integer and m = 1,2,…,t.
where n=2t+1 and and n are relatively prime.
An example of a system with n =5 ,i.e. , an optimal configuration for one-step 2-fault diagnosability , is shown in Fig.1.
An ABFT system may be modified to perform t-fault location as follows.
The set of p processors is grouped into p/(2t+1) disjoint sets consisting of 2t+1 processors each.
Proc’ =======================
è Procsend message to Proc
Processor proc proceeds to use this data in its updates only if all t checks pass.
u
t faults affect the checking processors & all the checks passè
proc’ is faulty-free.è
the data is uncorrupted and still be used by proc .u
a fault affects proc’ & causes data to be corrupted & at most t faults occur & at least one check of the data is performed by a fault-free processor and is guaranteed to fail .è
computations stopped and a checking phase initiated .
Since the syndrome corresponding to any t or fewer faulty processors is guaranteed to be unique in a system, the t faulty processors may be identified. The normal ABFT checks are carried out as usual at appropriate points in the algorithm .
The extension to perform recovery from t faults uses ideas from the theory of error correcting codes and is primarily applicable to algorithms which perform linear operations on data .
LEMMA 1. Vectors which are linearly independent over GF(q) (q prime)
,where GF(q) denotes the finite field with q elements , are also linearly independent over the field of real number .
A system with a total p processors :
p
Assume that all steps during the computation , the following invariant is maintained
A suitable choice of weights to maximize chances of recovery following processor faults .
W= (2)
Created by : Duan-Yu Chen
Date : June 2, 1997
We have the following result
LEMMA 2. Let be a primitive element over GF(q) , where q is any prime greater than p-t . Let the entries of W be chosen so that
=
mod q . Consider any submatrix
of W consisting of any c consecutive rows of W. Then every c columns of
are linearly independent over the field of real numbers.
The following two corollaries follow immediately from Lemma 2.
Corollary 1. Every t columns of W are linearly independent over the field real numbers.
Corollary 2. C x C submatrix of if of full rank.
F : the set of faulty processors
P-:
,
,…,
| | =
|| =
0j
t-
-1
(4)
Dim()=(t-
) x
Both sides of (4) are premultiplied by to get the new system
è
(
Created by : Duan-Yu Chen
Date : June 2,1997
An LU decomposition of the matrix (
) is then performed!
If Rank() is
è
the matrix is of full rank .è
all unknown elements can be recovery by backsolves.() is full rank
è
In these cases , recovery from the pattern of faulty processors is possible!
n
We now examine in what cases the matrix (
Theorem 1. If only processors in P fail, then t faults can be tolerated by the algorithm for multiple fault recovery .
Theorem 2. The algorithm for recovery from multiple faults can tolerate any fault pattern involving 2(-1) or fewer processor.
Ü
In the event of
n
Once the corrupted data has been recoveredè
the computations of faulty processors à nonfaulty check processors.n
Expense of the recovery procedure :Surviving processors
à Computation processorsæ
Check processors
Data
à Computation processorsWeighted sums of the data
à Check processors
Created by : Duan-Yu Chen
Date : June 2 , 1997
No. of processors : 14
Required : 3 fault location & 2 fault recovery
Computation processors : ,
….,
Check processors : ,
,
Data distributed to computation processors:,
..,
Weighted checksums :,
,
14 processors grouped into two sets : processors 0~6 & 7~13
See Fig. 2.
Processor 11 and 12 form a pair of consecutive nonfaulty check processors.Considering only the checksums on processors 11 and 12
è
2+4
=
-(
+8
+3
+6
+12
+11
+9
+5
+10
)
Ü
Right hand side of the above equation consists of matrices computed by nonfaulty processors è
See Fig.3 Fig.4. Fig. 5. Fig. 6.
Ü
Develop algorithm-based schemes for fault location and recovery multiprocessor systems.
Created by : Duan-Yu Chen
Date : June 2, 1997
Ü
The weights chosen to compute the redundant data strips for the check processors are chosen from the entries in the parity check matrix of a Reed-Solomon code, and it is shown that this choice of weights guarantees recovery from 2(Created by : Duan-Yu Chen
Date : June 2,1997