To illustrate how the real execution time is affected by similarity, length and alphabet size of input sequences, we perform some simulation experiments on random data. We implemented some of the algorithms by C++ on Code:Block 12.3 software with GCC 4.8.1 compiler and tested on a computer with 64-bit Windows 7 operating system, CPU clock rate of 3.60GHZ (Intel(R) Core(TM) i7-4790) and 4 GB of RAM.
Here, the experiment demonstration involves three edit distance algorithms, Wagn, Myers and Wu , and involve LCS algorithms HirsA, HirsB, HirsC, Hunt and Alli. In all experiments, we set the cost of each character insertion, deletion and replacement to 1, 1 and 2, respectively, for the edit distance algorithm.
For consistent presentation of our experimental results, we define a 4-tuple T(m, n, |Σ|, alg), where |A| = m , |B| = n, n/m = {1, 2, 5, 10} (the ratio), |Σ|= {2, 4, 20, 64, 256, 1000} and alg is the algorithm name. For example, T(500, 2, *, Myers) is used to illustrate the experimental results of algorithm Myers, with m = 500, n = 500*2 = 600 and |Σ|= {2, 4, 20, 64, 256, 1000}.
The performance of the Alli algorithm is outstanding in general cases, since it is a parallel method with the bit-vector string representation. The Wu algorithm performs very well when the similarity is high. The gap between the Myers and Wu algorithms becomes large when n=m is large, because more insertions are required in Myers with large n=m. HirsC is worse than HirsA and HirsA, since some extra cost is required in the divide-and-conquer process of HirsC. Hunt do not performs well when the similarity is low as we think, since low similarity in our definition is not equivalent to a small number of matched pairs. For example, in our data set, the average difference in the number of matches with Similarity(A,B) = 0.55 and Similarity(A,B) = 0.95 is 4%, where n = m = 100, |Σ| = 4.
Name | year | Author(s) | Time complexity | Keywords |
---|---|---|---|---|
Edit distance algorithms | ||||
Wagn | 1974 | Wagner and Fischer | O(mn) | DP |
Myers | 1986 | Myers | O(nd) | shortest edit script, contour |
Wu | 1990 | Wu et al. | O(np) | shortest edit script, contour | LCS algorithms |
HirsA | 1975 | Hirschberg | O(mn) | DP |
HirsB | 1975 | Hirschberg | O(mn) | linear space |
HirsC | 1975 | Hirschberg | O(mn) | divide and conquer |
Hunt | 1977 | Hunt and Szymanski | O((r+n)log n) | match list, binary search |
Alli | 1986 | Allison and Dix | O(mn/w) | bit string, parallel |
Each figure shows the execution time of an algrithm with length of string A(m) =500, length of string B(n)=500 and alphabet size = {2,4,20,64,256,1000}.
Each figure shows the execution time of an algrithm with length of string A(m) =500, length of string B(n)={500,1000,2500,5000} and alphabet size = 1000.
Each figure shows the execution time of all algrithms of length of string A(m) =500 and alphabet size = 1000 with one of length of string B(n)={500,1000,2500,5000}.