Smiley face

Data Generation


Description

In the MLCS problem, We define the similarity of three input sequences T、 A and B, denoted as Simlarity(T,A,B) which shown in below.

formula of Similarity(A, B)  
The formula of Similarity(T, A, B).

Using the formula of Similarity, we present a random method to generate specific similarity string data.

Method

The pseudorandom datasets are generated with various lengths = {1000, 2000, 5000, 10000}, similarities = {10% ~ 100%}, alphabet sizes = {4, 64, 1000}, and ratio γ = |A| / |T| = m / r = {0.1, 0.2, 0.3, 0.4, 0.5}, where |A| = m, |B| = n, |T | = r, m + n = r.

Suppose that a dataset with similarity p is desired to be generated. Our method for randomly generating experimental data is presented as follows.


  1. Step 1:

    Randomly generate T . Then, randomly put each character of T into either A or B in order.

  2. Step 2:

    Compute the MLCS (BMLCS) of A, B and T , and calculate their similarity λ.

  3. Step 3:

    If λ ∈ [p − ε, p + ε], output A, B and T , and stop. Otherwise, randomly mutate some characters of T , and go to Step 2. Here, we set ε = 1%.


C++ source code


int Generate(int TSIZE, int ASIZE, int BSIZE, int size,int TS)
{
    int index=1,MLCSlength,j,k;
    int ChangNum=0, Count=0, temp;
    string str,data;

    ChangNum = TSIZE - (TSIZE*TS/100);
    while(1)
    {
        vector T (TSIZE+1);
        vector A (ASIZE+1);
        vector B (BSIZE+1);
        for(int i = 1; i <= TSIZE; i++)
            T[i]= rand()% size + 1;
        j = 1; k = 1;
        for(int i = 1; i <= TSIZE; i++)
        {
            if (rand() % 2 == 0 && j < ASIZE || k > BSIZE)
            {
                A[j] = T[i];
                j++;
            }
            else
            {
                B[k] = T[i];
                k++;
            }
        }
        //Reduce the similarity, mutate some characters of T 
        for (int i = 1; i <= ChangNum; ++i) {
            T[rand() % TSIZE + 1]=rand()% size+1;
        }
        
        MLCSlength=MLCS_Bitpar(T, A, B, TSIZE, ASIZE, BSIZE,size,64);
        cout << index<< ":" << MLCSlength << " SIZE:"<< TSIZE<< " "<< ASIZE << " "<< BSIZE << endl;
		
        if(MLCSlength >= TS*TSIZE/100 - TSIZE*0.01 && MLCSlength <= TS*TSIZE/100 + TSIZE*0.01)  // ε = 1%
        {
            allwrite_to_file(T, A, B,TSIZE, ASIZE, BSIZE,size,TS,index);
            index++;
            if(MLCSlength != TSIZE)
                ChangNum-=(5*TSIZE/1000);
            else
                ChangNum+=(10*TSIZE/1000);
        }
        else if( MLCSlength > TS*TSIZE/100+TSIZE*0.01)
        {
            ChangNum+=(15*TSIZE/1000);
        }
        else
        {
            ChangNum-=(15*TSIZE/1000);
        }

        if(index == 101) 
            break;
    }

    return 0;
}

The files
  GenerateData.cpp