Our Algorithm_FastSearch(2021)

The pairwise sequence alignment is an essential tool in bioinformatics. For finding similar protein sequences in a database, we can use the global and local sequence alignment algorithms. Since performing pairwise sequence alignment requires very much time for all sequences in the database, we may apply the heuristic algorithms to reduce the required time, but slightly to decrease the accuracy. This thesis proposes an algorithm to quickly pick up feasible sequences with better alignment scores in the database. The proposed algorithm FastSearch can be adjusted by setting parameters to speed up the sequence searching. The FastSearch is compared with the GGSEARCH based on the optimal alignment score by the Needleman-Wunsch algorithm. In addition, the comparisons of alignment score with the GGSEARCH and the ClustalW are performed for various sequence identities. The experimental data sets are retrieved from the NCBI database, including the COVID-19 proteins, the EUMAT and the Human proteins. As the experimental results show, our FastSearch algorithm gets better accuracy and time efficiency than the GGSEARCH. In the comparison of alignment score for various identities, our BestJump algorithm also gets comparable scores and requires less time than the GGSEARCH and the ClustalW.


import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;
public class Main {
    public static void main(String[] args) throws FileNotFoundException, IOException{
        //this file is the main important experiment code, do all full test
        String queryName = "query.txt"; // query 
        String databaseName = "database.txt"; // database
        String scoreName = "blosum62.txt";  // score matrix
        int gapScore = -4; // gap penalty g
        int fix_jump = 20; // d_fixed
        int seed_num = 25; // n_seed
        int firstStepFilterSplitSize = 5; // alpha
        int secondStepFilterSplitSize = 2; // beta
        int finalReturnSize = 5; // gamma
		int[][] firstStepPara = {{1,4,2},{1,1,1}}; // set weight parameters for FastLCS stage  
        int[][] secondStepPara = {{0,1,0},{0,0,1},{0,1,1},{1,1,16}}; // set weight parameters for BestJump stage
        FastSearch fs = new FastSearch(databaseName,scoreName); // create FastSearch object and read database and score matrix
        fs.set_fixed(fix_jump); // set d_fixed
        fs.set_seeds(seed_num); // set n_seed
        fs.set_nwpara(gapScore); // set gap penalty
        FastaRead queryDB = new FastaRead(queryName); // read query sequences
        int queryDB_length = queryDB.dataName.length;
        FileWriter fw = new FileWriter("NW-fiveReturn.csv"); // write return answer to the file
        
        /*
        Start to perform FastSearch algorithm for database sequences and each query
        we will input each query to the FastSearch algorithm, and return the names of similar protein sequences
        */
        for (int i = 0; i < queryDB_length; i++) {
            // get the query
            String query = queryDB.dataSequence[i];
            // perform the FastSearch algorithm for query and database, and get the names of similar protein sequences
            String[] get = fs.fastSearchDBwithReverse(query,firstStepFilterSplitSize,secondStepFilterSplitSize,firstStepPara,secondStepPara);
            get = fs.fullNWSearchDB(query);
            // write the return answer, data format is "query : ans1,ans2,ans3,ans4,ans5"
            fw.write(queryDB.dataName[i]+" : ");
            for (int j = 0; j < finalReturnSize; j++) {
                fw.write(get[j]+",");
            }
            fw.write("\n");
        }
        fw.flush();
        fw.close();        
    }
}