Real Protein Sequences Dataset:
The data sets were obtained from the National Center for Biotechnology Information (NCBI) website.
Each data set is in FASTA format.
An example of the FASTA format for the human protein sequence "pdb|1HNL|A" is shown below.
In this example, the line beginning with ">" is the header: pdb|1HNL|A is the protein name, and the rest of that line is other information. The lines after the header contain the protein sequence.
General form:

    >"name" "other information"
    sequence...

Example:

    >pdb|1HNL|A Chain A, Human Lysozyme
    KVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKT
    PGAVNAAHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV
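The FASTA layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the loader used to build these data sets: the function names `read_fasta` and `split_header` are hypothetical, and it assumes that the protein name is everything on the header line up to the first space and that the remainder of the header is the "other information".

```python
import io
from typing import Iterator, TextIO, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a header (without the leading '>') into (name, other information).

    Assumption: the name ends at the first space on the header line.
    """
    name, _, description = header.partition(" ")
    return name, description.strip()


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for every record in a FASTA stream."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


# The example record from this page, held in memory instead of a file.
EXAMPLE = """>pdb|1HNL|A Chain A, Human Lysozyme
KVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKT
PGAVNAAHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
name, description, sequence = records[0]
print(name, "|", description, "|", len(sequence), "residues")
```

To read a downloaded data set instead, the same generator can be given an open file handle, for example `with open(path) as fh: records = list(read_fasta(fh))`, where `path` is wherever the file was saved.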
Download | Data set | Query size | Database size |
---|---|---|---|
download | COVID-19 | 100 | 5231 |
download | EUMAT | 100 | 36398 |
download | Human1 to Human5 | 50 | 3000 |
download | Each identity pairs | 1 | 1 |