EyeCon/OST

OST

These scripts implement the OST algorithm, a novel encoding algorithm for data compression. The full description is available at https://www.biorxiv.org/content/10.1101/2020.08.24.264366v1.full. The algorithm is named after Anas Al-okaily, Pramod Srivastava, and Abdelghani Tbakhi (Okaily-Srivastava-Tbakhi).

In brief, the algorithm scans the DNA data with non-overlapping windows. For each window it labels the sequence inside the window, appends that sequence to the bin corresponding to the label, and writes the label to a stream. Next, the bin labels are assigned codes based on, for instance, the number of sequences in each bin, and the stream of labels is compressed using these label codes. Then the sequences in each bin are compressed with a compression algorithm suited to the content of that bin. Finally, all the compression results (the bins and the stream of labels) are compressed together.
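The windowing and binning steps above can be sketched as follows. This is a minimal illustration, not the code from the scripts: the labelling rule `toy_label` is a hypothetical stand-in (the real rule is defined in the paper), the names `bin_windows`, `label_counts`, and `compressed_bins` are made up for this example, and Python's built-in `bz2` module stands in for the external compression tools. The step that turns label counts into label codes is left out.

```python
import bz2
from collections import Counter, defaultdict


def toy_label(window):
    """Hypothetical labelling rule (an assumption, not the paper's rule):
    the most frequent base in the window, ties broken alphabetically."""
    counts = Counter(window)
    return min(counts, key=lambda base: (-counts[base], base))


def bin_windows(sequence, window_length, label_fn=toy_label):
    """Split `sequence` into non-overlapping windows, append each window to
    the bin of its label, and record the labels in reading order."""
    bins = defaultdict(str)
    label_stream = []
    for start in range(0, len(sequence), window_length):
        window = sequence[start:start + window_length]
        label = label_fn(window)
        bins[label] += window          # concatenate with the bin's sequences
        label_stream.append(label)     # output the label into the stream
    return dict(bins), label_stream


bins, label_stream = bin_windows("AAACCCGTAAAT", window_length=4)

# Number of sequences per bin; label codes could be derived from these counts.
label_counts = Counter(label_stream)

# Compress each bin separately (bz2 is only a stand-in compressor here).
compressed_bins = {
    label: bz2.compress(seq.encode("ascii")) for label, seq in bins.items()
}
```

With this toy rule, the windows AAAC and AAAT land in the same bin and CCGT in another, so each bin tends to hold sequences of similar composition.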

Decompression first decompresses each bin and the stream of labels. The labels are then read sequentially from the stream, with one counter kept per bin. For each label read, the next sequence (of the same length as the non-overlapping window used during compression) is taken from the bin of that label at the position given by its counter, and that bin's counter is then incremented.
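The reconstruction loop described above might look like the sketch below. It assumes the bins and the stream of labels have already been decompressed into a dictionary of strings and a list. The function name `rebuild_sequence` and the sample data are hypothetical, and the counter here advances by one window length per read (a character offset), which is one possible reading of "increment the counter".

```python
def rebuild_sequence(bins, label_stream, window_length):
    """Rebuild the original sequence from decompressed bins and labels."""
    counters = {label: 0 for label in bins}   # one read position per bin
    pieces = []
    for label in label_stream:
        start = counters[label]
        # Take the next window-sized sequence from the bin of this label.
        # A final window shorter than window_length is handled by slicing.
        pieces.append(bins[label][start:start + window_length])
        counters[label] = start + window_length
    return "".join(pieces)


# Hypothetical decompressed inputs: two bins and the order of their labels.
bins = {"A": "AAACAAAT", "C": "CCGT"}
label_stream = ["A", "C", "A"]

restored = rebuild_sequence(bins, label_stream, window_length=4)
```

Because every bin is read strictly in the order it was written, the stream of labels is the only ordering information that needs to be stored.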

The seven scripts OST-DNA-bcm, OST-DNA-brotli, OST-DNA-bsc, OST-DNA-bzip2, OST-DNA-lrzip, OST-DNA-lzip, and OST-DNA-xz are Python scripts that apply the OST algorithm to DNA data; each uses the corresponding tool (bcm, brotli, bsc, bzip2, lrzip, lzip, or xz) to compress the bins. Each script takes a window length and a label length as parameters. These parameters are described at https://www.biorxiv.org/content/10.1101/2020.08.24.264366v1.full .

To check the commands used to compress the bins (line 149 in each script) and to decompress the bins (line 233 in each script), or to change them to suit your needs, edit the script directly.

----------------------------------------------- Preparation --------------------------------------------------------------------------------------------

First, you may convert the genome from FASTA format to a one-line genome, which removes the headers and every character other than A, C, G, T, and N (the match is case-sensitive). This can be done with the script filter_DNA_file_to_4_bases_and_N.py by running the command:

python filter_DNA_file_to_4_bases_and_N.py $file.fasta
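For readers who want to see what this filtering step amounts to, here is a rough sketch. It is not the code of filter_DNA_file_to_4_bases_and_N.py; the function name `filter_fasta_lines` is hypothetical, and it assumes that "case-sensitive" means lowercase bases are dropped along with all other characters.

```python
ALLOWED_BASES = set("ACGTN")  # uppercase only (assumed reading of "case-sensitive")


def filter_fasta_lines(lines):
    """Return a one-line genome: headers dropped, only A, C, G, T, N kept."""
    kept = []
    for line in lines:
        if line.startswith(">"):      # FASTA header line
            continue
        kept.append("".join(ch for ch in line if ch in ALLOWED_BASES))
    return "".join(kept)


example = [">chr1 test\n", "ACGTN\n", "acgtNNRY\n", "GG\n"]
one_line = filter_fasta_lines(example)
```

The function accepts any iterable of lines, so an open file handle can be passed in place of the example list.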

----------------------------------------------- Running the scripts ------------------------------------------------------------------------------------

Compression

python OST-DNA-xxx.py $genome $label_length $window_length

Note: the huffman package must be installed, because the scripts import it.

Decompression

python OST-DNA-xxx.py -d $genome.ost.7z

For questions, please email AA.12682@KHCC.JO (the email of the first author, Anas Al-okaily).
