OST

These scripts implement the OST algorithm, a novel encoding algorithm for data compression. The full description is available at https://www.biorxiv.org/content/10.1101/2020.08.24.264366v1.full. The algorithm is named after Anas Al-okaily, Pramod Srivastava, and Abdelghani Tbakhi (Okaily-Srivastava-Tbakhi).

In brief, the algorithm scans the DNA data with non-overlapping windows, labels the sequence within each window, concatenates it with the sequences in the bin corresponding to that label, and outputs the label to a stream. It then encodes the bin labels based on, for instance, the number of sequences in each bin, and compresses the stream of labels using the label codes. Next, it compresses the sequences in each bin using a compression algorithm suited to the content of that bin. Finally, it compresses all the results (bins and stream of labels) together.
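As a rough illustration of the windowing and binning step, here is a minimal sketch. It is not the code from the scripts: the function names `label_of` and `ost_compress_sketch` are hypothetical, the labelling rule (number of distinct bases in the window) is an assumption made only for this example, and `zlib` stands in for the external tools. The Huffman-style encoding of the labels and the final packaging step are omitted.

```python
import zlib
from collections import defaultdict


def label_of(window):
    # Hypothetical labelling rule, chosen only for illustration:
    # the number of distinct bases in the window. The real rule is
    # described in the paper.
    return str(len(set(window)))


def ost_compress_sketch(dna, window_length):
    bins = defaultdict(list)  # label -> sequences that received this label
    stream = []               # labels in the order the windows were read

    # Scan the data with non-overlapping windows.
    for start in range(0, len(dna), window_length):
        window = dna[start:start + window_length]
        label = label_of(window)
        bins[label].append(window)
        stream.append(label)

    # Compress the concatenated content of each bin separately.
    # zlib is a stand-in for bcm, brotli, bsc, bzip2, lrzip, lzip, or xz.
    compressed_bins = {
        label: zlib.compress("".join(seqs).encode("ascii"))
        for label, seqs in bins.items()
    }
    return compressed_bins, stream


compressed_bins, stream = ost_compress_sketch("AAAAACGTACGTAAAC", 4)
```

Because windows with the same label end up in the same bin, each bin tends to hold sequences of similar content, which is what lets a per-bin compressor be chosen to suit that content.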

Decompression first decompresses each bin and the stream of labels. It then reads the labels sequentially from the stream, keeping a counter for each bin. At each label, it takes the next sequence (of the same length as the non-overlapping window used during compression) from the bin of that label and then increments the counter of that bin.
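The reconstruction loop can be sketched as follows, assuming the bins and the stream of labels have already been decompressed into plain strings. The function name `ost_decompress_sketch` and the example labels are hypothetical, and the counter is kept here as a character offset into each bin rather than a sequence index.

```python
def ost_decompress_sketch(bins, stream, window_length):
    # One counter per bin: the offset of the next unread sequence.
    counters = {label: 0 for label in bins}
    pieces = []

    for label in stream:
        pos = counters[label]
        # Take the next window-length sequence from the bin of this label.
        # A shorter final window is handled naturally by the slice.
        pieces.append(bins[label][pos:pos + window_length])
        counters[label] = pos + window_length

    return "".join(pieces)


# Bins and labels as they might look after decompression (made-up example).
example_bins = {"1": "AAAA", "4": "ACGTACGT", "2": "AAAC"}
example_stream = ["1", "4", "4", "2"]
restored = ost_decompress_sketch(example_bins, example_stream, 4)
```

Only the window length, the bins, and the ordered labels are needed to rebuild the original sequence, since the order of the sequences inside each bin matches the order in which their labels appear in the stream.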

The seven scripts OST-DNA-bcm, OST-DNA-brotli, OST-DNA-bsc, OST-DNA-bzip2, OST-DNA-lrzip, OST-DNA-lzip, and OST-DNA-xz are Python scripts that apply the OST algorithm to DNA data, using the corresponding tools (bcm, brotli, bsc, bzip2, lrzip, lzip, and xz) to compress the bins. Each script takes a window length and a label length as parameters. These parameters are described at https://www.biorxiv.org/content/10.1101/2020.08.24.264366v1.full.

To check or change the command used to compress the bins (line 149 in each script) or to decompress the bins (line 233 in each script), edit the script directly.

----------------------------------------------- Preparation --------------------------------------------------------------------------------------------

First, you may convert the genome from FASTA format to a one-line genome, which removes the headers and any character other than A, C, G, T, and N (case-sensitive). This can be done with the script filter_DNA_file_to_4_bases_and_N.py by running the command:

python filter_DNA_file_to_4_bases_and_N.py $file.fasta
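For readers who want to see what this filtering amounts to, here is a minimal sketch. It is not the code of the filter script: the function name `filter_fasta_text` is hypothetical, and it assumes that "case-sensitive" means lowercase letters are dropped along with every other character outside A, C, G, T, and N.

```python
def filter_fasta_text(text):
    allowed = set("ACGTN")  # uppercase only (assumption, see above)
    kept = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        # Keep only the allowed bases from this line.
        kept.append("".join(ch for ch in line if ch in allowed))
    # Join everything into a single line with no line breaks.
    return "".join(kept)


one_line = filter_fasta_text(">chr1 demo\nACGTnacg\nNNAC-GT\n")
```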

----------------------------------------------- Running the scripts ------------------------------------------------------------------------------------

Compression

python OST-DNA-xxx.py $genome $label_length $window_length

Note: the huffman package must be installed, as the scripts import it.

Decompression

python OST-DNA-xxx.py -d $genome.ost.7z

For contact, please email AA.12682@KHCC.JO (the email of the first author Anas Al-okaily).
