Compressor-tool is a C-based command-line project focused on learning and implementing file compression internals.
At the moment, the fully working path is Huffman compression and Huffman decompression. The high-compression mode is reserved for future work.
At its core, compression is about reducing redundancy. Instead of storing every symbol with equal cost (8 bits), we: Give shorter codes to frequent data, give longer codes to rare data, this minimizes total storage.
[INPUT FILE] ↓ [READ BYTES] ↓ [FREQUENCY COUNT] ↓ [BUILD TREE] ↓ [GENERATE CODES] ↓ [ENCODE → BITSTREAM] ↓ [PACK → BYTES] ↓ [WRITE COMPRESSED FILE]
[READ HEADER] ↓ [REBUILD TREE] ↓ [UNPACK BITS] ↓ [TRAVERSE TREE] ↓ [RECONSTRUCT BYTES] ↓ [WRITE ORIGINAL FILE]
Many compression tools feel like a black box. This codebase is useful for newcomers because it shows the full pipeline clearly:
- Read bytes from a file
- Build a frequency table
- Build a Huffman tree
- Convert bytes to variable-length bit codes
- Pack bits into bytes for storage
- Rebuild and decode the original bytes
If you want to understand real compression logic (not only use a library), this is a good implementation to study.
- Working: Huffman compression via -c -n
- Working: Huffman decompression via -d
- Not implemented yet: high compression mode via -c -h (placeholder)
compressor help compressor -help
compressor -c -n input_file_name.extension
compressor -c -h input_file_name.extension
compressor -d some_file.compressed
compressor -github
compressor -about
Notes:
- Compression output name is generated from the part before the first dot in the input name, then .compressed is appended.
- Decompression restores the original file name stored in the compressed header.
A fixed-width byte encoding uses 8 bits for every symbol, even if some symbols are very common.
Huffman coding creates shorter bit codes for frequent symbols and longer codes for rare symbols, reducing total size on average.
Huffman builds an optimal prefix-free binary code from symbol frequencies.
- Prefix-free means no code is the prefix of another code.
- Because of prefix-free property, decoding is unambiguous while scanning bits from left to right.
- Count each symbol frequency.
- Create one leaf node per symbol.
- Repeatedly merge the two lowest-frequency nodes into a parent node.
- Continue until one root remains (the Huffman tree).
- Assign bits by traversal: left = 0, right = 1.
- Replace each symbol by its bit code and concatenate.
The code maps theory to specific components:
| Pipeline Stage | Where It Happens |
|---|---|
| CLI argument routing | src/main.c |
| Frequency counting | src/helper.h -> calculateFrequencies() |
| Store unique symbol frequencies | src/DataStructures/frequency_array.h |
| Store all original bytes | src/DataStructures/data_array.h |
| Build Huffman tree | src/helper.h -> createHuffmanTree(), src/DataStructures/tree_array.h -> buildTree() |
| Generate per-symbol code paths | src/helper.h -> createEncoding() |
| Serialize tree to file | src/helper.h -> storeTree() |
| Pack bits into bytes | src/helper.h -> bitPacking() |
| Write encoded payload | src/helper.h -> storeDataBufferInEncoding() |
| Rebuild tree at decode time | src/helper.h -> rebuildTree(), deserializeTree() |
| Decode bitstream | src/helper.h -> decode(), generateOutput() |
The .compressed output is a custom binary container with this order:
- nameLength (size_t)
- originalFileName bytes (nameLength bytes)
- compressionFlag (uint8_t)
- treeLength (uint32_t)
- serializedTree bytes (treeLength bytes)
- validBitCount (uint32_t)
- packedEncoding bytes (ceil(validBitCount / 8) bytes)
- 0 -> Huffman
- 1 -> high-compression mode (reserved)
Preorder format:
- Internal node: write 0
- Leaf node: write 1, then write the symbol byte
Example pattern:
0 0 1 75 1 89 ...
During decompression, the program:
- Reads the original file name from header
- Reads compression flag
- If flag is 0, reads and rebuilds the Huffman tree
- Reads validBitCount and packed encoding bytes
- Unpacks bits and traverses tree to recover original bytes
- Writes bytes to restored output file name
High-level complexity for Huffman pipeline:
- Frequency pass: O(n), where n is input bytes
- Tree stage: depends on current merge/sort strategy
- Encode/decode passes: linear in produced bitstream length
Memory usage grows with:
- input size (stored bytes)
- number of unique symbols
- encoded bit buffer
README.md
src/
main.c
helper.h
input.txt
DataStructures/
data_array.h
frequency_array.h
my_string.h
node.h
pair.h
tree_array.h
- High-compression mode is currently a placeholder and not production-ready.
- Header uses native size_t and binary writes, so cross-architecture portability needs standardization (endianness and fixed-width metadata).
- Current approach is educational and clear, but not yet tuned for large-file performance.
- Implement true LZ77/LZMA mode for -c -h.
- Replace repeated tree traversal per symbol with a direct code table.
- Stream encode/decode with less intermediate buffering.
- Standardize portable on-disk format (fixed-width fields and defined endianness).
Contributions are welcome. Issues and pull requests are appreciated.