A highly optimized x86_64 assembly program that transforms UTF-8 encoded text by applying polynomial transformations to Unicode characters.
Diakrytynizator reads UTF-8 text from standard input, applies a polynomial transformation to characters with Unicode values greater than 0x7F, and outputs the transformed text to standard output. ASCII characters (0x00-0x7F) remain unchanged.
The program accepts command-line arguments that define a polynomial:
./diakrytynizator a0 a1 a2 ... an
This defines the polynomial:
w(x) = an * x^n + ... + a2 * x^2 + a1 * x + a0
Transformation rule: For each Unicode character with value x > 0x7F, the program:
- Computes w(x - 0x80) mod 0x10FF80
- Outputs the character with Unicode value w(x - 0x80) + 0x80
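
The rule above can be sketched in Python as a reference model. This is not the assembly implementation; the names `eval_poly` and `transform` are illustrative, and the sketch assumes the value added to 0x80 is the polynomial result already reduced modulo 0x10FF80.

```python
MODULUS = 0x10FF80  # number of code points in the range 0x80..0x10FFFF


def eval_poly(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n modulo MODULUS using Horner's method.

    coeffs is [a0, a1, ..., an], in the same order as the command-line arguments.
    """
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % MODULUS
    return acc


def transform(code_point, coeffs):
    """Leave ASCII unchanged; map any other code point through the polynomial."""
    if code_point <= 0x7F:
        return code_point
    return eval_poly(coeffs, code_point - 0x80) + 0x80


coeffs = [1075041, 623420, 1]
print(hex(transform(ord("Ł"), coeffs)))  # prints 0x201e, the character „
```

Because the reduced value lies in 0..0x10FF7F, adding 0x80 always yields a code point in 0x80..0x10FFFF, so a transformed character is never ASCII.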
The program strictly validates UTF-8 encoding:
- Accepts Unicode values from 0x00 to 0x10FFFF
- Supports 1-4 byte UTF-8 sequences
- Rejects overlong encodings (only shortest form accepted)
- Returns exit code 1 on invalid input
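
As an illustration of these checks, here is a minimal Python sketch of a strict decoder. The function name `decode_utf8_strict` and the use of `ValueError` are assumptions made for the example; the real program signals failure through exit code 1. The sketch enforces only the rules listed above and does not treat surrogate code points specially, since the list does not mention them.

```python
def decode_utf8_strict(data):
    """Decode bytes into a list of code points; raise ValueError on bad input."""
    # Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    min_value = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}
    out = []
    i = 0
    while i < len(data):
        head = data[i]
        if head < 0x80:                 # 0xxxxxxx
            length, value = 1, head
        elif head & 0xE0 == 0xC0:       # 110xxxxx
            length, value = 2, head & 0x1F
        elif head & 0xF0 == 0xE0:       # 1110xxxx
            length, value = 3, head & 0x0F
        elif head & 0xF8 == 0xF0:       # 11110xxx
            length, value = 4, head & 0x07
        else:
            raise ValueError("invalid leading byte")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:        # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte")
            value = (value << 6) | (b & 0x3F)
        if value < min_value[length]:
            raise ValueError("overlong encoding")
        if value > 0x10FFFF:
            raise ValueError("code point out of range")
        out.append(value)
        i += length
    return out
```

For example, `b"\xc0\x80"` is rejected as an overlong encoding of 0x00, and a lone `b"\x80"` is rejected because a continuation byte cannot start a sequence.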
Compile with NASM and link:
nasm -f elf64 -w+all -w+error -o diakrytynizator.o diakrytynizator.asm
ld --fatal-warnings -o diakrytynizator diakrytynizator.o

echo "Zażółć gęślą jaźń…" | ./diakrytynizator 0 1
# Output: Zażółć gęślą jaźń…
# Exit code: 0

echo "Zażółć gęślą jaźń…" | ./diakrytynizator 133
# Output: Zaąąąą gąąlą jaąąą
# Exit code: 0

echo "ŁOŚ" | ./diakrytynizator 1075041 623420 1
# Output: „O”
# Exit code: 0

echo -e "abc\n\x80" | ./diakrytynizator 7
# Output: abc
# Exit code: 1

- Buffered I/O: Uses 1KB buffers for efficient reading and writing
- Modular arithmetic: All polynomial computations performed modulo 0x10FF80
- UTF-8 parsing: Hand-optimized byte-level UTF-8 decoding and encoding
- Parameter validation: Validates polynomial coefficients (non-negative integers, no leading zeros)
- Error handling: Comprehensive validation of input encoding and parameters
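
The parameter validation described above might look like the following Python sketch. The name `parse_coefficient` is hypothetical, and two details are assumptions rather than documented behavior: that a lone "0" is accepted while "01" is rejected, and that each coefficient is reduced modulo 0x10FF80 while its digits are read.

```python
MODULUS = 0x10FF80


def parse_coefficient(text):
    """Parse one decimal argument into a coefficient reduced modulo MODULUS."""
    if text == "" or any(c not in "0123456789" for c in text):
        raise ValueError("coefficient must be a non-negative decimal integer")
    if len(text) > 1 and text[0] == "0":
        raise ValueError("leading zeros are not allowed")  # assumed rule
    value = 0
    for c in text:
        # Reducing at every digit keeps the running value small, which is
        # what a fixed-width register implementation would need (assumption).
        value = (value * 10 + (ord(c) - ord("0"))) % MODULUS
    return value
```

With this sketch, `parse_coefficient("133")` returns 133, while `"-1"`, `"1a"` and the empty string raise `ValueError`.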
The program consists of several key components:
- parse: Converts decimal string arguments to integers with validation
- convert_args: Processes command-line arguments into polynomial coefficients
- calc_poly: Evaluates the polynomial using Horner's method
- utf_count_bytes: Determines UTF-8 character byte length from first byte
- utf_bytes_for_code: Calculates required bytes for a Unicode value
- load_head_byte/load_tail_byte: UTF-8 decoder
- add_char: UTF-8 encoder that writes transformed characters to output buffer
- .data: Constants for UTF-8 masks, prefixes, and modulo value
- .bss: Input/output buffers (1KB each)
- Stack-based polynomial coefficient storage
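
To make the encoding side concrete, here is a hedged Python model of what utf_bytes_for_code and add_char are described as doing. The Python functions only borrow the names for illustration: `encode_code_point` is a hypothetical stand-in that returns the bytes instead of writing them to an output buffer, and the prefix constants mirror the kind of values the .data section is said to hold.

```python
def utf_bytes_for_code(value):
    """Return how many bytes the shortest UTF-8 form of a code point needs."""
    if value <= 0x7F:
        return 1
    if value <= 0x7FF:
        return 2
    if value <= 0xFFFF:
        return 3
    return 4


def encode_code_point(value):
    """Encode one code point (0x00..0x10FFFF) as its shortest UTF-8 sequence."""
    n = utf_bytes_for_code(value)
    if n == 1:
        return bytes([value])
    prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]  # leading-byte markers
    tail = []
    for _ in range(n - 1):
        # Peel off 6 bits at a time, lowest bits first.
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # The remaining high bits go into the leading byte.
    return bytes([prefix | value] + tail[::-1])


print(encode_code_point(0x201E))  # prints b'\xe2\x80\x9e'
```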
- 0: Successful execution
- 1: Error (invalid parameters, malformed UTF-8, or encoding violation)
This program was developed as an assignment for the Operating Systems course at the University of Warsaw in 2021.