Closed
uniq -w counts UTF-8 characters instead of bytes
Component
uniq
Description
GNU uniq's -w N (check-chars) option compares the first N bytes of each line, while uutils uniq compares the first N UTF-8 characters. This causes different behavior when processing multibyte characters (e.g., CJK characters, emoji).
uutils uniq always uses UTF-8 character-based comparison regardless of locale, using Rust's .chars().take(N) method.
let total_chars = string_after_skip.chars().count();
// `-w N` => Compare no more than N characters
let slice_stop = self.slice_stop.unwrap_or(total_chars);
let slice_start = slice_stop.min(total_chars);
let mut iter = string_after_skip.chars().take(slice_start);
Test / Reproduction Steps
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_ALL=
$ printf "가나다라마\n가나다바사\n" > korean.txt
# -w 3
$ uniq -w 3 korean.txt
가나다라마
$ coreutils uniq -w 3 korean.txt
가나다라마
# -w 4
$ uniq -w 4 korean.txt
가나다라마
$ coreutils uniq -w 4 korean.txt
가나다라마
가나다바사
# -w 12
$ uniq -w 12 korean.txt
가나다라마
가나다바사
$ coreutils uniq -w 12 korean.txt
가나다라마
가나다바사
Impact
Scripts using -w with multibyte characters (CJK, emoji) produce different results between GNU and uutils.
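For illustration, here is a minimal standalone sketch of the two comparison strategies, run against the same two lines as in the reproduction. It is not the uutils code or a proposed patch; the helper names (`byte_prefix`, `char_prefix`, `same_by_bytes`, `same_by_chars`) are hypothetical, and the byte-based variant simply assumes GNU's `-w N` truncates to N bytes as described above.

```rust
/// First `n` bytes of a line (assumed GNU-style `-w N` behavior).
/// Works on raw bytes, so it may cut a multibyte character in half.
fn byte_prefix(line: &[u8], n: usize) -> &[u8] {
    &line[..n.min(line.len())]
}

/// First `n` Unicode scalar values, mirroring `.chars().take(N)`.
fn char_prefix(line: &str, n: usize) -> String {
    line.chars().take(n).collect()
}

/// True if both lines agree on their first `n` bytes.
fn same_by_bytes(a: &str, b: &str, n: usize) -> bool {
    byte_prefix(a.as_bytes(), n) == byte_prefix(b.as_bytes(), n)
}

/// True if both lines agree on their first `n` characters.
fn same_by_chars(a: &str, b: &str, n: usize) -> bool {
    char_prefix(a, n) == char_prefix(b, n)
}

fn main() {
    // Each Hangul syllable below is 3 bytes in UTF-8.
    let a = "가나다라마";
    let b = "가나다바사";
    for n in [3, 4, 12] {
        println!(
            "-w {n}: equal by bytes = {}, equal by chars = {}",
            same_by_bytes(a, b, n),
            same_by_chars(a, b, n)
        );
    }
}
```

The two strategies only disagree at `-w 4`: four bytes cover 가 plus the first byte of 나, which is identical in both lines, while four characters reach 라 versus 바. That matches the one case in the reproduction where the outputs differ.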