uniq: -w counts UTF-8 characters instead of bytes #10184

@sylvestre

Component

uniq

Description

GNU uniq's -w N (check-chars) option compares the first N bytes of each line, while uutils uniq compares the first N UTF-8 characters. This causes different behavior when processing multibyte characters (e.g., CJK characters, emoji).

uutils uniq always compares by character (Rust char, i.e. Unicode scalar value) regardless of locale, because it builds the comparison key with Rust's .chars().take(N):

let total_chars = string_after_skip.chars().count();

// `-w N` => Compare no more than N characters
let slice_stop = self.slice_stop.unwrap_or(total_chars);
let slice_start = slice_stop.min(total_chars);

let mut iter = string_after_skip.chars().take(slice_start);
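To make the difference concrete, here is a minimal sketch contrasting the two ways of taking an N-unit prefix. The helper names prefix_chars and prefix_bytes are hypothetical and are not part of uutils or GNU coreutils; the byte-based variant only models the GNU behavior as described in this issue.

```rust
/// Character-based prefix: what `.chars().take(n)` yields.
/// (Hypothetical helper, for illustration only.)
fn prefix_chars(line: &str, n: usize) -> String {
    line.chars().take(n).collect()
}

/// Byte-based prefix: the first `n` bytes of the line.
/// The slice may end in the middle of a multibyte character,
/// which is why it is returned as raw bytes rather than `&str`.
/// (Hypothetical helper modelling the byte semantics described above.)
fn prefix_bytes(line: &str, n: usize) -> &[u8] {
    let bytes = line.as_bytes();
    &bytes[..n.min(bytes.len())]
}

fn main() {
    let line = "가나다라마"; // each Hangul syllable is 3 bytes in UTF-8

    println!("bytes: {}, chars: {}", line.len(), line.chars().count());
    println!("first 4 chars: {}", prefix_chars(line, 4));
    println!("first 4 bytes: {:?}", prefix_bytes(line, 4));
}
```

With N = 4, the character-based key covers four full syllables, while the byte-based key covers one syllable plus the first byte of the next.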

Test / Reproduction Steps

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_ALL=
$ printf "가나다라마\n가나다바사\n" > korean.txt

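# -w 3: same output (GNU uniq compares 3 bytes = 1 character; uutils "coreutils uniq" compares 3 characters)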
$ uniq -w 3 korean.txt
가나다라마

$ coreutils uniq -w 3 korean.txt
가나다라마

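# -w 4: outputs differ (GNU compares 4 bytes, all within the shared prefix; uutils compares 4 characters, and the 4th differs)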
$ uniq -w 4 korean.txt
가나다라마

$ coreutils uniq -w 4 korean.txt
가나다라마
가나다바사

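# -w 12: same output (GNU compares 12 bytes = 4 characters; uutils compares up to 12 characters, i.e. the whole line)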
$ uniq -w 12 korean.txt
가나다라마
가나다바사

$ coreutils uniq -w 12 korean.txt
가나다라마
가나다바사
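The three cases above can be reproduced with a small simulation. This is only a sketch under the assumption that GNU compares raw bytes and uutils compares chars, as this issue describes; uniq_by, uniq_bytes and uniq_chars are hypothetical names, not functions from either implementation.

```rust
/// Collapse adjacent lines whose keys compare equal, keeping the first
/// line of each run (the core of what `uniq` does).
fn uniq_by<'a, K: PartialEq>(lines: &[&'a str], key: impl Fn(&'a str) -> K) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = Vec::new();
    for &line in lines {
        let is_dup = out.last().map_or(false, |&prev| key(prev) == key(line));
        if !is_dup {
            out.push(line);
        }
    }
    out
}

/// `-w w` with byte semantics (assumed model of the GNU behavior).
fn uniq_bytes<'a>(lines: &[&'a str], w: usize) -> Vec<&'a str> {
    uniq_by(lines, |l| &l.as_bytes()[..w.min(l.len())])
}

/// `-w w` with character semantics (model of the uutils behavior).
fn uniq_chars<'a>(lines: &[&'a str], w: usize) -> Vec<&'a str> {
    uniq_by(lines, |l| l.chars().take(w).collect::<String>())
}

fn main() {
    let lines = ["가나다라마", "가나다바사"];
    for w in [3usize, 4, 12] {
        println!(
            "-w {}: bytes -> {} line(s), chars -> {} line(s)",
            w,
            uniq_bytes(&lines, w).len(),
            uniq_chars(&lines, w).len()
        );
    }
}
```

Only -w 4 separates the two models here: 4 bytes end inside the second syllable, which both lines share, whereas 4 characters reach the first syllable that differs.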

Impact

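Scripts that use -w on input containing multibyte characters (CJK, emoji) can produce different results between GNU and uutils whenever the first N bytes and the first N characters select different prefixes.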
