uniq: -w counts UTF-8 characters instead of bytes #10184

# `uniq -w` counts UTF-8 characters instead of bytes

## Component

uniq

## Description

GNU uniq's `-w N` (`--check-chars`) option compares at most the first N **bytes** of each line, while uutils uniq compares at most the first N **UTF-8 characters**. The two implementations disagree whenever the compared prefix contains multibyte characters (e.g., CJK characters, emoji).

[uutils uniq](https://github.com/uutils/coreutils/blob/75b3bc02eb80a5f07c708096e8a2ea4fd03190f2/src/uu/uniq/src/uniq.rs#L175-L181) always compares by UTF-8 character, regardless of locale: it truncates each line with Rust's `.chars().take(N)`.

```rust
let total_chars = string_after_skip.chars().count();

// `-w N` => Compare no more than N characters
let slice_stop = self.slice_stop.unwrap_or(total_chars);
let slice_start = slice_stop.min(total_chars);

let mut iter = string_after_skip.chars().take(slice_start);
```
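For contrast, here is a minimal sketch of the two ways a `-w N` comparison key could be derived. The helper names `key_bytes` and `key_chars` are hypothetical and are not taken from either code base; `key_chars` mirrors the `.chars().take(N)` approach quoted above, while `key_bytes` approximates the byte-oriented behavior described for GNU.

```rust
/// Key under byte semantics: at most the first `n` bytes of the line.
/// The cut may fall in the middle of a multibyte character, so the
/// result is a byte slice rather than a `&str`.
fn key_bytes(line: &str, n: usize) -> &[u8] {
    let bytes = line.as_bytes();
    &bytes[..n.min(bytes.len())]
}

/// Key under character semantics: at most the first `n` Unicode
/// scalar values of the line, collected into an owned `String`.
fn key_chars(line: &str, n: usize) -> String {
    line.chars().take(n).collect()
}
```

Each Hangul syllable in the reproduction below occupies 3 bytes in UTF-8, so `key_bytes(line, 4)` covers one full character plus one byte of the next, whereas `key_chars(line, 4)` covers four full characters.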

## Test / Reproduction Steps

```bash
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_ALL=
$ printf "가나다라마\n가나다바사\n" > korean.txt

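# -w 3: GNU (`uniq`) compares 3 bytes (가); uutils (`coreutils uniq`) compares 3 characters (가나다). Same result.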
$ uniq -w 3 korean.txt
가나다라마

$ coreutils uniq -w 3 korean.txt
가나다라마

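# -w 4: GNU compares 4 bytes (가 plus the first byte of 나), so the lines match; uutils compares 4 characters (가나다라 vs 가나다바), so they differ.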
$ uniq -w 4 korean.txt
가나다라마

$ coreutils uniq -w 4 korean.txt
가나다라마
가나다바사

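# -w 12: GNU compares 12 bytes (4 characters); uutils compares 12 characters (the whole line). Same result.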
$ uniq -w 12 korean.txt
가나다라마
가나다바사

$ coreutils uniq -w 12 korean.txt
가나다라마
가나다바사
```
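The transcript above can be modeled with a small simulation. This is only a sketch: `Unit`, `key`, and `uniq_w` are hypothetical names, and the function implements just the adjacent-duplicate filtering with a width limit, not the real option handling of either tool.

```rust
/// Which unit `-w N` counts in this sketch.
#[derive(Clone, Copy)]
enum Unit {
    Bytes, // behavior described for GNU uniq
    Chars, // behavior described for uutils uniq
}

/// Comparison key: at most the first `n` bytes or characters of `line`.
fn key(line: &str, n: usize, unit: Unit) -> Vec<u8> {
    match unit {
        Unit::Bytes => line.bytes().take(n).collect(),
        Unit::Chars => line.chars().take(n).collect::<String>().into_bytes(),
    }
}

/// Keep the first line of each run of adjacent lines whose keys are equal.
fn uniq_w<'a>(lines: &[&'a str], n: usize, unit: Unit) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = Vec::new();
    for &line in lines {
        // Compare against the first line of the current run.
        let is_repeat = out
            .last()
            .map_or(false, |&prev| key(prev, n, unit) == key(line, n, unit));
        if !is_repeat {
            out.push(line);
        }
    }
    out
}
```

Calling `uniq_w(&["가나다라마", "가나다바사"], 4, Unit::Bytes)` keeps only the first line, while the same call with `Unit::Chars` keeps both, matching the `-w 4` divergence in the transcript.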

## Impact

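Scripts that use `-w` on input containing multibyte characters (CJK, emoji) can produce different results under GNU uniq and uutils uniq. In the reproduction above, `-w 4` collapses the two lines under GNU but keeps both under uutils.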
