lestring16/bestring16: UCS-2 string types #232

@unclesp1d3r

Summary

Add support for the magic(5) lestring16 (little-endian) and bestring16 (big-endian) UCS-2 string types. These are 16-bit Unicode strings used by NTFS, FAT, and other Windows-derived formats.

Real-world need

/usr/share/file/magic/filesystems (the system magic file shipped on macOS and most Linux distros) uses lestring16 in three rules covering NTFS bootstrap loader names and volume names. Because the type is not yet supported, loading that file fails mid-parse at the first of these rules:

1583: >>>0x002        lestring16  x   bootstrap %-5.5s
1586: >>>>0x0c        lestring16  x   \b%-2.2s
1782: >0x147c         lestring16  x   \b, volume name "%s"

This was discovered while fixing the UTF-8 / arithmetic-indirect-offset gaps in the PR on branch fix/loader-non-utf8-magic-files. With those parser fixes in place, this is the next blocker for loading the system magic file end-to-end.

Spec

magic(5):

lestring16 -- A two-byte UCS string in little-endian byte order.
bestring16 -- A two-byte UCS string in big-endian byte order.

Each character occupies two bytes; the reader stops at U+0000 (encoded as the 2-byte sequence 0x00 0x00) or at end-of-buffer. Comparison values in magic files are ASCII; the evaluator decodes the file bytes to a Rust String and compares against the (ASCII) target.
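The read rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: the Endianness enum is a stand-in, the signature of read_string16 is assumed, and two choices are assumptions that the spec text does not settle: the terminating NUL pair is counted in the consumed length, and a lone surrogate code unit (which char::from_u32 rejects) is replaced with U+FFFD.

```rust
/// Stand-in for the crate's endianness type (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
}

/// Decode a UCS-2 string from `buf`, stopping at a 0x00 0x00 pair or at
/// the end of the buffer. Returns the decoded text and the number of
/// bytes consumed (assumption: the terminator pair is included).
pub fn read_string16(buf: &[u8], endian: Endianness) -> (String, usize) {
    let mut out = String::new();
    let mut consumed = 0;

    // chunks_exact(2) silently ignores a trailing odd byte.
    for pair in buf.chunks_exact(2) {
        let unit = match endian {
            Endianness::Little => u16::from_le_bytes([pair[0], pair[1]]),
            Endianness::Big => u16::from_be_bytes([pair[0], pair[1]]),
        };
        consumed += 2;
        if unit == 0 {
            break; // U+0000 terminates the string
        }
        // Surrogate code units are not valid chars; substitute U+FFFD
        // (assumption: the issue leaves surrogate handling out of scope).
        let ch = char::from_u32(u32::from(unit)).unwrap_or(char::REPLACEMENT_CHARACTER);
        out.push(ch);
    }
    (out, consumed)
}
```

Returning the consumed byte count alongside the string is one way to feed the relative-offset anchor without re-scanning the buffer; whether the real evaluator does it this way is left open here.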

Implementation outline

  1. AST -- New TypeKind::String16 { endian: Endianness } variant.
  2. Parser -- parse_type_keyword + type_keyword_to_kind accept lestring16 / bestring16. Reuse existing string-value parsing for the comparison operand.
  3. Evaluator -- New read_string16 in evaluator/types/string.rs. Reads pairs of bytes, decodes via char::from_u32, stops on NUL pair or buffer end. Variable-width, so add an explicit arm to bytes_consumed for relative-offset anchor advance.
  4. Codegen -- serialize_type_kind arm.
  5. Strength -- new arm in calculate_default_strength.
  6. Property tests / output -- per GOTCHAS S2.1.
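Steps 1, 2 and 4 of the outline could look roughly like this. The function names come from the outline, but their signatures, the reduced TypeKind enum (with Byte standing in for all existing variants) and the Endianness type are assumptions made so the sketch compiles on its own.

```rust
/// Stand-in for the crate's endianness type (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
}

/// Reduced AST type enum; `Byte` stands in for the existing variants.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TypeKind {
    Byte,
    String16 { endian: Endianness },
}

/// Parser side: map a type keyword to its AST variant (assumed signature).
/// A bare "string16" is rejected, since only the prefixed forms are valid.
pub fn type_keyword_to_kind(keyword: &str) -> Option<TypeKind> {
    match keyword {
        "byte" => Some(TypeKind::Byte),
        "lestring16" => Some(TypeKind::String16 { endian: Endianness::Little }),
        "bestring16" => Some(TypeKind::String16 { endian: Endianness::Big }),
        _ => None,
    }
}

/// Codegen side: the new arm maps the variant back to its keyword
/// (assumed signature).
pub fn serialize_type_kind(kind: &TypeKind) -> &'static str {
    match kind {
        TypeKind::Byte => "byte",
        TypeKind::String16 { endian: Endianness::Little } => "lestring16",
        TypeKind::String16 { endian: Endianness::Big } => "bestring16",
    }
}
```

Carrying the endianness as a field keeps a single variant for both keywords, so the evaluator, strength and bytes_consumed arms each need only one new match arm.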

Acceptance criteria

  • parse_type_keyword("lestring16") and parse_type_keyword("bestring16") round-trip.
  • Eval test exercises both endian variants with an ASCII target value.
  • Eval test exercises the x (any-value) operator with %s format substitution (the form actually used in the filesystems file).
  • Bytes-consumed test confirms relative-offset anchor advances over the read string.
  • /usr/share/file/magic/filesystems parses end-to-end via rmagic --magic-file.
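For the %s criterion, the filesystems rules use printf-style specs such as %-5.5s (left-justify, minimum width 5, at most 5 characters). The sketch below shows what an eval test might expect from that substitution, using Rust's own formatter. The helper name format_left_padded is hypothetical, and it assumes precision is counted in characters of the decoded String rather than in bytes.

```rust
/// Hypothetical helper: render a decoded string the way `%-W.Ps` would,
/// i.e. truncate to `precision` characters, then left-justify and pad
/// with spaces to at least `width` characters.
pub fn format_left_padded(value: &str, width: usize, precision: usize) -> String {
    // `<` selects left alignment; `w$` and `p$` pull width and precision
    // from the named arguments.
    format!("{:<w$.p$}", value, w = width, p = precision)
}
```

With this helper, "BOOTMGR" under %-5.5s is cut to "BOOTM", while "NT" is padded to "NT" followed by three spaces.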

Out of scope

  • BOM sniffing / endianness auto-detection.
  • Proper UTF-16 surrogate-pair handling (UCS-2 only covers the BMP; libmagic itself does not handle surrogates).
  • string16 without an endian prefix (libmagic does not define this; only lestring16 and bestring16 are valid keywords).
