Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 52 additions & 10 deletions .serena/project.yml
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@


# list of languages for which language servers are started; choose from:
# al bash clojure cpp csharp
# csharp_omnisharp dart elixir elm erlang
# fortran fsharp go groovy haskell
# java julia kotlin lua markdown
# matlab nix pascal perl php
# powershell python python_jedi r rego
# ruby ruby_solargraph rust scala swift
# terraform toml typescript typescript_vts vue
# yaml zig
# php_phpactor powershell python python_jedi r
# rego ruby ruby_solargraph rust scala
# swift terraform toml typescript typescript_vts
# vue yaml zig
# (This list may be outdated. For the current list, see values of Language enum here:
# https://github.com/oraios/serena/blob/main/src/solidlsp/ls_config.py
# For some languages, there are alternative language servers, e.g. csharp_omnisharp, ruby_solargraph.)
Expand All @@ -16,8 +18,8 @@
# - For JavaScript, use typescript
# - For Free Pascal/Lazarus, use pascal
# Special requirements:
# - csharp: Requires the presence of a .sln file in the project folder.
# - pascal: Requires Free Pascal Compiler (fpc) and optionally Lazarus.
# Some languages require additional setup/installations.
# See here for details: https://oraios.github.io/serena/01-about/020_programming-languages.html#language-servers
# When using multiple languages, the first language server that supports a given file will be used for that file.
# The first language is the default language and the respective language server will be used as a fallback.
# Note that when using the JetBrains backend, language servers are not used and this list is correspondingly ignored.
Expand All @@ -31,16 +33,19 @@ encoding: "utf-8"
# whether to use project's .gitignore files to ignore files
ignore_all_files_in_gitignore: true

# list of additional paths to ignore in all projects
# same syntax as gitignore, so you can use * and **
# list of additional paths to ignore in this project.
# Same syntax as gitignore, so you can use * and **.
# Note: global ignored_paths from serena_config.yml are also applied additively.
ignored_paths: []

# whether the project is in read-only mode
# If set to true, all editing tools will be disabled and attempts to use them will result in an error
# Added on 2025-04-18
read_only: false

# list of tool names to exclude. We recommend not excluding any tools, see the readme for more details.
# list of tool names to exclude.
# This extends the existing exclusions (e.g. from the global configuration)
#
# Below is the complete list of tools for convenience.
# To make sure you have the latest list of tools, and to view their descriptions,
# execute `uv run scripts/print_tool_overview.py`.
Expand Down Expand Up @@ -87,7 +92,8 @@ initial_prompt: ""
# the name by which the project can be referenced within Serena
project_name: "libmagic-rs"

# list of tools to include that would otherwise be disabled (particularly optional tools that are disabled by default)
# list of tools to include that would otherwise be disabled (particularly optional tools that are disabled by default).
# This extends the existing inclusions (e.g. from the global configuration).
included_optional_tools: []

# list of mode names to that are always to be included in the set of active modes
Expand All @@ -108,3 +114,39 @@ default_modes:
# fixed set of tools to use as the base tool set (if non-empty), replacing Serena's default set of tools.
# This cannot be combined with non-empty excluded_tools or included_optional_tools.
fixed_tools: []

# time budget (seconds) per tool call for the retrieval of additional symbol information
# such as docstrings or parameter information.
# This overrides the corresponding setting in the global configuration; see the documentation there.
# If null or missing, use the setting from the global configuration.
symbol_info_budget:

# The language backend to use for this project.
# If not set, the global setting from serena_config.yml is used.
# Valid values: LSP, JetBrains
# Note: the backend is fixed at startup. If a project with a different backend
# is activated post-init, an error will be returned.
language_backend:

# line ending convention to use when writing source files.
# Possible values: unset (use global setting), "lf", "crlf", or "native" (platform default)
# This does not affect Serena's own files (e.g. memories and configuration files), which always use native line endings.
line_ending:

# list of regex patterns which, when matched, mark a memory entry as read‑only.
# Extends the list from the global configuration, merging the two lists.
read_only_memory_patterns: []

# list of regex patterns for memories to completely ignore.
# Matching memories will not appear in list_memories or activate_project output
# and cannot be accessed via read_memory or write_memory.
# To access ignored memory files, use the read_file tool on the raw file path.
# Extends the list from the global configuration, merging the two lists.
# Example: ["_archive/.*", "_episodes/.*"]
ignored_memory_patterns: []

# advanced configuration option allowing to configure language server-specific options.
# Maps the language key to the options.
# Have a look at the docstring of the constructors of the LS implementations within solidlsp (e.g., for C# or PHP) to see which options are available.
# No documentation on options means no options are available.
ls_specific_settings: {}
30 changes: 9 additions & 21 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -209,32 +209,19 @@ cargo test --doc # Test documentation examples
- **Operators**: `=` (equal), `!=` (not equal), `<` (less than), `>` (greater than), `<=` (less equal), `>=` (greater equal), `&` (bitwise AND with optional mask), `^` (bitwise XOR), `~` (bitwise NOT), `x` (any value)
- **Nested Rules**: Hierarchical rule evaluation with proper indentation
- **String Matching**: Exact string matching with null-termination and Pascal string (length-prefixed) support
- **Regex type**: Binary-safe regex matching via `regex::bytes::Regex`. Full flag support: `/c` (case-insensitive), `/s` (anchor advances to match-start instead of match-end), `/l` (scan window is measured in lines instead of bytes). Flags combine in any order (`regex/cs`, `regex/csl`, `regex/lc`). Numeric counts are honored: `regex/100` scans at most 100 bytes; `regex/1l` scans at most 1 line. Multi-line regex matching is always on (matching libmagic's unconditional `REG_NEWLINE`), so `^` and `$` match at line boundaries regardless of `/l`. Every scan window is capped at 8192 bytes (`FILE_REGEX_MAX`) regardless of the user's count.
- **Search type**: Bounded literal pattern scan via `memchr::memmem::find`; `search/N` caps the scan window to `N` bytes from the offset. The range is **mandatory** and stored as `NonZeroUsize`, so bare `search` and `search/0` are parse errors (matching GNU `file` magic(5)). Anchor advance follows GNU `file` semantics (match-end, not window-end) so relative-offset children resolve to the byte immediately after the matched pattern.

### Planned Features (v1.0+)

- Regex type: Pattern matching with binary-safe regex support
- Search type: Multi-pattern string searching

### Future Enhancement: Binary-Safe Regex Handling

> **Note:** The following is planned for future releases and is not yet implemented.

```rust
// Use regex crate with bytes feature for binary-safe matching
pub trait BinaryRegex {
fn find_at(&self, haystack: &[u8], start: usize) -> Option<Match>;
}

impl BinaryRegex for regex::bytes::Regex {
/* ... */
}
```
- Aho-Corasick multi-pattern search optimization for `search/` rules.
- `!:mime`/`!:ext`/`!:apple` directive evaluation (currently only `!:strength` is parsed).
- `use`/`name` named test directives for rule reuse.

## Current Limitations (v0.1.0)

### Type System

- No regex/search pattern matching
- 64-bit integer types: `quad`/`uquad`, `bequad`/`ubequad`, `lequad`/`ulequad` are implemented; `qquad` (128-bit) is not yet supported
- String evaluation reads until first NUL or end-of-buffer by default; `pstring` reads a length-prefixed Pascal string; `max_length: Some(_)` is supported internally but no dedicated fixed-length string parser syntax exists yet
- `pstring` supports 1-byte (`/B`), 2-byte big-endian (`/H`), 2-byte little-endian (`/h`), 4-byte big-endian (`/L`), and 4-byte little-endian (`/l`) length prefixes, plus the `/J` flag (stored length includes prefix width). All flags are combinable (e.g., `pstring/HJ`) and fully implemented.
Expand Down Expand Up @@ -317,7 +304,7 @@ sample.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV)

### Adding New Type Support

> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, and `PString`. Regex and search types are planned for future releases.
> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, `PString`, `Regex`, and `Search`. See "Current Limitations" for the remaining gaps in regex/search flag coverage.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

AGENTS still presents regex/search work as incomplete.

Line 307 refers readers to “remaining gaps in regex/search flag coverage”, and Lines 462-463 still put regex flag completeness/search range enforcement in a future phase. That now conflicts with the implementation this PR adds, so future contributors will get mixed guidance from the same document.

Also applies to: 462-463

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` at line 307, Update the AGENTS.md text that claims regex/search
coverage is incomplete: locate the sentence "Note: Currently implemented types
are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`,
`String`, `PString`, `Regex`, and `Search`" and the later lines mentioning regex
flag completeness/search range enforcement (the "Current
Limitations"/future-phase notes) and revise them to reflect that `Regex` and
`Search` support has been implemented by this PR—remove or reword the TODO
language so the document consistently states that regex/search flags and range
enforcement are implemented, and update the "Current Limitations" section to
only list remaining true gaps.


1. Extend `TypeKind` enum in `src/parser/ast.rs`
2. Add keyword parsing in `src/parser/types.rs` (`parse_type_keyword` and `type_keyword_to_kind`)
Expand Down Expand Up @@ -464,14 +451,15 @@ CI must pass before merge. Mergify merge protections enforce these checks. Bot P
- `nom`: Parser combinators
- `serde`: Serialization
- `clap`: CLI argument parsing
- `regex`: Pattern matching (used in tests; regex *type* for magic rules is planned)
- `regex`: Binary-safe pattern matching via `regex::bytes::Regex` for `TypeKind::Regex` evaluation
- `memchr`: SIMD-accelerated literal pattern search, used for `TypeKind::Search`
- `aho-corasick`: Multi-pattern search (planned, not yet added)

### Development Phases

1. **MVP (v0.1.0)** - CURRENT: Basic parsing and evaluation with byte/short/long/quad/string types, equality and bitwise AND operators, built-in rules for 10 common formats
2. **Enhanced Features (v0.2)**: Comparison operators (`>`, `<`), indirect offset improvements, strength-based rule ordering
3. **Advanced Types (v0.3)**: Regex type, search patterns
3. **Advanced Types (v0.3)**: Regex flag completeness (`/s`, proper `/l` line-count semantics, `regex/Nl`), search range enforcement, 8192-byte default regex range
4. **Full Compatibility (v0.4)**: Complete libmagic syntax support, all special directives, named tests
5. **Production Ready (v1.0)**: Stable API, complete documentation, 95%+ compatibility with GNU file

Expand Down
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -147,13 +147,14 @@ cfg-if = "1.0.4"
chrono = { version = "0.4.41", default-features = false, features = ["std", "clock"] }
clap = { version = "4.6.0", features = ["derive"] }
clap-stdin = "0.8.1"
clap_complete = "4.6.0"
clap_complete = "4.6.1"
ctrlc = { version = "3.5.2", features = ["termination"] }
env_logger = "0.11"
log = "0.4"
memchr = "2.8.0"
memmap2 = "0.9.10"
nom = "8.0.0"
regex = "1.12.3"
serde = { version = "1.0.228", features = ["derive"] }
serde_json = "1.0.149"
thiserror = "2.0.18"
Expand All @@ -171,7 +172,6 @@ insta = { version = "1.47.2", features = ["json"] }
nix = { version = "0.31.2", features = ["fs"] }
predicates = "3.1.4"
proptest = "1.11.0"
regex = "1.12.3"
tempfile = "3.27.0"

[[bench]]
Expand Down
Loading
Loading