Skip to content

refactor: overhaul AGENTS.md with PR review insights#6103

Merged
Xuanwo merged 3 commits intomainfrom
refactor/agents-md-overhaul
Mar 5, 2026
Merged

refactor: overhaul AGENTS.md with PR review insights#6103
Xuanwo merged 3 commits intomainfrom
refactor/agents-md-overhaul

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Mar 5, 2026

All rules are derived from analysis of ~1000 PR reviews to capture recurring review patterns as actionable guidelines.

Changes are as following:

  • Restructure root AGENTS.md: consolidate 3 overlapping overview sections into one, merge "Key Technical Details" + "Development Notes" + "Development tips" into organized Coding/Testing/Documentation Standards sections, deduplicate Python/Java commands (now link to subdirectory files)
  • Create rust/AGENTS.md with ~66 Rust-specific rules covering code style, API design, error handling, naming, testing, documentation, and lance-encoding hot path patterns
  • Enhance java/AGENTS.md with API design (Options pattern, JNI enum serialization), code style (JavaBean conventions), and documentation rules
  • Enhance python/AGENTS.md with Pythonic API design, PyO3 dataclass rules, type hints, and testing patterns
  • Enhance protos/AGENTS.md with proto3 optional semantics, structured message design, and documentation rules
  • Create docs/src/format/AGENTS.md with format spec documentation standards (pyarrow schemas, language-agnostic definitions, algorithm detail requirements)

Restructure and enhance AGENTS.md files based on analysis of ~1000 PR
reviews. Consolidate overlapping sections in root AGENTS.md, add
directory-specific guidelines for rust/, java/, python/, protos/, and
docs/src/format/, integrating coding standards extracted from recurring
review patterns.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 5, 2026

Review of PR #6103

Documentation-only PR — no functional code changes. Two issues worth flagging:

P1: unsafe set_len() guidance in rust/AGENTS.md is under-specified

The lance-encoding section recommends:

Pre-allocate single contiguous buffers; prefer Vec::with_capacity + unsafe { set_len() } over extend with dummy values when the buffer will be immediately overwritten.

This guideline recommends UB-prone unsafe code without adequate guardrails. While the codebase does use this pattern (e.g., lance-io/src/local.rs:195 with BytesMut), it requires a // Safety: comment and careful justification each time. The guideline as written could lead contributors to reach for unsafe set_len() too casually. Consider either:

  • Adding a qualifier like "requires a // Safety: justification and must be immediately followed by a write that fully initializes the buffer (e.g., read_exact)"
  • Or replacing with the safer buf.resize(len, 0) as the default recommendation, reserving set_len() for measured hot paths only

P1: rust/examples/ dropped from architecture listing

The rust/examples/ directory was listed in the old AGENTS.md but is absent from the new architecture section. If it still exists and is relevant, it should be included.

…e conflict

- protos/AGENTS.md: rewrite optional guidance to accurately describe
  proto3 presence semantics (no presence vs has_* tracking) instead of
  incorrectly framing it as required vs optional
- rust/AGENTS.md: scope "delete obsolete methods" rule to internal
  (pub(crate)/private) methods only, deferring to root AGENTS.md
  deprecation path for public API methods
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this collection. It feels almost like defining our own custom linter. It's like we have static analysis, dynamic analysis, and a new layer: inferential analysis.

Comment thread rust/AGENTS.md
- Use `PrimitiveArray::<T>::from(vec)` (zero-copy) instead of `from_iter_values(vec)` for Vec-to-PrimitiveArray conversion.
- Implement `Default` trait on config/options structs instead of standalone `default_*()` helpers.
- Place `#[cfg(test)] mod tests` as a single block at the bottom of each file — no production code after it.
- Place `use` imports at the top of the file, not inline within function bodies.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hurray 😆

Comment thread rust/AGENTS.md
- Use strongly-typed structs instead of `HashMap<String, String>` in APIs — convert to strings only at serialization boundaries.
- Keep `RowAddr` (physical fragment+offset) and `RowId` (stable logical identifier) as distinct types — never raw `u64` for both.
- Use `RowAddress` from `lance-core/src/utils/address.rs` instead of raw bitwise operations on row addresses.
- Use `RowAddrTreeMap`/`RoaringBitmap` instead of `Vec<Range<u64>>` for physical row selections.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is a bit of an "it depends" but it is a reasonable default

Comment thread rust/AGENTS.md Outdated
- Keep traits minimal — only core abstraction methods. Move helpers to standalone functions and config to struct fields.
- Get column/field types from schema metadata — never materialize data rows just to inspect types.
- Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats.
- Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about this rule.

Comment thread rust/AGENTS.md Outdated
- Get column/field types from schema metadata — never materialize data rows just to inspect types.
- Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats.
- Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk).
- Use Arrow's type-safe access (`ArrayAccessor` trait bounds, `as_*_array` helpers) instead of `arrow::compute::cast` + `downcast_ref`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe even prefer the _opt variants unless the data type has already been verified.

Comment thread rust/AGENTS.md Outdated
- Hoist loop-invariant conditionals out of hot loops — branch once outside, then use separate loop bodies or monomorphized variants.
- Pre-allocate single contiguous buffers; prefer `Vec::with_capacity` + `unsafe { set_len() }` over `extend` with dummy values when the buffer will be immediately overwritten.
- Use `spawn_cpu()` only at the async-to-CPU boundary (e.g., FSST, decompression, batch materialization) — never nest redundant `spawn_cpu()` calls.
- Use `OffsetView` instead of `borrow_to_typed_slice` for typed access to byte buffers — avoids cloning the entire buffer.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule is odd. borrow_to_typed_slice should not copy the buffer. We can probably strike this one.

Comment thread AGENTS.md
### Dependencies

- Keep `Cargo.lock` changes intentional; revert unrelated dependency bumps. Pin broken deps with a comment linking the upstream issue.
- Gate optional/domain-specific deps behind Cargo feature flags. Prefer separate crates for domain functionality (geo, NLP).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add another rule about avoiding extra dependencies if possible to keep the crates light? I've noticed that claude is sometimes a little eager to grab external crates.

Comment thread AGENTS.md Outdated
- **All bugfixes and features must have corresponding tests. We do not merge code without tests.**
- Use `rstest` (Rust) or `@pytest.mark.parametrize` (Python) for tests that differ only in inputs. Use `#[case::{name}(...)]` for readable case names.
- Replace `print()` in tests with `assert` — prints don't catch regressions.
- Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit, but I think this particular one shouldn't be language specific to Python.

Suggested change
- Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files.
- Extend existing tests instead of adding overlapping new ones. Add to existing test files.

- rust/AGENTS.md: add safety guardrails for unsafe set_len() guidance,
  default to buf.resize() and reserve unsafe for measured hot paths
- rust/AGENTS.md: remove incorrect OffsetView vs borrow_to_typed_slice
  rule (borrow_to_typed_slice does not copy)
- rust/AGENTS.md: remove LargeBinary/LargeUtf8 conversion rule (too
  prescriptive per maintainer feedback)
- rust/AGENTS.md: add _opt variant preference for Arrow type-safe access
- AGENTS.md: restore rust/examples/ in architecture listing
- AGENTS.md: add rule to prefer std/existing deps before new crates
- AGENTS.md: make "extend existing tests" rule language-agnostic
@Xuanwo Xuanwo merged commit ab1a1c1 into main Mar 5, 2026
@Xuanwo Xuanwo deleted the refactor/agents-md-overhaul branch March 5, 2026 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants