refactor: overhaul AGENTS.md with PR review insights#6103
Conversation
Restructure and enhance AGENTS.md files based on analysis of ~1000 PR reviews. Consolidate overlapping sections in root AGENTS.md, add directory-specific guidelines for rust/, java/, python/, protos/, and docs/src/format/, integrating coding standards extracted from recurring review patterns.
Review of PR #6103Documentation-only PR — no functional code changes. Two issues worth flagging: P1: unsafe set_len() guidance in rust/AGENTS.md is under-specified The lance-encoding section recommends:
This guideline recommends UB-prone unsafe code without adequate guardrails. While the codebase does use this pattern (e.g., lance-io/src/local.rs:195 with BytesMut), it requires a // Safety: comment and careful justification each time. The guideline as written could lead contributors to reach for unsafe set_len() too casually. Consider either:
P1: rust/examples/ dropped from architecture listing The rust/examples/ directory was listed in the old AGENTS.md but is absent from the new architecture section. If it still exists and is relevant, it should be included. |
…e conflict - protos/AGENTS.md: rewrite optional guidance to accurately describe proto3 presence semantics (no presence vs has_* tracking) instead of incorrectly framing it as required vs optional - rust/AGENTS.md: scope "delete obsolete methods" rule to internal (pub(crate)/private) methods only, deferring to root AGENTS.md deprecation path for public API methods
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
westonpace
left a comment
There was a problem hiding this comment.
I really like this collection. It feels almost like defining our own custom linter. It's like we have static analysis, dynamic analysis, and a new layer: inferential analysis.
| - Use `PrimitiveArray::<T>::from(vec)` (zero-copy) instead of `from_iter_values(vec)` for Vec-to-PrimitiveArray conversion. | ||
| - Implement `Default` trait on config/options structs instead of standalone `default_*()` helpers. | ||
| - Place `#[cfg(test)] mod tests` as a single block at the bottom of each file — no production code after it. | ||
| - Place `use` imports at the top of the file, not inline within function bodies. |
| - Use strongly-typed structs instead of `HashMap<String, String>` in APIs — convert to strings only at serialization boundaries. | ||
| - Keep `RowAddr` (physical fragment+offset) and `RowId` (stable logical identifier) as distinct types — never raw `u64` for both. | ||
| - Use `RowAddress` from `lance-core/src/utils/address.rs` instead of raw bitwise operations on row addresses. | ||
| - Use `RowAddrTreeMap`/`RoaringBitmap` instead of `Vec<Range<u64>>` for physical row selections. |
There was a problem hiding this comment.
I think this one is a bit of an "it depends" but it is a reasonable default
| - Keep traits minimal — only core abstraction methods. Move helpers to standalone functions and config to struct fields. | ||
| - Get column/field types from schema metadata — never materialize data rows just to inspect types. | ||
| - Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats. | ||
| - Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk). |
There was a problem hiding this comment.
I'm not entirely sure about this rule.
| - Get column/field types from schema metadata — never materialize data rows just to inspect types. | ||
| - Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats. | ||
| - Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk). | ||
| - Use Arrow's type-safe access (`ArrayAccessor` trait bounds, `as_*_array` helpers) instead of `arrow::compute::cast` + `downcast_ref`. |
There was a problem hiding this comment.
Maybe even prefer the _opt variants unless the data type has already been verified.
| - Hoist loop-invariant conditionals out of hot loops — branch once outside, then use separate loop bodies or monomorphized variants. | ||
| - Pre-allocate single contiguous buffers; prefer `Vec::with_capacity` + `unsafe { set_len() }` over `extend` with dummy values when the buffer will be immediately overwritten. | ||
| - Use `spawn_cpu()` only at the async-to-CPU boundary (e.g., FSST, decompression, batch materialization) — never nest redundant `spawn_cpu()` calls. | ||
| - Use `OffsetView` instead of `borrow_to_typed_slice` for typed access to byte buffers — avoids cloning the entire buffer. |
There was a problem hiding this comment.
This rule is odd. borrow_to_typed_slice should not copy the buffer. We can probably strike this one.
| ### Dependencies | ||
|
|
||
| - Keep `Cargo.lock` changes intentional; revert unrelated dependency bumps. Pin broken deps with a comment linking the upstream issue. | ||
| - Gate optional/domain-specific deps behind Cargo feature flags. Prefer separate crates for domain functionality (geo, NLP). |
There was a problem hiding this comment.
Maybe add another rule about avoiding extra dependencies if possible to keep the crates light? I've noticed that claude is sometimes a little eager to grab external crates.
| - **All bugfixes and features must have corresponding tests. We do not merge code without tests.** | ||
| - Use `rstest` (Rust) or `@pytest.mark.parametrize` (Python) for tests that differ only in inputs. Use `#[case::{name}(...)]` for readable case names. | ||
| - Replace `print()` in tests with `assert` — prints don't catch regressions. | ||
| - Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files. |
There was a problem hiding this comment.
Small nit, but I think this particular one shouldn't be language specific to Python.
| - Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files. | |
| - Extend existing tests instead of adding overlapping new ones. Add to existing test files. |
- rust/AGENTS.md: add safety guardrails for unsafe set_len() guidance, default to buf.resize() and reserve unsafe for measured hot paths - rust/AGENTS.md: remove incorrect OffsetView vs borrow_to_typed_slice rule (borrow_to_typed_slice does not copy) - rust/AGENTS.md: remove LargeBinary/LargeUtf8 conversion rule (too prescriptive per maintainer feedback) - rust/AGENTS.md: add _opt variant preference for Arrow type-safe access - AGENTS.md: restore rust/examples/ in architecture listing - AGENTS.md: add rule to prefer std/existing deps before new crates - AGENTS.md: make "extend existing tests" rule language-agnostic
All rules are derived from analysis of ~1000 PR reviews to capture recurring review patterns as actionable guidelines.
Changes are as following:
AGENTS.md: consolidate 3 overlapping overview sections into one, merge "Key Technical Details" + "Development Notes" + "Development tips" into organized Coding/Testing/Documentation Standards sections, deduplicate Python/Java commands (now link to subdirectory files)rust/AGENTS.mdwith ~66 Rust-specific rules covering code style, API design, error handling, naming, testing, documentation, and lance-encoding hot path patternsjava/AGENTS.mdwith API design (Options pattern, JNI enum serialization), code style (JavaBean conventions), and documentation rulespython/AGENTS.mdwith Pythonic API design, PyO3 dataclass rules, type hints, and testing patternsprotos/AGENTS.mdwith proto3optionalsemantics, structured message design, and documentation rulesdocs/src/format/AGENTS.mdwith format spec documentation standards (pyarrow schemas, language-agnostic definitions, algorithm detail requirements)