refactor: overhaul AGENTS.md with PR review insights by Xuanwo · Pull Request #6103 · lance-format/lance

Xuanwo · 2026-03-05T09:47:34Z

All rules are derived from analysis of ~1000 PR reviews to capture recurring review patterns as actionable guidelines.

The changes are as follows:

- Restructure root AGENTS.md: consolidate 3 overlapping overview sections into one, merge "Key Technical Details" + "Development Notes" + "Development tips" into organized Coding/Testing/Documentation Standards sections, deduplicate Python/Java commands (these now link to subdirectory files)
- Create rust/AGENTS.md with ~66 Rust-specific rules covering code style, API design, error handling, naming, testing, documentation, and lance-encoding hot path patterns
- Enhance java/AGENTS.md with API design (Options pattern, JNI enum serialization), code style (JavaBean conventions), and documentation rules
- Enhance python/AGENTS.md with Pythonic API design, PyO3 dataclass rules, type hints, and testing patterns
- Enhance protos/AGENTS.md with proto3 optional semantics, structured message design, and documentation rules
- Create docs/src/format/AGENTS.md with format spec documentation standards (pyarrow schemas, language-agnostic definitions, algorithm detail requirements)

Restructure and enhance AGENTS.md files based on analysis of ~1000 PR reviews. Consolidate overlapping sections in root AGENTS.md, add directory-specific guidelines for rust/, java/, python/, protos/, and docs/src/format/, integrating coding standards extracted from recurring review patterns.

github-actions · 2026-03-05T09:49:41Z

Review of PR #6103

Documentation-only PR — no functional code changes. Two issues worth flagging:

P1: unsafe set_len() guidance in rust/AGENTS.md is under-specified

The lance-encoding section recommends:

Pre-allocate single contiguous buffers; prefer Vec::with_capacity + unsafe { set_len() } over extend with dummy values when the buffer will be immediately overwritten.

This guideline recommends UB-prone unsafe code without adequate guardrails. While the codebase does use this pattern (e.g., lance-io/src/local.rs:195 with BytesMut), it requires a // Safety: comment and careful justification each time. The guideline as written could lead contributors to reach for unsafe set_len() too casually. Consider either:

- Adding a qualifier like "requires a // Safety: justification and must be immediately followed by a write that fully initializes the buffer (e.g., read_exact)"
- Or replacing with the safer buf.resize(len, 0) as the default recommendation, reserving set_len() for measured hot paths only
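To make the two options concrete, here is a minimal sketch using only the standard library. The function names `read_exact_safe` and `squares_prealloc` are hypothetical and do not come from the Lance codebase. The unsafe variant initializes the spare capacity first and only then calls `set_len()`, which is the ordering that is clearly sound. Treat it as an illustration of the guardrails being asked for, not as the pattern used at lance-io/src/local.rs:195.

```rust
use std::io::{self, Read};

/// Safe default: zero-fill the buffer, then let the reader overwrite it.
/// (Hypothetical helper, for illustration only.)
pub fn read_exact_safe<R: Read>(reader: &mut R, len: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    buf.resize(len, 0);
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

/// Hot-path variant: skip the zero-fill by writing into spare capacity.
/// (Hypothetical helper, for illustration only.)
pub fn squares_prealloc(len: usize) -> Vec<u64> {
    let mut buf: Vec<u64> = Vec::with_capacity(len);
    // `spare_capacity_mut` may expose more than `len` slots, so cap it.
    for (i, slot) in buf.spare_capacity_mut().iter_mut().take(len).enumerate() {
        slot.write((i as u64) * (i as u64));
    }
    // Safety: `with_capacity(len)` guarantees capacity >= len, and the loop
    // above initialized every element in 0..len before the length changes.
    unsafe { buf.set_len(len) };
    buf
}
```

The safe version costs one extra pass over the buffer for the zero-fill; whether that cost matters is the kind of thing the "measured hot paths only" qualifier asks contributors to check first.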

P1: rust/examples/ dropped from architecture listing

The rust/examples/ directory was listed in the old AGENTS.md but is absent from the new architecture section. If it still exists and is relevant, it should be included.

…e conflict
- protos/AGENTS.md: rewrite optional guidance to accurately describe proto3 presence semantics (no presence vs has_* tracking) instead of incorrectly framing it as required vs optional
- rust/AGENTS.md: scope "delete obsolete methods" rule to internal (pub(crate)/private) methods only, deferring to root AGENTS.md deprecation path for public API methods

codecov · 2026-03-05T11:41:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


westonpace

I really like this collection. It feels almost like defining our own custom linter. It's like we have static analysis, dynamic analysis, and a new layer: inferential analysis.

westonpace · 2026-03-05T13:22:46Z

+- Use `PrimitiveArray::<T>::from(vec)` (zero-copy) instead of `from_iter_values(vec)` for Vec-to-PrimitiveArray conversion.
+- Implement `Default` trait on config/options structs instead of standalone `default_*()` helpers.
+- Place `#[cfg(test)] mod tests` as a single block at the bottom of each file — no production code after it.
+- Place `use` imports at the top of the file, not inline within function bodies.


Hurray 😆
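As an illustration of the `Default` rule quoted above, the following sketch shows a config struct that implements `Default` instead of exposing standalone `default_*()` helpers. `ScanOptions`, its fields, and the default values are all hypothetical and are not taken from Lance.

```rust
/// Hypothetical options struct, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanOptions {
    pub batch_size: usize,
    pub use_stats: bool,
}

impl Default for ScanOptions {
    // All defaults live in one place instead of in helpers such as
    // `default_batch_size()` and `default_use_stats()`.
    fn default() -> Self {
        Self {
            batch_size: 1024,
            use_stats: true,
        }
    }
}

/// Callers override only what they need via struct update syntax.
pub fn small_batch_options() -> ScanOptions {
    ScanOptions {
        batch_size: 64,
        ..Default::default()
    }
}
```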

westonpace · 2026-03-05T13:24:27Z

+- Use strongly-typed structs instead of `HashMap<String, String>` in APIs — convert to strings only at serialization boundaries.
+- Keep `RowAddr` (physical fragment+offset) and `RowId` (stable logical identifier) as distinct types — never raw `u64` for both.
+- Use `RowAddress` from `lance-core/src/utils/address.rs` instead of raw bitwise operations on row addresses.
+- Use `RowAddrTreeMap`/`RoaringBitmap` instead of `Vec<Range<u64>>` for physical row selections.


I think this one is a bit of an "it depends" but it is a reasonable default
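For the rule about keeping `RowAddr` and `RowId` distinct, a newtype sketch shows what the compiler buys you. The types below are hypothetical stand-ins written for this example, not the actual `RowAddress` from `lance-core/src/utils/address.rs`, and the 32/32 bit split between fragment id and offset is an assumption of the sketch.

```rust
/// Hypothetical physical address: fragment id in the upper 32 bits,
/// row offset in the lower 32 bits (layout assumed for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct RowAddr(u64);

impl RowAddr {
    pub fn new(fragment_id: u32, row_offset: u32) -> Self {
        Self(((fragment_id as u64) << 32) | row_offset as u64)
    }

    pub fn fragment_id(self) -> u32 {
        (self.0 >> 32) as u32
    }

    pub fn row_offset(self) -> u32 {
        // Truncation keeps only the lower 32 bits.
        self.0 as u32
    }
}

/// Hypothetical stable logical identifier; deliberately has no
/// fragment/offset accessors because it carries no physical meaning.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct RowId(u64);

impl RowId {
    pub fn new(id: u64) -> Self {
        Self(id)
    }

    pub fn get(self) -> u64 {
        self.0
    }
}

/// Accepts only physical addresses; passing a `RowId` is a compile error.
pub fn fragment_of(addr: RowAddr) -> u32 {
    addr.fragment_id()
}
```

With raw `u64` for both, a call like `fragment_of(row_id)` would compile and silently return garbage; with the newtypes it is rejected at compile time, and the bitwise operations stay in one place.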

westonpace · 2026-03-05T13:25:29Z

+- Keep traits minimal — only core abstraction methods. Move helpers to standalone functions and config to struct fields.
+- Get column/field types from schema metadata — never materialize data rows just to inspect types.
+- Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats.
+- Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk).


I'm not entirely sure about this rule.

westonpace · 2026-03-05T13:26:23Z

+- Get column/field types from schema metadata — never materialize data rows just to inspect types.
+- Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats.
+- Convert `LargeBinary`, `LargeUtf8`, `Utf8View`, `BinaryView` to `Large*` variants, never to `Utf8`/`Binary` (i32 offset overflow risk).
+- Use Arrow's type-safe access (`ArrayAccessor` trait bounds, `as_*_array` helpers) instead of `arrow::compute::cast` + `downcast_ref`.


Maybe even prefer the _opt variants unless the data type has already been verified.

westonpace · 2026-03-05T13:32:44Z

+- Hoist loop-invariant conditionals out of hot loops — branch once outside, then use separate loop bodies or monomorphized variants.
+- Pre-allocate single contiguous buffers; prefer `Vec::with_capacity` + `unsafe { set_len() }` over `extend` with dummy values when the buffer will be immediately overwritten.
+- Use `spawn_cpu()` only at the async-to-CPU boundary (e.g., FSST, decompression, batch materialization) — never nest redundant `spawn_cpu()` calls.
+- Use `OffsetView` instead of `borrow_to_typed_slice` for typed access to byte buffers — avoids cloning the entire buffer.


This rule is odd. borrow_to_typed_slice should not copy the buffer. We can probably strike this one.
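Separately from the `OffsetView` rule under discussion, the first rule in the quoted hunk (hoisting loop-invariant conditionals) can be sketched with the standard library alone. All function names here are hypothetical. The sketch shows the branch inside the loop, the same logic with the branch hoisted into separate loop bodies, and a monomorphized variant using a const generic.

```rust
/// Baseline: `negate` never changes, yet it is tested on every iteration.
pub fn sum_branch_inside(values: &[i64], negate: bool) -> i64 {
    let mut total = 0;
    for &v in values {
        if negate {
            total -= v;
        } else {
            total += v;
        }
    }
    total
}

/// Hoisted: branch once, then run a loop body with no conditional in it.
pub fn sum_branch_hoisted(values: &[i64], negate: bool) -> i64 {
    if negate {
        values.iter().map(|&v| -v).sum()
    } else {
        values.iter().sum()
    }
}

/// Monomorphized: `NEGATE` is a compile-time constant, so each
/// instantiation contains only one arm of the conditional.
pub fn sum_mono<const NEGATE: bool>(values: &[i64]) -> i64 {
    let mut total = 0;
    for &v in values {
        if NEGATE {
            total -= v;
        } else {
            total += v;
        }
    }
    total
}

/// Runtime dispatch happens once, outside the hot loop.
pub fn sum_dispatch(values: &[i64], negate: bool) -> i64 {
    if negate {
        sum_mono::<true>(values)
    } else {
        sum_mono::<false>(values)
    }
}
```

An optimizer may already perform this transformation for a loop this simple, so the sketch only shows the shape of the rewrite; whether it pays off in a real decoder loop is something to measure.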

westonpace · 2026-03-05T13:35:13Z

+### Dependencies
+
+- Keep `Cargo.lock` changes intentional; revert unrelated dependency bumps. Pin broken deps with a comment linking the upstream issue.
+- Gate optional/domain-specific deps behind Cargo feature flags. Prefer separate crates for domain functionality (geo, NLP).


Maybe add another rule about avoiding extra dependencies where possible to keep the crates light? I've noticed that Claude is sometimes a little eager to grab external crates.

prrao87 · 2026-03-05T13:44:34Z

+- **All bugfixes and features must have corresponding tests. We do not merge code without tests.**
+- Use `rstest` (Rust) or `@pytest.mark.parametrize` (Python) for tests that differ only in inputs. Use `#[case::{name}(...)]` for readable case names.
+- Replace `print()` in tests with `assert` — prints don't catch regressions.
+- Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files.


Small nit, but I think this particular one shouldn't be language specific to Python.

Suggested change

- Extend existing tests instead of adding overlapping new ones. Add to existing `test_{module}.py` files.

- Extend existing tests instead of adding overlapping new ones. Add to existing test files.

- rust/AGENTS.md: add safety guardrails for unsafe set_len() guidance, default to buf.resize() and reserve unsafe for measured hot paths
- rust/AGENTS.md: remove incorrect OffsetView vs borrow_to_typed_slice rule (borrow_to_typed_slice does not copy)
- rust/AGENTS.md: remove LargeBinary/LargeUtf8 conversion rule (too prescriptive per maintainer feedback)
- rust/AGENTS.md: add _opt variant preference for Arrow type-safe access
- AGENTS.md: restore rust/examples/ in architecture listing
- AGENTS.md: add rule to prefer std/existing deps before new crates
- AGENTS.md: make "extend existing tests" rule language-agnostic

github-actions bot added the python and java labels on Mar 5, 2026

westonpace approved these changes Mar 5, 2026


prrao87 reviewed Mar 5, 2026


Xuanwo merged commit ab1a1c1 into main Mar 5, 2026

Xuanwo deleted the refactor/agents-md-overhaul branch March 5, 2026 16:47

Xuanwo merged 3 commits into main from refactor/agents-md-overhaul

3 participants
