Skip to content

test: simplify IO tests#5228

Merged
wjones127 merged 11 commits intolance-format:mainfrom
wjones127:feat/simplify-io-tests
Nov 20, 2025
Merged

test: simplify IO tests#5228
wjones127 merged 11 commits intolance-format:mainfrom
wjones127:feat/simplify-io-tests

Conversation

@wjones127
Copy link
Copy Markdown
Contributor

@wjones127 wjones127 commented Nov 12, 2025

This PR makes it easier to make assertions about IO

  • Make IO statistics on by default.
  • Make IO statistics tracked even for local object reader (which previously bypassed statistics)
  • Expose IO stats in Python

@github-actions github-actions Bot added the chore label Nov 12, 2025
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 86.84211% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.05%. Comparing base (64168b4) to head (3542f68).

Files with missing lines Patch % Lines
rust/lance-io/src/object_store.rs 69.69% 10 Missing ⚠️
rust/lance-io/src/utils/tracking_store.rs 85.71% 3 Missing ⚠️
rust/lance-io/src/local.rs 83.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5228      +/-   ##
==========================================
+ Coverage   82.01%   82.05%   +0.03%     
==========================================
  Files         342      342              
  Lines      141522   141532      +10     
  Branches   141522   141532      +10     
==========================================
+ Hits       116073   116130      +57     
+ Misses      21611    21562      -49     
- Partials     3838     3840       +2     
Flag Coverage Δ
unittests 82.05% <86.84%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

wjones127 and others added 5 commits November 13, 2025 12:30
Previously, IO statistics were only available in Rust via the IOTracker
wrapper. This adds a Python API to access IO stats through dataset.io_stats().

The implementation includes:
- New IoStats pyclass in python/src/dataset/io_stats.rs
- io_stats() method on Dataset that returns incremental stats
- Python wrapper in LanceDataset class with documentation
- Refactored all tests to use dataset.object_store().io_stats() instead of
  explicit IOTracker instances

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses issues (1), (2), and (3) from the code review:

**Issue 1**: Gate unused field behind test-util feature
- Added #[cfg(feature = "test-util")] to IoTrackingMultipartUpload::path field
- This eliminates the unused field warning in production builds

**Issue 2**: Add both snapshot and incremental IO stats methods
- Added IOTracker::stats() for non-resetting reads (returns clone)
- Renamed ObjectStore::io_stats() to io_stats_incremental()
- Added ObjectStore::io_stats_snapshot() for non-resetting reads
- Updated all call sites (47 locations) to use io_stats_incremental()
- Python API now has:
  - io_stats_snapshot(): Read-only, doesn't reset counters
  - io_stats_incremental(): Returns delta and resets counters

**Issue 3**: Python type hints
- TYPE_CHECKING was already properly configured
- IOStats type hint works correctly with existing imports

The new API makes the resetting behavior explicit in method names,
improving clarity and preventing confusion about when counters reset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Previously, local filesystem operations bypassed the IO tracking layer
because LocalObjectReader reads directly from file handles instead of
going through the object_store layer.

This commit adds IO tracking for local filesystem reads:

**Changes**:
- Added `io_tracker: Option<Arc<IOTracker>>` field to `LocalObjectReader`
- Added `IOTracker::record_read()` public method for direct recording
- Added `LocalObjectReader::open_with_tracker()` internal method
- Updated `ObjectStore::open()` and `open_with_size()` to pass IOTracker
- Modified `get_range()` and `get_all()` to record operations after reads
- Backward compatible: existing direct calls to `LocalObjectReader::open()`
  still work (tracker is optional)

**Testing**:
Verified with Python test showing:
- Local file reads are now tracked (4 IOPs, 26986 bytes for 1000 rows)
- Incremental tracking works correctly
- Both snapshot and incremental APIs work for local files

This ensures consistent IO tracking across all storage backends (local,
S3, GCS, Azure, etc.) giving users complete visibility into their IO
operations regardless of where data is stored.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/simplify-io-tests branch from f26cb88 to 5fb6d64 Compare November 13, 2025 20:38
@wjones127
Copy link
Copy Markdown
Contributor Author

@westonpace this is meant to address your comment in #4923 (review)

@wjones127 wjones127 marked this pull request as ready for review November 14, 2025 00:10
Comment thread python/src/dataset/io_stats.rs Outdated
///
/// This metric helps understand IO parallelism. A lower number indicates
/// more parallel IO operations.
pub num_hops: u64,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling these hops/network hops isn't clear to me. Would this be network hops behind the S3 endpoint? That's how I would interpret it but I don't think we have that info

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just network requests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should give an example in the comment.

Imagine a process:

  1. Call list to get 10 files
  2. In parallel, call head on 10 files
  3. Read the largest file

That's a total of 12 requests (list, 10 heads, 1 get). But we do them in 3 "hops". Maybe that's not the best term. Could do "stages" or something else.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Makes me think of dependency chains - not sure if that's a helpful term.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "hops" is not quite the correct term (though I am used to our usage of it in this way from the Rust unit tests).

I also am not aware of any standard term.

Is this something we want to expose? Should we mention it is likely meaningless if there are concurrent operations in flight? Or that it can be a somewhat noisy metric?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll rename it for num_stages. You're also right it's mostly useful for testing. So I'll gate it under test-utils.

Comment thread rust/lance-io/src/local.rs Outdated
})
});

// Record the read operation if tracking is enabled
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the optionality required? Elsewhere in the PR it seems to indicate it'll always be enabled

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it. There is a constructor that doesn't pass a tracker down (used for tests I think). But I can just make it create an empty stats instance and record unconditionally. That can simplify some code.

@wjones127 wjones127 requested a review from westonpace November 18, 2025 18:56
Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these counters process-wide or dataset-wide?

A few questions but no real concerns.

Comment thread python/src/dataset.rs Outdated
Comment thread python/src/dataset.rs Outdated
Comment on lines +8 to +17
/// IO statistics for dataset operations
///
/// This tracks the number of IO operations and bytes transferred for read and write
/// operations performed on the dataset's object store.
///
/// Note: Calling `io_stats()` returns the statistics accumulated since the last call
/// and resets the internal counters (incremental stats pattern).
#[pyclass(name = "IOStats", module = "_lib", get_all)]
#[derive(Clone, Debug)]
pub struct IoStats {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could include these docs in mkdocs. I'll make an issue to figure that out. Then we wouldn't need the lengthy Returns block on the python.

Comment thread python/src/dataset/io_stats.rs Outdated
///
/// This metric helps understand IO parallelism. A lower number indicates
/// more parallel IO operations.
pub num_hops: u64,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "hops" is not quite the correct term (though I am used to our usage of it in this way from the Rust unit tests).

I also am not aware of any standard term.

Is this something we want to expose? Should we mention it is likely meaningless if there are concurrent operations in flight? Or that it can be a somewhat noisy metric?

Comment thread rust/lance-io/src/local.rs Outdated
})
});

// Record the read operation if tracking is enabled
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

Comment on lines +252 to +253
#[cfg(not(feature = "test-util"))]
let _ = (method, path); // Suppress unused variable warnings
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we feature gating here? Tracking of every request's path / method / range vs. just tracking the counts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah with test-util enabled, we will track a list of all requests made in the IO stats. This makes it much easier to debug a failing test.

But for normal usage, keeping track of those will just accumulate a lot of data for now reason.

This particular gate is just because method and path are only used for the request tracking, so added a line to suppress the unused variable warning.

@wjones127 wjones127 merged commit 1fa26ac into lance-format:main Nov 20, 2025
24 of 25 checks passed
@wjones127 wjones127 deleted the feat/simplify-io-tests branch November 20, 2025 22:38
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
This PR makes it easier to make assertions about IO

* Make IO statistics on by default.
* Make IO statistics tracked even for local object reader (which
previously bypassed statistics)
* Expose IO stats in Python

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants