test: simplify IO tests #5228
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5228 +/- ##
==========================================
+ Coverage 82.01% 82.05% +0.03%
==========================================
Files 342 342
Lines 141522 141532 +10
Branches 141522 141532 +10
==========================================
+ Hits 116073 116130 +57
+ Misses 21611 21562 -49
- Partials 3838 3840 +2
Previously, IO statistics were only available in Rust via the IOTracker wrapper. This adds a Python API to access IO stats through dataset.io_stats().

The implementation includes:
- New IoStats pyclass in python/src/dataset/io_stats.rs
- io_stats() method on Dataset that returns incremental stats
- Python wrapper in LanceDataset class with documentation
- Refactored all tests to use dataset.object_store().io_stats() instead of explicit IOTracker instances

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses issues (1), (2), and (3) from the code review:

**Issue 1**: Gate unused field behind test-util feature
- Added #[cfg(feature = "test-util")] to IoTrackingMultipartUpload::path field
- This eliminates the unused field warning in production builds

**Issue 2**: Add both snapshot and incremental IO stats methods
- Added IOTracker::stats() for non-resetting reads (returns clone)
- Renamed ObjectStore::io_stats() to io_stats_incremental()
- Added ObjectStore::io_stats_snapshot() for non-resetting reads
- Updated all call sites (47 locations) to use io_stats_incremental()
- Python API now has:
  - io_stats_snapshot(): Read-only, doesn't reset counters
  - io_stats_incremental(): Returns delta and resets counters

**Issue 3**: Python type hints
- TYPE_CHECKING was already properly configured
- IOStats type hint works correctly with existing imports

The new API makes the resetting behavior explicit in method names, improving clarity and preventing confusion about when counters reset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
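The difference between a snapshot read and an incremental read can be sketched with a toy tracker. This is only an illustration of the semantics described in the commit message: `ToyStats`, `ToyTracker`, and the field names `read_iops` and `read_bytes` are hypothetical and are not taken from the Lance code.

```python
from dataclasses import dataclass, replace


@dataclass
class ToyStats:
    """Hypothetical counters; not Lance's actual IoStats fields."""
    read_iops: int = 0
    read_bytes: int = 0


class ToyTracker:
    """Toy stand-in for an IO tracker, for illustration only."""

    def __init__(self) -> None:
        self._stats = ToyStats()

    def record_read(self, num_bytes: int) -> None:
        self._stats.read_iops += 1
        self._stats.read_bytes += num_bytes

    def snapshot(self) -> ToyStats:
        # Non-resetting read: return a copy and leave the counters alone.
        return replace(self._stats)

    def incremental(self) -> ToyStats:
        # Resetting read: hand back what accumulated since the last call,
        # then start again from zero.
        delta, self._stats = self._stats, ToyStats()
        return delta


tracker = ToyTracker()
tracker.record_read(100)
tracker.record_read(50)

first_snapshot = tracker.snapshot()    # counters untouched
first_delta = tracker.incremental()    # counters reset here
tracker.record_read(10)
second_delta = tracker.incremental()   # only the read made since the reset
```

In this sketch the snapshot and the first incremental read report the same totals, while the second incremental read reports only the one read made after the reset. That is why a test asserting on a single operation would want the incremental form.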
Previously, local filesystem operations bypassed the IO tracking layer because LocalObjectReader reads directly from file handles instead of going through the object_store layer. This commit adds IO tracking for local filesystem reads:

**Changes**:
- Added `io_tracker: Option<Arc<IOTracker>>` field to `LocalObjectReader`
- Added `IOTracker::record_read()` public method for direct recording
- Added `LocalObjectReader::open_with_tracker()` internal method
- Updated `ObjectStore::open()` and `open_with_size()` to pass IOTracker
- Modified `get_range()` and `get_all()` to record operations after reads
- Backward compatible: existing direct calls to `LocalObjectReader::open()` still work (tracker is optional)

**Testing**: Verified with Python test showing:
- Local file reads are now tracked (4 IOPs, 26986 bytes for 1000 rows)
- Incremental tracking works correctly
- Both snapshot and incremental APIs work for local files

This ensures consistent IO tracking across all storage backends (local, S3, GCS, Azure, etc.), giving users complete visibility into their IO operations regardless of where data is stored.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
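As a rough illustration of the change described above, here is a toy reader that reads straight from a file handle and reports each read to a tracker afterwards. The names `ToyTracker` and `ToyLocalReader` are hypothetical stand-ins, and the method bodies are assumptions about the general shape of the approach, not the Rust implementation in this PR.

```python
import os
import tempfile


class ToyTracker:
    """Hypothetical counter holder; not Lance's IOTracker."""

    def __init__(self) -> None:
        self.read_iops = 0
        self.read_bytes = 0

    def record_read(self, num_bytes: int) -> None:
        self.read_iops += 1
        self.read_bytes += num_bytes


class ToyLocalReader:
    """Reads directly from a file and reports each read to a tracker."""

    def __init__(self, path: str, tracker: ToyTracker) -> None:
        self._path = path
        self._tracker = tracker

    def get_range(self, start: int, end: int) -> bytes:
        with open(self._path, "rb") as f:
            f.seek(start)
            data = f.read(end - start)
        # Record after the read so the count reflects bytes actually returned.
        self._tracker.record_read(len(data))
        return data

    def get_all(self) -> bytes:
        with open(self._path, "rb") as f:
            data = f.read()
        self._tracker.record_read(len(data))
        return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    tracker = ToyTracker()
    reader = ToyLocalReader(path, tracker)
    head = reader.get_range(0, 4)
    whole = reader.get_all()
```

Because the reader never goes through a wrapping store layer, the only way its reads show up in the counters is for the reader itself to call `record_read`, which is the gap this commit closes.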
f26cb88 to 5fb6d64
@westonpace this is meant to address your comment in #4923 (review)
///
/// This metric helps understand IO parallelism. A lower number indicates
/// more parallel IO operations.
pub num_hops: u64,
Calling these hops/network hops isn't clear to me. Would this be network hops behind the S3 endpoint? That's how I would interpret it, but I don't think we have that info.
Is this just network requests?
Maybe I should give an example in the comment.
Imagine a process:
- Call list to get 10 files
- In parallel, call head on 10 files
- Read the largest file
That's a total of 12 requests (list, 10 heads, 1 get). But we do them in 3 "hops". Maybe that's not the best term. Could do "stages" or something else.
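The list/head/get example can be modelled with a small asyncio sketch. The counting rule used here (a new stage begins whenever a request starts while nothing else is in flight) is an assumption made for illustration, and `ToyStageTracker` is a hypothetical class, not the tracker in this PR.

```python
import asyncio


class ToyStageTracker:
    """Hypothetical model: a new stage starts whenever a request begins
    while no other request is in flight."""

    def __init__(self) -> None:
        self.num_requests = 0
        self.num_stages = 0
        self._in_flight = 0

    async def request(self, name: str) -> str:
        if self._in_flight == 0:
            self.num_stages += 1
        self.num_requests += 1
        self._in_flight += 1
        try:
            await asyncio.sleep(0)  # stand-in for network latency
            return name
        finally:
            self._in_flight -= 1


async def workflow(tracker: ToyStageTracker) -> None:
    # Stage 1: one list call.
    await tracker.request("list")
    # Stage 2: ten head calls issued in parallel.
    files = [f"file-{i}" for i in range(10)]
    await asyncio.gather(*(tracker.request(f"head {f}") for f in files))
    # Stage 3: one get for the chosen file.
    await tracker.request("get file-0")


tracker = ToyStageTracker()
asyncio.run(workflow(tracker))
```

Under this toy rule the workflow produces 12 requests in 3 stages. The same rule also shows why the number gets noisy: if an unrelated operation had a request in flight at the same time, stage boundaries would blur together.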
I see. Makes me think of dependency chains - not sure if that's a helpful term.
I agree that "hops" is not quite the correct term (though I am used to our usage of it in this way from the Rust unit tests).
I also am not aware of any standard term.
Is this something we want to expose? Should we mention it is likely meaningless if there are concurrent operations in flight? Or that it can be a somewhat noisy metric?
I think I'll rename it to num_stages. You're also right that it's mostly useful for testing, so I'll gate it under test-util.
})
});

// Record the read operation if tracking is enabled
Is the optionality required? Elsewhere the PR seems to indicate it'll always be enabled.
I can remove it. There is a constructor that doesn't pass a tracker down (used for tests I think). But I can just make it create an empty stats instance and record unconditionally. That can simplify some code.
westonpace left a comment
Are these counters process-wide or dataset-wide?
A few questions but no real concerns.
/// IO statistics for dataset operations
///
/// This tracks the number of IO operations and bytes transferred for read and write
/// operations performed on the dataset's object store.
///
/// Note: Calling `io_stats()` returns the statistics accumulated since the last call
/// and resets the internal counters (incremental stats pattern).
#[pyclass(name = "IOStats", module = "_lib", get_all)]
#[derive(Clone, Debug)]
pub struct IoStats {
It would be nice if we could include these docs in mkdocs. I'll make an issue to figure that out. Then we wouldn't need the lengthy Returns block on the Python side.
})
});

// Record the read operation if tracking is enabled
#[cfg(not(feature = "test-util"))]
let _ = (method, path); // Suppress unused variable warnings
What are we feature gating here? Tracking of every request's path / method / range vs. just tracking the counts?
Yeah, with test-util enabled, we will track a list of all requests made in the IO stats. This makes it much easier to debug a failing test.
But for normal usage, keeping track of those will just accumulate a lot of data for no reason.
This particular gate is just because method and path are only used for the request tracking, so I added a line to suppress the unused variable warning.
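The trade-off between always-on counters and an opt-in request log can be sketched as follows. Python has no compile-time feature flags, so a runtime `record_requests` flag plays the role of the test-util feature here. The class, its fields, and the example paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyIOStats:
    """Hypothetical stats object; field names are illustrative only."""
    record_requests: bool = False  # plays the role of the test-util feature
    read_iops: int = 0
    read_bytes: int = 0
    requests: List[Tuple[str, str, int]] = field(default_factory=list)

    def record_read(self, method: str, path: str, num_bytes: int) -> None:
        # Counters are cheap, so they are always updated.
        self.read_iops += 1
        self.read_bytes += num_bytes
        # The per-request log grows with every call, so keep it only when asked.
        if self.record_requests:
            self.requests.append((method, path, num_bytes))


production = ToyIOStats()
testing = ToyIOStats(record_requests=True)
for stats in (production, testing):
    stats.record_read("get_range", "data/0.bin", 4096)
    stats.record_read("get_range", "data/1.bin", 1024)
```

Both instances end up with the same counts, but only the testing instance keeps the list of individual requests that makes a failing assertion easy to diagnose. In the Rust code the gate is resolved at compile time, so the unused `method` and `path` arguments need the explicit suppression line discussed above.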
This PR makes it easier to make assertions about IO

* Make IO statistics on by default.
* Make IO statistics tracked even for local object reader (which previously bypassed statistics)
* Expose IO stats in Python

---------

Co-authored-by: Claude <noreply@anthropic.com>