feat: support manifeset summary and get it from Version#4754
feat: support manifeset summary and get it from Version#4754yanghua merged 19 commits intolance-format:mainfrom
Conversation
5ac2cf3 to
2e3322c
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4754 +/- ##
==========================================
+ Coverage 80.88% 80.90% +0.02%
==========================================
Files 323 323
Lines 127559 127695 +136
Branches 127559 127695 +136
==========================================
+ Hits 103171 103308 +137
Misses 20754 20754
+ Partials 3634 3633 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
390ee29 to
0dabbf7
Compare
0dabbf7 to
39d0f18
Compare
| /// - total-data-files: Total number of data files across all fragments | ||
| /// - total-deletions: Total number of deleted records | ||
| /// - total-deletion-files: Number of fragments with deletion files | ||
| pub fn summary(&self) -> BTreeMap<String, String> { |
There was a problem hiding this comment.
Why output as a map of strings? Then the caller has to parse the integers. Seems like it would be better to just expose the ManifestStats as public, right?
There was a problem hiding this comment.
I was planning to extend the summary to transaction as #4337 metioned. For that part we may need string. And after that I probobly use the stats in Version struct which already have a map of string(the final usage I think accepts string more friendly).
I think you are right we could expose ManifestStats and then transform to map<string, string> for Version struct. That should be more flexible.
wjones127
left a comment
There was a problem hiding this comment.
Seems good. I was thinking we'd want these kinds of stats easily available anyways. Thanks!
| fn from(val: ManifestStats) -> Self { | ||
| let mut stats_map = Self::new(); | ||
| stats_map.insert( | ||
| "total-fragments".to_string(), |
There was a problem hiding this comment.
kebab case does seem like an odd choice. I'm guessing this is the match the iceberg statistics?
There was a problem hiding this comment.
kebab case does seem like an odd choice. I'm guessing this is the match the iceberg statistics?
The initial idea was inspired by Iceberg, and we utilized the summary for management functions. However, matching the naming convention wasn't intentional; I just didn't give it much thought.
I've switched it to snake_case. Appreciate you catching that!
| summary.total_data_files += f.files.len() as u64; | ||
| // Sum the number of rows for the current fragment (if available) | ||
| if let Some(num_rows) = f.num_rows() { | ||
| summary.total_records += num_rows as u64; |
There was a problem hiding this comment.
I think it's better to say total-data-file-rows, total-deletion-file-rows instead, and we can have a total-rows that is total-data-file-rows - total-deletion-file-rows
There was a problem hiding this comment.
Oh, you prefer kebab case or snake case? I'm OK with both
There was a problem hiding this comment.
Oh I just copied it from the docstring, looks like you need to update them to be consistent.
There was a problem hiding this comment.
It's done.
Feel free to let me know if you have any other thoughts
| /// Get the summary information of a manifest. | ||
| /// | ||
| /// This function calculates various statistics about the manifest, including: | ||
| /// - total-records: Total number of records in the dataset |
There was a problem hiding this comment.
looks like these docs are still not updated to snake case?
There was a problem hiding this comment.
PTAL and feel free to merge if no other comments
2db8a2c to
be4d34c
Compare
This PR follows #4754 : ManifestSummary is useful in many scenarios. For example, user can get statistical information about their datasets, and computing engines such as Flink and Spark can dynamically decide source parallelism and resource spec. --------- Co-authored-by: 喆宇 <wxy407679@antgroup.com>
This PR follows lance-format#4754 : ManifestSummary is useful in many scenarios. For example, user can get statistical information about their datasets, and computing engines such as Flink and Spark can dynamically decide source parallelism and resource spec. --------- Co-authored-by: 喆宇 <wxy407679@antgroup.com>
Discussion: #4337
Still we can't get transaction information directly from manifest. This will be enabled after feature #4308