wc: streaming --files0-from and other improvements #4696
sylvestre merged 6 commits into uutils:main
Can you please run
Yes. Apologies, I thought I had!
Could you please split the PR into smaller commits? This is a big patch...
I've broken up the change into 4 different commits. The lines between some concerns were a little blurry, as I ended up touching a lot of stuff, but I've tried my best to keep them each cleanly readable on their own. I'm still working on getting the GNU coreutils tests running completely, but I've at least got the one complaining to run. And, it turns out
7c5c68c to 9f6b788
Please remove the "draft" when ready to be reviewed :)
Alright, I think I'm ready for some feedback to iterate on here. On one point in particular: I didn't realize, when I introduced AsRawFd/FromRawFd, that the import paths were new as of 1.66. Also, the handy
We do it only after cutting a release and check how important the new features are. Essentially we bump it by need. But always in a separate PR. |
tertsdiepraam left a comment
Pretty cool stuff! I would suggest that you keep patches much smaller in the future though.
    show_words: bool,
    show_max_line_length: bool,
    files0_from_path: Option<PathBuf>,
    files0_from: Option<Input<'a>>,
If you make things more complicated for performance reasons (which I assume the lifetimes are for), you do kind of need to show that performance improves. I bet it doesn't really matter, because it happens just once and this is not a hot path. The same applies to some of the Cow strings. If you just get rid of cloning and do not introduce complexity it's always fine of course.
I get perhaps irrationally obsessive about avoiding unnecessary allocations.... But, I was actually pleased with how simple this whole scheme ended up being. In my mind, 'a was just short for "command-line Args", which is what we're aiming to borrow from here... I almost came up with more complicated schemes with other lifetimes specified, but they weren't really helpful.
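To make the "'a is short for command-line Args" idea concrete, here is a minimal sketch of the borrowing scheme. It is not the PR's code: this `Input` enum and its `from_arg`, `from_owned`, and `is_borrowed` helpers are hypothetical names chosen for illustration. The only assumption is that names taken from the argument list can be borrowed, while names read at runtime must be owned.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for an `Input<'a>` type; `'a` is the lifetime of
/// the parsed command-line arguments.
#[derive(Debug)]
enum Input<'a> {
    Path(Cow<'a, Path>),
    Stdin,
}

impl<'a> Input<'a> {
    /// Borrow a name that already lives in the argument list: no allocation.
    fn from_arg(arg: &'a OsStr) -> Self {
        if arg == "-" {
            Input::Stdin
        } else {
            Input::Path(Cow::Borrowed(Path::new(arg)))
        }
    }

    /// Take ownership of a name that was produced at runtime,
    /// for example one read from a stream.
    fn from_owned(name: OsString) -> Self {
        Input::Path(Cow::Owned(PathBuf::from(name)))
    }

    /// True when the file name is borrowed rather than owned.
    fn is_borrowed(&self) -> bool {
        matches!(self, Input::Path(Cow::Borrowed(_)))
    }
}

fn main() {
    // The argument vector outlives every `Input` that borrows from it.
    let args: Vec<OsString> = std::env::args_os().skip(1).collect();
    let inputs: Vec<Input<'_>> = args.iter().map(|a| Input::from_arg(a)).collect();
    println!("{inputs:?}");
}
```

With a single lifetime tied to the argument list, the borrowed and owned cases share one type, which is roughly the simplicity argument made in the comment above.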
jeddenlea left a comment
Thanks for the feedback! I've done my best to address your concerns within each of the 4 commits, and I've rebased them all atop main.
See this failure:
Hey Sylvestre. Yeah, that's what prompted my question about your policy around minimum versions.... If you anticipate moving to 1.67 soon, I'd leave that in place. Otherwise, we'll need to do something different. It's unfortunate Rust hasn't stabilized its version check, https://doc.rust-lang.org/beta/unstable-book/language-features/cfg-version.html
src/uu/wc/src/wc.rs
Outdated
    let input = match maybe_input {
        Ok(input) => input,
        Err(err) => {
            record_error!("{err}");
Suggested change:
-            record_error!("{err}");
+            show!(e);
(This might need an into)
So instead of io::Error -> String -> USimpleError, we do io::Error -> UIoError.
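The difference between the two conversion chains can be illustrated without uucore. The sketch below uses a hypothetical `ReportedError` enum (not uucore's `USimpleError` or `UIoError`, whose APIs are not shown in this thread) to show what is lost when an `io::Error` is flattened to a `String` first, and what is kept when it is wrapped directly.

```rust
use std::io;

/// Hypothetical error type for illustration only.
#[derive(Debug)]
enum ReportedError {
    /// Message only: the original error kind is gone.
    Simple(String),
    /// Wraps the original error, so its kind is still available.
    Io(io::Error),
}

impl From<io::Error> for ReportedError {
    fn from(err: io::Error) -> Self {
        ReportedError::Io(err)
    }
}

impl ReportedError {
    /// The underlying `ErrorKind`, if it survived the conversion.
    fn kind(&self) -> Option<io::ErrorKind> {
        match self {
            ReportedError::Simple(_) => None,
            ReportedError::Io(err) => Some(err.kind()),
        }
    }
}

/// io::Error -> String -> simple error: only the text is kept.
fn via_string(err: io::Error) -> ReportedError {
    ReportedError::Simple(err.to_string())
}

/// io::Error -> wrapped error: one `into()`, nothing discarded.
fn direct(err: io::Error) -> ReportedError {
    err.into()
}

fn main() {
    let missing = || io::Error::new(io::ErrorKind::NotFound, "missing");
    println!("{:?}", via_string(missing()).kind());
    println!("{:?}", direct(missing()).kind());
}
```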
Yeah that's slick! So now not only do we get to punt on the naming of show, but we still get to make each case of it into a single line which was the whole point of record_error to begin with.
    fn try_iter(
        &'a self,
        settings: &'a Settings<'a>,
The structure of functions here is what gets you into trouble with a lot of references and the Cow in Input. What if you make Input always a reference and then structure it like this:
fn count_inputs(inputs: Inputs) -> ... {
    match inputs {
        Inputs::Stdin => count_input(Input::Stdin(StdinKind::Implicit)),
        Inputs::Paths(paths) => /* call count_input for each path */,
        Inputs::Files0From(path) => /* call count_input for each line */,
    }
}
Then you don't have to pass Iterators around, which makes lifetimes a lot easier.
The iterator returned by try_iter takes care of so much, though. It also helps keep the flow of the program consistent with what was happening before. Previously, it could only iterate through an [Input] slice, so we couldn't handle streams. I wanted to be able to handle streams from --files0-from in the same manner as a list of command line arguments, and it has also made it easy to consistently handle errors like empty file names.
I should have mentioned already, and I'm embarrassed I still haven't gotten to it yet, but it was my intent to write some more tests to capture the new cases in which this change makes this wc act like GNU's. The formatting of the errors when empty paths are found in either a --files0-from or the command line is one "feature" we get with this.
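For readers unfamiliar with the streaming approach described in the two comments above, here is a rough sketch of reading NUL-separated names lazily and flagging empty ones by their 1-based position. It is not the PR's `try_iter`: the function names `files0_names` and `check_names` are hypothetical, and the error wording is only modeled loosely on GNU's "invalid zero-length file name" message.

```rust
use std::io::{self, BufRead};

/// Lazily yields `(1-based index, raw name)` pairs from a NUL-separated
/// stream. Nothing is read until the iterator is advanced, so this also
/// works for pipes and other non-regular files.
fn files0_names<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<(usize, Vec<u8>)>> {
    reader
        .split(b'\0')
        .enumerate()
        // `enumerate` counts from 0, but positions are reported from 1.
        .map(|(i, item)| item.map(|name| (i + 1, name)))
}

/// Turns each entry into either a usable name or an error message.
fn check_names<R: BufRead>(reader: R) -> Vec<Result<String, String>> {
    files0_names(reader)
        .map(|item| match item {
            Ok((idx, name)) if name.is_empty() => {
                Err(format!("{idx}: invalid zero-length file name"))
            }
            Ok((_, name)) => Ok(String::from_utf8_lossy(&name).into_owned()),
            Err(err) => Err(err.to_string()),
        })
        .collect()
}

fn main() {
    // A byte slice implements `BufRead`, so it can stand in for a stream.
    let data: &[u8] = b"a.txt\0\0b.txt\0";
    for entry in check_names(data) {
        println!("{entry:?}");
    }
}
```

Because the names are produced by an iterator, the same loop can consume either this stream or a list of command-line arguments, which is the consistency the comment above is arguing for.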
I think not just for this.
Honestly, I'm glad 😄 At least for this project, keeping a consistent MSRV would always be the best option. For libraries, it might be different of course.
Oh sure, this is not a killer feature worth it on its own. I'll see about copying it in verbatim if it's short, or writing a facsimile...
Those jerks just keep adding useful stuff to the standard library! This is certainly not the first time I stumbled onto something useful that just happened to be very new and caused this problem.
With this PR: #4460 I think it needs a rebase :)
@jeddenlea on android: rings a bell?
I had made a symlink named
The GnuTest failures look like a hiccup with
Nope, I think the test doesn't work on Windows. See: it times out
🤦 I guess I could have looked myself! Sorry about that. I think I understand what the failure is, and it shouldn't be too hard to fix in theory. I'm currently in the middle of getting VirtualBox running so I can actually work on the thing in Windows....
The Settings object did not need a QuotingStyle member, it was basically a static.
print_stats will now take advantage of the buffer built into io::stdout(). We can also waste fewer lines on show! by making a helper macro.
WcError should not hold Strings which are clones of `&'static str`s. thiserror provides a convenient API to declare errors with their Display messages. wc quotes the --files0-from source slightly differently when reporting errors about empty file names.
Sadly ilog10 isn't available until we use 1.67, but we can get close in the meantime.
My previous commits meant to bring our wc's output and behavior in line with GNU's. There should be tests that check for these changes! I found a stupid bug in my own changes: I was not adding 1 to the indexes produced by .enumerate() when printing errors.
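The commit note about ilog10 does not show the replacement that was used. As one possible stand-in that compiles on toolchains older than 1.67, the digit count of a number can be computed with repeated division. The function name `decimal_width` is hypothetical and this is only a sketch of the idea.

```rust
/// Number of decimal digits needed to print `n`; zero needs one digit.
/// On Rust 1.67 or newer, `n.checked_ilog10()` could be used instead.
fn decimal_width(mut n: u64) -> usize {
    let mut width = 1;
    while n >= 10 {
        n /= 10;
        width += 1;
    }
    width
}

fn main() {
    // Pick a column width wide enough for the largest count in a run.
    let counts = [7_u64, 120, 45_000];
    let width = counts.iter().copied().map(decimal_width).max().unwrap_or(1);
    println!("column width: {width}");
}
```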
Woo! Looks like I fixed the Windows stuff. That was a.... pain. VirtualBox didn't work as it is wont to do. I ended up using a free t2.micro on AWS, probably the first time I've used Windows since I almost threw a Windows 7 laptop out of my window because it was so bloody slow and awful! Anyway, in the interest of preventing any individual commits entering the tree which fail CI on their own, I patched the last two here and rebased them all again. You can see the diffs here: https://github.com/jeddenlea/coreutils/compare/old_wc..wc
impressive, well done :)
My original focus was on `--files0-from`, which should be handled as a stream if the file specified (or stdin) is not a regular file. I've accomplished this by making a new `Inputs` type to represent the desired list of files to process. Its `try_iter` method does most of the heavy lifting to figure out how to actually run.
Secondarily, I have attempted to reduce the number of String allocations that occur while printing the results. Now, unless a file name needs escaping, or an error occurs, no additional allocations should be necessary to print results. An `Input` now borrows file names from the command line, unless they're read from `--files0-from`. `print_stats` was the biggest abuser, allocating a `String` for each column before they were `join`'d into a `String` for the whole line. The `TitledWordCount` type is not necessary at all, which was another source of `String` allocation.
Finally, I've made some effort to make more cases match GNU wc's output. Errors encountered processing `--files0-from`, as well as any encountered processing files listed on the command line, will be escaped more consistently, like GNU wc. File names printed to the right of their stats will be less aggressively quoted, to match GNU wc: now only if they are not UTF-8 or if they contain a newline.
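The allocation point about `print_stats` can be sketched as follows. This is not the PR's implementation: `write_stats` and its parameters are hypothetical, and the sketch only shows the general technique of formatting each column straight into a writer instead of building one `String` per column and joining them.

```rust
use std::io::{self, Write};

/// Writes right-aligned counts separated by single spaces, then an optional
/// title, directly into `out`. No intermediate `String` is built.
fn write_stats<W: Write>(
    out: &mut W,
    counts: &[u64],
    width: usize,
    title: Option<&str>,
) -> io::Result<()> {
    for (i, count) in counts.iter().enumerate() {
        if i > 0 {
            out.write_all(b" ")?;
        }
        write!(out, "{count:>width$}")?;
    }
    if let Some(title) = title {
        write!(out, " {title}")?;
    }
    writeln!(out)
}

fn main() -> io::Result<()> {
    // Locking stdout once lets every write go through its built-in buffer.
    let stdout = io::stdout();
    let mut out = stdout.lock();
    write_stats(&mut out, &[3, 12, 70], 7, Some("notes.txt"))?;
    write_stats(&mut out, &[3, 12, 70], 7, None)
}
```

Taking a generic `Write` also makes the formatting easy to test against an in-memory buffer.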