FEAT CLI conversion#249
Conversation
|
No detailed review from me yet, just some more general things to discuss:
|
|
Ready for review @skops-dev/maintainers :) |
|
Addressing @BenjaminBossan's comments:
|
|
My impression is that some of the things we discuss depend on what the general expectations of a command line tool are. I've never seen the use of (To be clear, I don't think we need rich, it's just in case we want to make the output prettier.) Regarding output dir, we could keep it (but I wonder if
On the other hand, having a way to indicate the output file name is something I'd expect from similar tools, that's why I would have added it (but we could do that later). Regarding the |
|
As for |
adrinjalali
left a comment
There was a problem hiding this comment.
I don't think we need an empty __init__.py file, do we?
Nothing major here, looks pretty good to me. I don't have a strong opinion on print vs logging, but if we're using logging, we should certainly get an instance for this library, and under that a CLI one, instead of using the global instance.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
|
Addressed your PR comments, feel free to have another look @adrinjalali :) As it stands:
I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :') |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Looks quite good, but there still some points to discuss. (Interesting how many design decisions still have to be made for such a seemingly simple task.)
but if we're using logging, we should certainly get an instance for this library,
I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...) and call logger.info(...) etc.
Personally, I still find it strange to use logging here, but if you two agree, it's fine by me.
Users can pass in multiple input files and output file names
Really not sure if we need this or whether it makes our code more complicated then needs be? If users want to convert multiple files, there are already ways to achieve this in the command line, e.g. through globbing or a small bash script.
I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :')
I'm sure you can do it, read the existing docs for persistence and then try to put yourself in the shoes of a user, what information would be most important to them? The outcome doesn't have to be perfect on first try, we can iterate in this review.
Could you please also add an entry to changes.rst?
One more minor thing, when I call skops --help, I get this error:
$ skops --help
usage: Skops {convert}
Skops: error: the following arguments are required: function
|
On holidays, so I'll either review next year or happy for this to be merged, but one think which I think is important is for us to have the |
|
I'll be on holiday too until next year, but will take a look from time to time when I have the opportunity. |
|
Ok, phew, should now be ready to review again @skops-dev/maintainers :) A couple of notes:
|
BenjaminBossan
left a comment
There was a problem hiding this comment.
This looks pretty good, I tested it locally and it works as expected. Thanks also for adding to the docs.
Btw. I checked and it seems we already have a transient dependency on click, so it would theoretically not cost us anything to build our CLI with it. I assume that after all this work, you're not interested in rewriting it with click, which is fine, I just wanted to mention it.
There still a few points left that I commented on, please take a look. Also, I wonder about this comment:
but if we're using logging, we should certainly get an instance for this library,
I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...) and call logger.info(...) etc.
| """Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
| and sets the logger to the required log level""" | ||
|
|
||
| all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
IIUC, warning level logs would always be shown, even if I don't add -v. IMO, this should not be the case. So I would prefer that without -v, the log level should be ERROR or CRITICAL.
There was a problem hiding this comment.
From the python docs:
The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise.
I followed this because it's the default for python packages.
We can adjust the default level for our CLI, or we could also add a -q, --quiet option to reduce the log level
There was a problem hiding this comment.
I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet option, because it increases complexity and because it's unclear how it interacts with -v.
There was a problem hiding this comment.
Hmm, good point.
I would raise that it could be considered an issue if untrusted types were present.
If a user had a file with untrusted types in it, and we converted it silently, they wouldn't know they needed to specify anything as trusted when using it.
There was a problem hiding this comment.
IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True). Personally, I would not report it by default, but I can live with the current way.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
|
Thanks for the review @BenjaminBossan, I've made adjustments along the lines of most of your comments, let me know if there's anything I've missed. WRT logging, I've moved to using a That way, if we do implement any different logging elsewhere in our codebase, it automatically keeps the config specified for whatever command we're calling. Let me know if you think there's a better way to achieve this; Also, the log format currently looks like:
How does this look to all? We can add things like time, function call, etc at a later date if we need it. Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for implementing the changes, this looks great to me now. Let's wait for one more review before merging.
WRT logging, I've moved to using a logger instance, but still set a basic config at the top level for our selected entrypoint, as I felt that the config should be set at the top level for whatever entrypoint we're using.
Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module.
I think it's good for the time being.
Also, the log format currently looks like:
INFO : No unknown types found in abc.
How does this look to all? We can add things like time, function call, etc at a later date if we need it.
Yes, let's keep it simple for now.
|
|
||
| Parameters | ||
| ---------- | ||
| input_file : os.PathLike |
There was a problem hiding this comment.
Hard to say if a user who doesn't know the type would guess that str is a path-like. Anyway, I won't object to leaving it as is, especially since it means that the type annotation and docstring are identical, I just wanted to bring up the possible friction here.
| f"While converting {model_name}, " | ||
| "the following unknown types were found: " | ||
| f"{untrusted_str}. " | ||
| f"When loading {output_file}, add 'trusted=True' to the skops.load call. " |
There was a problem hiding this comment.
I think this touches on a more general point of how we should deal with trusted=True. I posted on discord about it, which is a better place to discuss than here :)
| """Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
| and sets the logger to the required log level""" | ||
|
|
||
| all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet option, because it increases complexity and because it's unclear how it interacts with -v.
| def _convert_file( | ||
| input_file: os.PathLike, | ||
| output_file: os.PathLike, | ||
| logger: logging.Logger = logging.getLogger(), |
There was a problem hiding this comment.
I wonder if we should instantiate the logger just once at the root of the module, and then just set logger=logger here. Is there any disadvantage to that?
There was a problem hiding this comment.
Hmm, we could, although that would mean it gets initialised even if we import functions from the file to use elsewhere.
I don't think it would really impact much, I just personally dislike writing anything other than definitions at the top level
There was a problem hiding this comment.
that would mean it gets initialised even if we import functions from the file to use elsewhere.
That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.
| def _convert_file( | ||
| input_file: os.PathLike, | ||
| output_file: os.PathLike, | ||
| logger: logging.Logger = logging.getLogger(), |
There was a problem hiding this comment.
that would mean it gets initialised even if we import functions from the file to use elsewhere.
That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.
| """Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
| and sets the logger to the required log level""" | ||
|
|
||
| all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True). Personally, I would not report it by default, but I can live with the current way.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
|
@BenjaminBossan do we still want one more review before merging this in? |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks so much for the hard work. This LGTM and can be merged if @adrinjalali is also fine with it.
As mentioned in some of the comments, there are some things that I think can be improved, but they are not blocking this PR, as they can be added later without BC breakage.
adrinjalali
left a comment
There was a problem hiding this comment.
Thanks @E-Aho . I think we might need improve docs when we get back to this, but happy to merge this and have it out.
Addresses #241
This PR adds in an entrypoint,
skops convert, which allows users to automatically convert files from the command line.I didn't end up adding
inspect, as it felt odd to me to inspect a .pkl file which would require loading it intoskopsformat based on how we currently have the safety checks implemented. Happy to implement an entrypoint to do it if other maintainers think it would be useful, it would essentially just need to not save the outputskopsin the current conversion method.