Generate minimal README.md file in repository initialization (issue #113) #207

jucamohedano wants to merge 40 commits into skops-dev:main from
Conversation
adrinjalali
left a comment
Thanks for the PR @jucamohedano
```python
dump_json(Path(dst) / "config.json", config)

def _create_readme(
```
I don't think we need this function.
We can probably simply add these two lines to the init:
```python
model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))
model_card.save(Path(local_repo) / "README.md")
```
That simplifies it a lot! I applied the change :)
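For reference, a minimal sketch of what that two-line change amounts to. The helper name `create_minimal_readme` and the `card_module` parameter are hypothetical; `card.Card` and `card.metadata_from_config` are the skops names from the comment above:

```python
from pathlib import Path

def create_minimal_readme(local_repo, card_module, model):
    """Sketch: build a model card from the repo's config and save it as README.md.

    ``card_module`` stands in for ``skops.card``; this helper only illustrates
    the two lines suggested in the review comment, it is not the real API.
    """
    model_card = card_module.Card(
        model, metadata=card_module.metadata_from_config(Path(local_repo))
    )
    model_card.save(Path(local_repo) / "README.md")
```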
merveenoyan
left a comment
Thanks a lot, I left some very general feedback 🙂
```python
    data=iris.data,
)
_validate_folder(path=dir_path)
assert os.path.isfile(Path(dir_path) / "README.md")
```
I wonder if we should simply add it under test_init to deduplicate (essentially everything before this step is same and this behavior will be a part of init anyway)? WDYT? @adrinjalali @jucamohedano
That's true, this specific one can be in test_init.
I agree. Made the change.
```python
t1 = os.path.getmtime(Path(dir_path) / "README.md")

# compare the times at which the files were last modified
assert t0 != t1
```
I think instead of this we can render the content of the model card and compare whether it's the same or not, WDYT? @adrinjalali @jucamohedano
We want to make sure the content is modified, so it would be nice to have the former.
I made the change :)
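A minimal sketch of the content-based check, using only the stdlib; the `write_text` calls stand in for the model card being saved and re-saved:

```python
import tempfile
from pathlib import Path

# Sketch: compare the rendered README content before and after the update,
# rather than comparing file modification times (which can be flaky).
with tempfile.TemporaryDirectory() as dir_path:
    readme = Path(dir_path) / "README.md"
    readme.write_text("initial card content")  # stands in for the first save
    before = readme.read_text()
    readme.write_text("updated card content")  # stands in for the re-save
    after = readme.read_text()
    assert before != after
```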
adrinjalali
left a comment
Thanks @jucamohedano , it's almost ready to be merged :)
```python
@pytest.mark.network
@pytest.mark.inference
```
this was probably removed by mistake
Thank you! I removed that somehow by mistake. I added it back.
```python
# test inference backend for classifier and regressor models. This test can
# take a lot of time and be flaky.
# test inference backend for classifier and regressor models.
```
Thanks again! I added it back :)
@BenjaminBossan any idea why the CI fails here? It's really odd to me.

I didn't follow this PR along. Is the expectation here that the error should be caught by the

Oh I see where the issue comes from. The error is raised inside this test:

So far, we have allowed people to pass an empty model file; we just raise a warning if they do so. Now, this PR tries to actually load the model file and do things with it, which means it expects a valid model file. So, @BenjaminBossan, the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

I would be okay with that; I don't really see a use case where you'd start out without a model.

#214 should fix the issue.

@jucamohedano note that I've merged with latest

@jucamohedano would you be able to debug the issue?
As a user, I would be surprised to find out that

Overall, this has become much more complicated than I initially expected. I wondered if we could just use a dummy model here, since the model card needs to be edited later by the user anyway, but I think that could also be confusing. So in sum, I'm not sure what the best way is. @adrinjalali what would you suggest?

These are all good points. It makes me think that probably the best way is to get a model object and the path to where it's supposed to have been saved, and not load the model in

I agree with the suggestion from @adrinjalali, otherwise it looks like we would have to hard code the

Yes, that might be the better solution. I'm not 100% sure there won't be new issues with that approach, but it's worth trying.

Maybe we run into some issue, but we'll see. Shall we keep working on this PR or open a new one?

Do whatever you feel more comfortable with.

I have written an implementation to test in my local setup where I have introduced a new argument to the

I will remove unnecessary changes from the diff in future commits!

I would prefer for the change not to be backwards incompatible, i.e. all existing tests should still pass. I have a suggestion; not sure if there is something I missed that would invalidate it, let me know: The

That doesn't sound too bad to me; at some point we might want to deprecate passing a path to the model.

I totally agree with the implementation, and that's why I decided to go ahead and implement it.

Hi, I'm not sure what to do about the failing example in the docs build. Since
Maybe I misunderstand your code, but it seems like it doesn't do what we discussed, resulting in the error? IIUC,
Thank you for pointing that out @BenjaminBossan, I realised the code wasn't doing what was discussed. I reverted to the behaviour of creating a README if model is a str/Path, and otherwise handling it if model is an estimator object. I believe the code should be doing that now. However, docs still fail to build on the example

Mistakenly closed the PR while scrolling.

Thanks for the updates @jucamohedano. This is not a full review, as I think there are still some bigger changes needed. Most notably, I feel like

Maybe I misunderstand, but shouldn't it be the other way round? What I mean is that the README.md created by the example is the one that contains all the important information, and should thus take precedence over the README.md that was automatically created.

I will continue working on the issue starting next week and give some feedback on my work. Thank you for the feedback!

Hey! Thanks for the suggestion about

Thanks for making it clear. The call to
BenjaminBossan
left a comment
Thanks for providing updates to bring this PR closer to being merged. I have made a few comments and had some questions for clarification, please take a look.
We could address this problem in hub_utils.add_files to account for the existing file created by the call to hub_utils.init.
add_files already has an option, exist_ok, which can be set to True to allow overriding.
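A hedged sketch of those `exist_ok` semantics with a stand-in helper (`add_file` below is hypothetical, not the real `hub_utils.add_files` implementation):

```python
import shutil
from pathlib import Path

# Hypothetical stand-in for hub_utils.add_files: refuse to overwrite an
# existing file (such as the README.md created by init) unless exist_ok=True.
def add_file(src, dst_dir, exist_ok=False):
    target = Path(dst_dir) / Path(src).name
    if target.exists() and not exist_ok:
        raise FileExistsError(
            f"{target} already exists; pass exist_ok=True to override it"
        )
    shutil.copy(src, target)
```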
```python
model = _load_model(model, trusted=True)
model_card = card.Card(model, metadata=card.metadata_from_config(dst))
model_card.save(dst / "README.md")
elif isinstance(model, BaseEstimator):
```
I don't think we need the check for BaseEstimator here, so this can just be else and the error below can be removed. There could be valid models here that don't inherit from BaseEstimator; it is the user's responsibility to provide an sklearn-compatible model.
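The suggested dispatch, sketched with a hypothetical helper (`resolve_model` and its `load` parameter are illustrative names; `load` stands in for the loading step):

```python
from pathlib import Path

# Sketch of the suggested dispatch: only str/Path inputs are treated as files
# to load; any other object is assumed to be the model itself. There is no
# isinstance(model, BaseEstimator) check -- sklearn compatibility is the
# user's responsibility (duck typing).
def resolve_model(model, load):
    if isinstance(model, (str, Path)):
        return load(model)
    return model
```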
```python
if model_format == "auto":
    model_format = "skops"
elif model_format in ["pkl", "pickle", "joblib"]:
    model_format = "pickle"
```
We need an else clause below, because if the user passes model_format="skosp" or another typo, it would just be accepted as is, even though it's invalid. Then further below, the model would not be saved, without any indication that something went wrong.
Essentially, the model format checking logic should be the same as inside _create_config. And since the model format check is now performed in init, we should also be able to remove it from _create_config completely.
If so, I can remove that logic from _create_config and have it just in init. I actually started this PR by looking at _create_config. Let me know if you would like to make this change now.
Yes, I think that change should be good.
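A sketch of the check with the missing else branch, patterned on the snippet quoted below; `normalize_model_format` is a hypothetical name, and treating "skops" as an accepted pass-through value is an assumption:

```python
# Hypothetical sketch of the validation discussed above: normalize the model
# format and reject typos such as "skosp" instead of silently accepting them.
def normalize_model_format(model_format):
    if model_format == "auto":
        return "skops"
    if model_format in ("pkl", "pickle", "joblib"):
        return "pickle"
    if model_format == "skops":
        return "skops"
    # the else clause the review asks for: fail loudly on invalid input
    raise ValueError(f"Invalid model format: {model_format!r}")
```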
```python
version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
```
```diff
-# joblib type falls unders auto format, explicityly set to auto
+# joblib type falls under auto format, explicitly set to auto
```
```python
version = metadata.version("scikit-learn")
_, model_format = config_json
# joblib type falls unders auto format, explicityly set to auto
# because we can't repeat key "auto" in CONFIG dict
```
Sorry, I don't understand that, could you please explain further?
Sorry, that comment isn't self-explanatory. The dictionary CONFIG in test_hf_hub contains 3 types of models. I wanted to test for a model named model.joblib, so I added that model to the dictionary. If I recall correctly, adding the key:value pair for the joblib model made other tests fail. But I could fix this by changing the joblib model type to auto and still keep the name model.joblib to test for.
Hmm, I guess it depends on why the other tests fail. If this uncovers a bug in existing code, that would be good to know. Otherwise, I think it's okay to have a separate test for joblib and not change CONFIG.
```python
model_card = RepoCard.load(Path(dir_path) / "README.md")
model_card.data.license

# override existent modelcard created by init with license attribute
```
For my better understanding, is this testing some new behavior added by this PR or some general behavior?
No, this is not a new behavior added by this PR. Now that you call this out, I see that maybe we can make this test simpler by just checking that the README.md exists after the call to init. I think that at the time of writing the test, I decided to update the README.md as a way of checking that it exists.
> we can make this test simpler by checking that the README.md exists after the call to init and that's it?
Yes, that sounds like it is sufficient as a test.
We can close this since we've deprecated
This PR references issue #113.

Implemented the function `_create_readme`, which is called in `hub_utils.init`. It creates a minimal README.md file. The input to this function is the same as the input to the `_create_config` function.

At the moment, `_create_readme` only supports a `tabular` task for the widget. I haven't added the `text` task since other functions such as `metadata_from_config` in `_model_card.py` don't support the `text` task yet. Let me know if it should be added regardless.

I have manually tested the initialization of a repository with the example script `plot_hf_hub.py`, and it seemed to work as expected.