Cache model loading in model card #299
Conversation
Thank you for taking this issue. I'm unsure about the implementation here, which I admit is not a trivial matter. If I understand correctly, you're using an `lru_cache` on the loading function. To me, this seems a bit "hacky", and I would like to suggest a different approach. In my suggestion, we would leave `_load_model` as is. I did a quick and dirty implementation of how it could look (in ...):
```python
from hashlib import sha256
from functools import cached_property
...

class Card:
    def __init__(...):
        ...
        self._model_hash = ""
        self._populate_template()

    def get_model(self) -> Any:
        """..."""
        if isinstance(self.model, (str, Path)) and hasattr(self, "_model"):
            hash_obj = sha256()
            buf_size = 2 ** 20  # load in chunks to save memory
            with open(self.model, "rb") as f:
                for chunk in iter(lambda: f.read(buf_size), b""):
                    hash_obj.update(chunk)
            model_hash = hash_obj.hexdigest()
            # if hash changed, invalidate cache by deleting attribute
            if model_hash != self._model_hash:
                del self._model
                self._model_hash = model_hash
        return self._model

    @cached_property
    def _model(self):
        model = _load_model(self.model, self.trusted)
        return model
```

What do you think about that? Of course, it would require some comments and tests, but I hope you get the general idea.
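For readers less familiar with the trick used here: `functools.cached_property` stores the computed value in the instance's `__dict__`, so deleting the attribute (as `get_model` does with `del self._model`) drops the cached value and the next access recomputes it. A minimal standalone illustration (toy class, not skops code):

```python
from functools import cached_property


class Expensive:
    def __init__(self):
        self.n_loads = 0  # counts how often the "model" is actually loaded

    @cached_property
    def model(self):
        self.n_loads += 1
        return {"weights": [1, 2, 3]}


obj = Expensive()
obj.model  # first access computes the value and caches it in obj.__dict__
obj.model  # second access is served from the cache
assert obj.n_loads == 1

del obj.model  # invalidate the cache, mirroring `del self._model` above
obj.model      # recomputed on the next access
assert obj.n_loads == 2
```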
It's okay, the diff is still shown correctly, right? However, please make sure to correct this for the next PR.
I understand your approach and I agree that it fits better conceptually. I took your implementation because it works out of the box, no errors. I wrote a test for it for which I would appreciate some feedback. Thanks a lot for the help on this PR!
Thanks for the updates. I haven't done a proper review yet, but I saw that some changes were unrelated to the additions of this PR. Could you please clean those up? Maybe they were changed by your IDE automatically? Also, it seems that there are ... Finally, the docs are not building. I think it's the same issue as in #207, so whatever fixes that should work here too.
Thank you for your comments. I'm in the process of fixing the docs error; I have asked a question about that in #207.
There's a merge conflict here, and somehow the CI hasn't run completely; could you please merge with `main`?
BenjaminBossan
left a comment
Thanks for updating the PR. It still shows a lot of unrelated changes in the diff, could you please remove them?
Regarding the test, I have to admit I don't quite understand it. For example, what does this test?
```python
assert str(card._model_hash) == card.__dict__["_model_hash"]
```
I think what I would like to see is a more high-level test, i.e. nothing that involves any hashes, since those are implementation details. One way would be to mock `_load_model` and assert that, when `card.get_model()` is called, `_load_model` is only called once, the first time, and after that it's not called anymore. Then, only when the underlying model is overwritten should it be called again. WDYT?
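The pattern proposed here — wrap the loader so its calls are counted, call `get_model()` twice, and assert on the call count — can be sketched independently of skops with a toy class (names below are illustrative, not the actual skops API):

```python
from unittest import mock


def load_model(path):
    # stand-in for the real loader (skops' _load_model)
    return {"path": path}


class ToyCard:
    """Toy card that lazily loads and caches its model."""

    def __init__(self, path, loader=load_model):
        self.model = path
        self._loader = loader
        self._cached = None

    def get_model(self):
        if self._cached is None:
            self._cached = self._loader(self.model)
        return self._cached


# wrap the loader in a Mock so calls are counted but behavior is unchanged
mock_loader = mock.Mock(side_effect=load_model)
card = ToyCard("model.skops", loader=mock_loader)

first = card.get_model()
second = card.get_model()
assert first is second              # the same cached object is returned
assert mock_loader.call_count == 1  # loaded exactly once, then cached
```

In the real test, `mock.patch("skops.card._model_card._load_model")` plays the role of the injected `loader` argument used here.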
Hey! Sorry it's been a while. Now that I look back at the test I wrote, I think the line you highlighted is not testing anything; I'm not sure what I was thinking at the time. I agree with your proposal to check that everything works rather than checking the details as I was trying to do. I'm happy to update the test. I'm not sure how to get rid of all of the unrelated changes in the diff: I have modified 3 of the 7 files, and the other changes were applied after I ran pre-commit manually on all files. I will merge with `main`.
Okay, thanks for clearing that up.
Let's see if that works. Otherwise, in the worst case, you could try opening a new PR based on the latest `main`.
If we go about mocking `_load_model` ...
Are you sure that we need ...
…nto cache-model-loading
Ah okay! I have implemented it using ... I have also got rid of the files in the diff that weren't supposed to be there by reverting the changes to those files.
Well done with reverting the changes, this is now much easier to review, thanks.
There isn't much work left. Regarding the test, I think it can be improved a bit; please take a look at my suggestion. Other than that, please add an entry to `docs/changes.rst`. Then this should be good to go.
The failing CI job is unrelated to this PR, so please ignore it.
```python
# _load_model gets called
card = Card(iris_skops_file, metadata=metadata_from_config(destination_path))
with mock.patch("skops.card._model_card._load_model") as mock_load_model:
    model1 = card.get_model()
    model2 = card.get_model()
    assert model1 is model2
    # model is cached, hence _load_model is not called
    mock_load_model.assert_not_called()
    # update card with new model
    new_model = LogisticRegression()
    _, save_file = save_model_to_file(new_model, ".skops")
    del card.model
    card.model = save_file
    model3 = card.get_model()  # model gets cached
    model4 = card.get_model()
    assert model3 is model4
    assert mock_load_model.call_count == 1
```
I see the intent with this test, but I think it's problematic that `del card.model` and `card.model = save_file` are being used. As a skops user, I wouldn't do that, and I would still expect the cached model loading to work correctly. Therefore, I made some changes to the test so that these lines are not needed:
```diff
- # _load_model gets called
- card = Card(iris_skops_file, metadata=metadata_from_config(destination_path))
- with mock.patch("skops.card._model_card._load_model") as mock_load_model:
-     model1 = card.get_model()
-     model2 = card.get_model()
-     assert model1 is model2
-     # model is cached, hence _load_model is not called
-     mock_load_model.assert_not_called()
-     # update card with new model
-     new_model = LogisticRegression()
-     _, save_file = save_model_to_file(new_model, ".skops")
-     del card.model
-     card.model = save_file
-     model3 = card.get_model()  # model gets cached
-     model4 = card.get_model()
-     assert model3 is model4
-     assert mock_load_model.call_count == 1
+ new_model = LogisticRegression(random_state=4321)
+ # mock _load_model, it still loads the model but we can track call count
+ mock_load_model = mock.Mock(side_effect=load)
+ card = Card(iris_skops_file, metadata=metadata_from_config(destination_path))
+ with mock.patch("skops.card._model_card._load_model", mock_load_model):
+     model1 = card.get_model()
+     model2 = card.get_model()
+     assert model1 is model2
+     # model is cached, hence _load_model is not called
+     mock_load_model.assert_not_called()
+     # override model with new model
+     dump(new_model, card.model)
+     model3 = card.get_model()
+     assert mock_load_model.call_count == 1
+     assert model3.random_state == 4321
+     model4 = card.get_model()
+     assert model3 is model4
+     assert mock_load_model.call_count == 1  # cached call
```
(line 3: `load` needs to be imported from `skops.io`)
This test is similar to yours but is closer to how a user would actually use the model card. Please take a look and see if you agree. It would also be good to have a comment at the start of the test explaining what is being tested.
I can see how a user would go with your approach rather than mine. I definitely agree with your suggestion.
Tested your suggestion and it passes as expected. I also added a short comment at the beginning of the function describing what it tests.
BenjaminBossan
left a comment
Thx. This LGTM. @adrinjalali not sure if you want to review too; if not, feel free to merge.
adrinjalali
left a comment
Just a nit, otherwise LGTM.
```python
def get_model(self) -> Any:
    """Returns sklearn estimator object.

    If the ``model`` is already loaded, return it as is. If the ``model``
    attribute is a ``Path``/``str``, load the model and return it.
    """
```
Reverted them! Sorry about that, I will pay attention to it next time.
Implementation of cache model loading discussed in issue #243
This PR includes the following changes:
- `_load_model` function to implement model caching
- `hash_model` function that is used as a decorator on the `_load_model` function
- `lru_cache` decorator to implement caching on top of the `hash_model` and `_load_model` functions
- `test_load_model` to test for cache model loading
- `test_hash_model` test implemented to test the `hash_model` function