Skip to content

Added support for Structured Arrays#211

Closed
aliosmankaya wants to merge 11 commits intoskops-dev:mainfrom
aliosmankaya:arr-support
Closed

Added support for Structured Arrays#211
aliosmankaya wants to merge 11 commits intoskops-dev:mainfrom
aliosmankaya:arr-support

Conversation

@aliosmankaya
Copy link
Copy Markdown

Added support for Structured Arrays to _get_column_names function. So, with that updates data could be:

  • data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'f4')]) # ["foo", "bar"]
  • data = np.zeros(3, dtype='int8, float32, float64') # ["f0", "f1", "f2"] (Default names)
  • data = np.zeros(3) # ["x0"]

@adrinjalali
Copy link
Copy Markdown
Member

Thanks for the PR @aliosmankaya .

Could you please also add tests for the cases you're adding support for?

We should also make sure when the user uses these types, the inference side of things work. Which means we should test that as well, and figure out how to pass the data to the model on the api-inference-community side.

@aliosmankaya
Copy link
Copy Markdown
Author

@adrinjalali can you review the tests?

@adrinjalali
Copy link
Copy Markdown
Member

Those are nice tests, thanks @aliosmankaya

What I'm wondering, is what would happen if we train a scikit-learn model, persist it, push it to the hub, and try to get inference from that model using hug_utils.get_model_output.

@aliosmankaya
Copy link
Copy Markdown
Author

@adrinjalali you mean, you want to use _get_column_names in get_model_output ?

If so, I reviewed the function and added support to fix structured array bug may be in the future.
If not, can you explain in more detail?

@adrinjalali
Copy link
Copy Markdown
Member

Here's an end to end test:

  • use structured arrays to train a model
  • push that model to the hub with a valid metadata and README
  • try to use the same data to get model output from the hub from that model

Note that on the hub side, things are handled by this repo: https://github.com/huggingface/api-inference-community/tree/main/docker_images/sklearn , so we might need to fix thing on that side as well.

@aliosmankaya
Copy link
Copy Markdown
Author

Hi @adrinjalali, sorry for the late response.

To test the hub-side, I pushed a Linear Regression model that trained on iris dataset that converted to structured array.
figure1
Then, when I try the get_model_output with my x_test, it throws me an error like TypeError: Object of type ndarray is not JSON serializable.
So I thought the get_model_output won't work with a structured array and then I fix it with my last commit. Also added support for 1-dimensional structured arrays too. Examples like:

  • X_array = np.zeros(3, dtype="int8, float32, float64")
  • X_array = np.array([[(6.1,)], [(5.5,)]], dtype=[('col1', '<f8')])

Can you review the commit?

col: (data[:, idx] if len(data.shape) > 1 else data[idx])
for idx, col in enumerate(_get_column_names(data))
}
if data.dtype.names:
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that condition, we can figure out the data is a structured array or a regular numpy array.

Comment thread skops/hub_utils/_hf_hub.py Outdated
if data.dtype.names:
inputs = {
col: (
[(x[0] if len(data.shape) > 1 else x) for x in data[idx].tolist()]
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For converted the structured array data to a list with iteration, check the data shape for index operation.

@merveenoyan
Copy link
Copy Markdown
Collaborator

@aliosmankaya the error is due to likely how the inference in api-inference-community doesn't support ndarray. (my guess, at least)
Would you like to fix that as well? I can help you out to get a local build to test on that side if you want.

@aliosmankaya
Copy link
Copy Markdown
Author

@merveenoyan sure, I'll look.

@aliosmankaya
Copy link
Copy Markdown
Author

Hey @adrinjalali @merveenoyan , to expanded the test, I added new models to my account . Thus I tested structured and regular numpy arrays both.

On the test side, I used InferenceAPI from hugginface_hub and api-inference-community inference with local build.

#Test1: model

data = np.array([1, 2])
# inputs condition with the changes...
inputs = {"data": {"x0": [1, 2]}}
InferenceApi(repo_id="aliosmankaya/reg_arr_model_1_dim")(inputs)

#Test2: model

data = np.array([[1, 2], [3, 4]])
# inputs condition with the changes...
inputs = {"data": {"x0": [1, 3], "x1": [2, 4]}}
InferenceApi(repo_id="aliosmankaya/reg_arr_model_2_dim")(inputs)

#Test3: model

data = np.array([(6.1,), (5.5,)], dtype=[("col1", "<f8")])
# inputs condition with the changes...
inputs = {"data": {"x0": [6.1, 5.5]}}
InferenceApi(repo_id="aliosmankaya/structured_arr_model_1_dim")(inputs)

So the point is, we prepare inputs from here and used by there . That means, we don't need to fix anything on the api-inference-community for get_model_output because we prepare inputs in skops. What do you think?

@merveenoyan
Copy link
Copy Markdown
Collaborator

merveenoyan commented Dec 2, 2022

@aliosmankaya as we talked on twitter, it looks good to me. I think it's good that you make the necessary changes in skops side instead of api-inference-community, but I'd like to hear what others say.
(also we're on discord: hf.co/join/discord if you'd like to reach out to us easier 😅)

@adrinjalali
Copy link
Copy Markdown
Member

adrinjalali commented Dec 5, 2022

Thanks for the work @aliosmankaya . A few things before I could review this:

The feature names you're using are the same as the autogenerated ones on the inference side.

Could you please:

  • create a repo with a model where the model is trained on structured array input and the feature names are non-trivial
  • upload the model to the hub under your account with a valid README.md file so that the widget on the repo is enabled
  • test that using the widget, as well as calling the inference API with skops.hub_utils.get_model_output give the same output as loading the model locally.

Also, for reproducibility, could you please include the code generating the repo / model in the repo you create?

@aliosmankaya
Copy link
Copy Markdown
Author

hey @adrinjalali , I just updated the models and also uploaded its codes. Can you review it again? Also, models works on widget as well.

@adrinjalali
Copy link
Copy Markdown
Member

@aliosmankaya sorry but I can't find the changes you're referring to. The links you have in this comment (#211 (comment)) still have models which have x0, ... as input variables.

@adrinjalali
Copy link
Copy Markdown
Member

Note that the issue might be that sklearn is not detecting feature names from a structured array (that might be very well the case), in which case, then we can't support them either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants