Add support for multilingual-e5-small: Fixes #123 #190

Closed
Ya-shh wants to merge 8 commits into qdrant:main from Ya-shh:e5_small

Conversation

@Ya-shh

@Ya-shh Ya-shh commented Apr 7, 2024

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Ya-shh
Author

Ya-shh commented Apr 7, 2024

@Anush008 could you please review this?

@Ya-shh
Author

Ya-shh commented Apr 7, 2024

@Anush008 The test in fastembed/tests/test_text_onnx_embeddings.py is failing and indicates that different canonical vector values are needed. I think we need to adjust that test: we hit a similar issue previously with the e5-large-instruct model (#181), and once I updated those values, all tests passed. Here's the logged error:
E AssertionError: intfloat/multilingual-e5-small
E assert False
E + where False = <function allclose at 0x101d59d30>(array([ 0.04931236, 0.02415175, -0.0384715 , -0.08884481, 0.08710264],\n dtype=float32), array([ 0.03131689, 0.03093922, -0.03511665, -0.06727391, 0.08508426]), atol=0.001)
E + where <function allclose at 0x101d59d30> = np.allclose

Since this model is already converted to ONNX on HF (https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx), the canonical vector values of this model can't be incorrect.
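To make the size of the mismatch concrete, the comparison from the assertion can be reproduced with the two arrays printed in the log. This is a minimal sketch using only NumPy, not the actual test code; the variable names `computed` and `canonical` are illustrative, and which array came from the model versus the test fixture is assumed from the argument order in the log.

```python
import numpy as np

# Values copied from the assertion error in the log.
# Assumption: the first array is the model output, the second the expected values.
computed = np.array(
    [0.04931236, 0.02415175, -0.0384715, -0.08884481, 0.08710264],
    dtype=np.float32,
)
canonical = np.array([0.03131689, 0.03093922, -0.03511665, -0.06727391, 0.08508426])

# Element-wise absolute difference between the two vectors.
abs_diff = np.abs(computed - canonical)
max_diff = float(abs_diff.max())

# Same check as in the failing assertion: np.allclose with atol=0.001.
passes = bool(np.allclose(computed, canonical, atol=1e-3))

print("absolute differences:", abs_diff)
print("largest difference:", max_diff)
print("passes with atol=1e-3:", passes)
```

With these numbers the largest gap is roughly 0.02, so the check would only pass with a tolerance about twenty times looser than `atol=0.001`.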

@Anush008
Member

Anush008 commented Apr 7, 2024

The CI logs a different issue though.
https://github.com/qdrant/fastembed/actions/runs/8590817987/job/23538844967#step:5:2066

Due to

def _preprocess_onnx_input(self, onnx_input: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """
    Preprocess the onnx input.
    """
    onnx_input.pop("token_type_ids", None)
    return onnx_input
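To illustrate what this method does to the model inputs, here is a standalone sketch with made-up token arrays. The function name `preprocess_onnx_input` and the sample values are hypothetical; only the `pop` call mirrors the snippet above. If an ONNX graph declares `token_type_ids` as a required input, a feed without that key would be rejected at run time. Whether that is the exact failure in the CI log is an assumption, since the log is not reproduced here.

```python
from typing import Dict

import numpy as np


def preprocess_onnx_input(onnx_input: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Standalone copy of the preprocessing step quoted above (hypothetical name)."""
    # Drop token_type_ids if present; the None default avoids a KeyError when it is absent.
    onnx_input.pop("token_type_ids", None)
    return onnx_input


# Made-up tokenizer output for one short sequence (values are illustrative only).
sample = {
    "input_ids": np.array([[0, 1234, 2]], dtype=np.int64),
    "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
    "token_type_ids": np.array([[0, 0, 0]], dtype=np.int64),
}

remaining_keys = sorted(preprocess_onnx_input(sample).keys())
print(remaining_keys)
```

Note that `pop` mutates the dictionary in place, so the caller's copy of the inputs also loses `token_type_ids`.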

@Ya-shh
Author

Ya-shh commented Apr 7, 2024

Yes, I agree with you. Interestingly, after updating the atol values, all the tests passed successfully:

platform darwin Python 3.10.10, pytest-8.1.1, pluggy-1.4.0
rootdir: /Users/yash/PycharmProjects/e5-instruct
plugins: anyio-4.3.0, asyncio-0.23.6
asyncio: mode=strict
collected 8 items

fastembed/tests/test_sparse_embeddings.py ... [ 37%]
fastembed/tests/test_text_onnx_embeddings.py ..... [100%]

@Ya-shh
Author

Ya-shh commented Apr 7, 2024

I think the issue might be due to the atol values in the test.

@Ya-shh Ya-shh marked this pull request as draft April 8, 2024 04:45
@Ya-shh Ya-shh changed the title from "Added support for multilingual-e5-small: Fixes#123" to "Add support for multilingual-e5-small: Fixes#123" Apr 8, 2024
@Ya-shh Ya-shh closed this Apr 12, 2024