feat: add json fts in python #5020
a372b10 to d7a8cb1
westonpace
left a comment
Do we have a plan to add some docs somewhere explaining how this works? Just from looking at the API I (knowing little about what changes we've done here) am a little confused by the difference between lance_tokenizer and base_tokenizer.
Hi @westonpace, yes, we have a document for this: #4865
Related to #4524
d7a8cb1 to 87b8e74
Hi @BubbleCal @westonpace, could you help review this when you have time? Thanks very much!
87b8e74 to 8ecfdca
This doesn't seem to be what I expected, why do we need the new param If so, can we just rename this param to
8ecfdca to ded279f
Hi @BubbleCal, thanks for your suggestion! Inferring lance_tokenizer from the storage type is a good idea, since it simplifies the API. Exposing lance_tokenizer (or content_type) would provide more flexibility: if a column is stored as string instead of pa.json_() but its content is JSON, you could still create a JSON full-text index by configuring lance_tokenizer. Ultimately, I agree with the type-inference approach. If we truly need that flexibility in the future, we can revisit the idea of exposing content_type.
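To make the trade-off above concrete, here is a minimal sketch of the inference rule. It is not the code in this patch: `infer_lance_tokenizer`, `looks_like_json`, the `override` argument, and the `"json"` / `"text"` values are hypothetical names chosen for illustration, and the storage type is modelled as a plain string rather than a real Arrow type.

```python
import json
from typing import Optional

# Hypothetical tokenizer names, for illustration only (not Lance's API).
JSON_TOKENIZER = "json"
TEXT_TOKENIZER = "text"


def infer_lance_tokenizer(storage_type: str, override: Optional[str] = None) -> str:
    """Pick a document tokenizer from the column's storage type.

    `storage_type` is a plain string standing in for the Arrow type
    ("json" for pa.json_(), "string" for pa.string()).
    `override` models an explicitly exposed lance_tokenizer / content_type.
    """
    if override is not None:
        # An explicit setting wins over inference.
        return override
    return JSON_TOKENIZER if storage_type == "json" else TEXT_TOKENIZER


def looks_like_json(value: str) -> bool:
    """Return True if `value` parses as a JSON object or array."""
    try:
        parsed = json.loads(value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, (dict, list))
```

With inference only, a string column holding JSON text would get the plain-text tokenizer; the `override` path is the extra flexibility that exposing content_type would add later.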
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## main #5020 +/- ##
==========================================
+ Coverage 81.65% 81.72% +0.07%
==========================================
Files 340 340
Lines 138551 138615 +64
Branches 138551 138615 +64
==========================================
+ Hits 113138 113290 +152
+ Misses 21645 21578 -67
+ Partials 3768 3747 -21
I updated the patch following the suggestion. The failed unit test is unrelated. Hi @BubbleCal, please let me know your thoughts. Thanks very much!
Related to lance-format#4749
Co-authored-by: lijinglun <lijinglun@bytedance.com>
Related to #4749