feat: add json fts in python #5020

Merged
BubbleCal merged 2 commits into lance-format:main from wojiaodoubao:expose-lance-tokenizer-in-python-api
Oct 27, 2025

Conversation

@wojiaodoubao
Contributor

Related to #4749

@github-actions github-actions bot added the enhancement and python labels Oct 21, 2025
@wojiaodoubao wojiaodoubao force-pushed the expose-lance-tokenizer-in-python-api branch from a372b10 to d7a8cb1 Compare October 21, 2025 11:35
Member

@westonpace westonpace left a comment

Do we have a plan to add some docs somewhere explaining how this works? Just from looking at the API I (knowing little about what changes we've done here) am a little confused by the difference between lance_tokenizer and base_tokenizer.

@wojiaodoubao
Contributor Author

> Do we have a plan to add some docs somewhere explaining how this works? Just from looking at the API I (knowing little about what changes we've done here) am a little confused by the difference between lance_tokenizer and base_tokenizer.

Hi @westonpace, yes, we have a document for this: #4865

@BubbleCal
Contributor

related to #4524

@wojiaodoubao wojiaodoubao force-pushed the expose-lance-tokenizer-in-python-api branch from d7a8cb1 to 87b8e74 Compare October 23, 2025 09:39
@wojiaodoubao
Contributor Author

Hi @BubbleCal @westonpace, could you help review this when you have time? Thanks very much!

@wojiaodoubao wojiaodoubao force-pushed the expose-lance-tokenizer-in-python-api branch from 87b8e74 to 8ecfdca Compare October 23, 2025 09:45
@BubbleCal
Contributor

This is not what I expected. Why do we need the new param lance_tokenizer? My understanding is that we can just use the JSON tokenizer if the column is JSON. Is it because we can't infer this from the column data type?

If so, can we just rename this param to content_type?

@wojiaodoubao wojiaodoubao force-pushed the expose-lance-tokenizer-in-python-api branch from 8ecfdca to ded279f Compare October 23, 2025 13:58
@wojiaodoubao wojiaodoubao changed the title feat: expose lance_tokenizer in python api feat: add json fts in python Oct 23, 2025
@wojiaodoubao
Contributor Author

Hi @BubbleCal, thanks for your suggestion! Inferring lance_tokenizer from the storage type is a good idea, as it simplifies things at the API level.

Exposing lance_tokenizer (or content_type) would provide more flexibility: if a column is stored as a string rather than pa.json_() but its content is JSON, you could still create a JSON full-text index by configuring lance_tokenizer.

Ultimately, I agree with the approach of using type inference. If we truly need flexibility in the future, we can revisit the idea of exposing content_type.
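
To make the trade-off above concrete, here is a minimal, standard-library-only sketch. It is not Lance's implementation or API: `infer_content_type`, its `override` argument, and `json_tokens` are hypothetical names used for illustration, and the token shape (a path paired with a lowercased word) is an assumption. It shows the default being inferred from the storage type, and how an explicit override would cover a string column that happens to hold JSON.

```python
import json
from typing import Iterator, Optional, Tuple


def infer_content_type(storage_type: str, override: Optional[str] = None) -> str:
    """Pick "json" or "text" for a column.

    Hypothetical helper, not part of Lance: an explicit override wins,
    otherwise the choice follows the column's storage type.
    """
    if override is not None:
        return override
    return "json" if storage_type == "json" else "text"


def json_tokens(doc: str) -> Iterator[Tuple[str, str]]:
    """Yield (path, word) pairs for every string leaf of a JSON document.

    Assumed token shape for illustration only.
    """
    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                yield from walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            # List items share the path of the list itself.
            for item in node:
                yield from walk(item, path)
        elif isinstance(node, str):
            for word in node.lower().split():
                yield path, word
        # Numbers, booleans, and null are skipped in this sketch.

    yield from walk(json.loads(doc), "")


if __name__ == "__main__":
    # Inferred from the storage type: no extra parameter needed.
    print(infer_content_type("json"))
    # A string column holding JSON needs the explicit override.
    print(infer_content_type("string", override="json"))
    print(list(json_tokens('{"title": "Lance Index", "tags": ["full text"]}')))
```

With inference alone, the second case would fall back to plain text tokenization, which is the flexibility being deferred here.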

@codecov-commenter

Codecov Report

❌ Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.72%. Comparing base (cd910bf) to head (ded279f).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...x/src/scalar/inverted/tokenizer/lance_tokenizer.rs | 60.86% | 8 Missing and 1 partial ⚠️ |
| rust/lance-index/src/scalar/inverted/builder.rs | 66.66% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5020      +/-   ##
==========================================
+ Coverage   81.65%   81.72%   +0.07%     
==========================================
  Files         340      340              
  Lines      138551   138615      +64     
  Branches   138551   138615      +64     
==========================================
+ Hits       113138   113290     +152     
+ Misses      21645    21578      -67     
+ Partials     3768     3747      -21     
| Flag | Coverage Δ |
|---|---|
| unittests | 81.72% <81.35%> (+0.07%) ⬆️ |

@wojiaodoubao
Contributor Author

I updated the patch following the suggestion. The failed unit test is unrelated. Hi @BubbleCal, please let me know your thoughts. Thanks very much!

Contributor

@BubbleCal BubbleCal left a comment

LGTM

@BubbleCal BubbleCal merged commit 5958ede into lance-format:main Oct 27, 2025
26 of 27 checks passed
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Related to lance-format#4749

---------

Co-authored-by: lijinglun <lijinglun@bytedance.com>