feat: tokenize endpoints + property-level TextAnalyzer + StopwordPresets (Weaviate 1.37.0+) #329
Merged
mpartipilo merged 6 commits into `main` on May 5, 2026
Summary - Weaviate C# Client Coverage
- Weaviate.Client - 49.2%
- Weaviate.Client.Analyzers - 0%
- Weaviate.Client.VectorData - 50.3%
Port of python-client PR #2012, aligned with the TS client's `tokenize`
namespace design. Adds:
- `client.Tokenize.Text(text, tokenization, analyzerConfig?, stopwordPresets?)`
→ POST /v1/tokenize
- `collection.Tokenize.Property(propertyName, text)`
→ POST /v1/schema/{class}/properties/{prop}/tokenize
Version-gated at 1.37.0 via `[RequiresWeaviateVersion]`. `AsciiFold` is
modeled as a nullable record (null = disabled, non-null = enabled with
optional `Ignore` list) so the invalid "ignore without fold" state is
unrepresentable without a validator.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
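To illustrate why the nullable-record design needs no validator, here is a minimal sketch. The record declarations below are hypothetical stand-ins written for this example, not the client's actual source; only the names `AsciiFold`, `Ignore`, `AsciiFoldConfig`, `TextAnalyzerConfig` and `StopwordPreset` come from this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes for illustration; the real client types may differ.
public sealed record AsciiFoldConfig(IReadOnlyList<string>? Ignore = null);

public sealed record TextAnalyzerConfig(
    AsciiFoldConfig? AsciiFold = null,   // null = folding disabled
    string? StopwordPreset = null);

public static class AnalyzerDescriptions
{
    // The ignore list lives inside AsciiFoldConfig, so "ignore without fold"
    // cannot be constructed and never has to be checked at runtime.
    public static string Describe(TextAnalyzerConfig config)
    {
        if (config.AsciiFold is null)
        {
            return "ascii folding disabled";
        }

        var ignore = config.AsciiFold.Ignore;
        return ignore is null || ignore.Count == 0
            ? "ascii folding enabled"
            : $"ascii folding enabled, ignoring: {string.Join(", ", ignore)}";
    }
}
```

With a flat `bool AsciiFold` plus a separate `Ignore` list, a caller could set the list while leaving the flag off, and something would have to reject that combination. Nesting the list under the nullable record removes that case by construction.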
- New docs/TOKENIZE_API_USAGE.md covers both `client.Tokenize.Text` and `collection.Tokenize.Property`, analyzer config (ASCII folding, stopwords), the result shape, and common usage patterns.
- Link the guide from README under "Additional Guides".
- Add an "Unreleased" CHANGELOG entry for the tokenize endpoints.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port weaviate-python-client PR #2006 on top of the tokenize-endpoint stack for Weaviate 1.37.0:
- Property.TextAnalyzer: pin ASCII folding and stopword preset per property at index time. Reuses the TextAnalyzerConfig record already introduced for /v1/tokenize so tokenize-at-query and index-at-insert stay aligned. Propagates through nested properties via Property -> NestedProperties recursion.
- InvertedIndexConfig.StopwordPresets: named preset -> word-list map on the collection inverted-index config. Properties reference presets via TextAnalyzer.StopwordPreset. Round-trips through create + update.
- InvertedIndexConfigUpdate.StopwordPresets: mirrors the set accessor on the update wrapper so c.InvertedIndexConfig.StopwordPresets = ... works inside collection.Config.Update(...).
- Preflight in CollectionsClient.Create: detects either feature in the incoming schema and throws WeaviateVersionMismatchException when the connected server is older than 1.37.0, before any REST call.
- Rename TokenizeAnalyzerConfig -> TextAnalyzerConfig: same shape now serves both the tokenize endpoint and the property-level analyzer, matching the server type name and Python naming.
- Integration tests in TestCollectionTextAnalyzer.cs cover preset round-trip, update, referenced-removal rejection, ascii-fold combos, and version-gate behaviour.
- CHANGELOG + docs/TOKENIZE_API_USAGE.md extended with worked examples for the schema-side analyzer and stopword presets.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
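The preflight idea in this commit (fail locally before any REST call when the server is too old) can be sketched roughly as follows. This is not the client's implementation: `VersionGate`, `EnsureAtLeast` and `VersionMismatchSketchException` are names made up for this example, and only the 1.37.0 threshold comes from the commit message.

```csharp
using System;

// Hypothetical exception type standing in for WeaviateVersionMismatchException.
public sealed class VersionMismatchSketchException : Exception
{
    public VersionMismatchSketchException(string message) : base(message) { }
}

public static class VersionGate
{
    // Threshold taken from the commit message; always use three components
    // so that System.Version comparisons behave as expected.
    public static readonly Version TextAnalyzerMinimum = new Version(1, 37, 0);

    // Throws before any network call if the connected server is too old.
    public static void EnsureAtLeast(Version server, Version required, string feature)
    {
        if (server < required)
        {
            throw new VersionMismatchSketchException(
                $"{feature} requires Weaviate {required} or newer; connected server is {server}.");
        }
    }
}
```

Under these assumptions, `VersionGate.EnsureAtLeast(new Version(1, 36, 9), VersionGate.TextAnalyzerMinimum, "TextAnalyzer")` throws, while a 1.37.1 server passes the check.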
…pwordPresets rejections
The `StopwordPresets_RemoveInUse_RejectedByServer` and `StopwordPresets_RemoveReferencedByNested_RejectedByServer` tests expected `WeaviateClientException`, but the server returns HTTP 422, which the client maps to `WeaviateUnprocessableEntityException : WeaviateServerException`. The test names already indicate these are server-side rejections, so align the assertions with the actual (and correct) exception type.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resyncs openapi.json from server tag v1.37.2 and regenerates DTOs. Fixes the TokenizeRequest DTO: stopwordPresets changes from IDictionary<string, StopwordConfig> to IDictionary<string, IList<string>> (flat word lists) and gains the new stopwords one-off block field. Simplifies TokenizeResponse (echo fields removed by server).
Public API changes:
- TokenizeClient.Text gains a stopwords: StopwordConfig? param (one-off block, mutually exclusive with stopwordPresets)
- stopwordPresets type corrected to IDictionary<string, IList<string>>
- TokenizeResult drops Tokenization / AnalyzerConfig / StopwordConfig echo properties (server no longer returns them)
Tests updated accordingly: CustomPreset_Additions uses flat word lists, CustomPreset_BaseAndRemovals rewritten as Stopwords_OneOff_BaseAndRemovals using the new stopwords param, echo-field assertions removed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
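As a rough picture of the corrected request shape, the sketch below serializes a tokenize body with `stopwordPresets` as flat word lists and rejects the combination with a one-off `stopwords` block. Treat it as an assumption-laden illustration: the `text` and `tokenization` JSON field names, the `StopwordConfig` members, the `TokenizeBodySketch` helper and the client-side exclusivity check are all made up here. Only `stopwordPresets`, `stopwords` and the `IDictionary<string, IList<string>>` type are named in the commit message.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for the client's StopwordConfig; members are assumed.
public sealed record StopwordConfig(
    string? Preset = null,
    IList<string>? Additions = null,
    IList<string>? Removals = null);

public static class TokenizeBodySketch
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
    };

    public static string Build(
        string text,
        string tokenization,
        StopwordConfig? stopwords = null,
        IDictionary<string, IList<string>>? stopwordPresets = null)
    {
        // The commit describes the two options as mutually exclusive;
        // enforcing that on the client side is an assumption of this sketch.
        if (stopwords is not null && stopwordPresets is not null)
        {
            throw new ArgumentException("Pass either stopwords or stopwordPresets, not both.");
        }

        var body = new Dictionary<string, object?>
        {
            ["text"] = text,                 // assumed field name
            ["tokenization"] = tokenization, // assumed field name
        };
        if (stopwords is not null) body["stopwords"] = stopwords;
        if (stopwordPresets is not null) body["stopwordPresets"] = stopwordPresets;

        return JsonSerializer.Serialize(body, Options);
    }
}
```

In this sketch a preset is just a name mapped to a list of words, for example `{ "custom": ["the", "a"] }`, rather than a nested config object per preset.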
6043bb1 to
1e00804
g-despot approved these changes May 5, 2026
Summary
Ports two related Weaviate 1.37.0 features from the Python client into one stacked PR:
1. python-client PR #2012: `/v1/tokenize` endpoints
   - `client.Tokenize.Text(text, tokenization, analyzerConfig?, stopwordPresets?, ct)` (POST /v1/tokenize)
   - `collection.Tokenize.Property(propertyName, text, ct)` (POST /v1/schema/{class}/properties/{prop}/tokenize)
   - Aligned with the TS client's `tokenize` namespace design.
   - `AsciiFold` is a nullable record (`AsciiFoldConfig? AsciiFold`): null = disabled, non-null = enabled with optional `Ignore` list. The invalid "ignore without fold" state is unrepresentable, so no runtime validator is needed.
   - Version-gated at 1.37.0 via `[RequiresWeaviateVersion(1, 37, 0)]` + `EnsureVersion<T>()`.
2. python-client PR #2006: property-level `TextAnalyzer` + collection-level `StopwordPresets`
   - `Property.TextAnalyzer`: pin ASCII folding and stopword preset per property at index time. Reuses the `TextAnalyzerConfig` record from (1) so tokenize-at-query and index-at-insert stay aligned. Propagates through nested properties.
   - `InvertedIndexConfig.StopwordPresets`: named preset → word-list map on the collection inverted-index config. Properties reference presets via `TextAnalyzer.StopwordPreset`.
   - `InvertedIndexConfigUpdate.StopwordPresets`: mirrors the set accessor on the update wrapper so `c.InvertedIndexConfig.StopwordPresets = ...` works inside `collection.Config.Update(...)`.
   - `CollectionsClient.Create` detects either feature in the incoming schema and throws `WeaviateVersionMismatchException` when the server is older than 1.37.0, before any REST call.
   - `TextAnalyzerConfig` shape and version gate.
   - `TokenizeAnalyzerConfig` was renamed to `TextAnalyzerConfig` to match the server type name.

Docs: `TOKENIZE_API_USAGE.md`, an end-to-end guide covering both scopes, including schema-time analyzer + preset examples.
Out of scope
- `gse_ch` fix: separate tracked work.

Test plan
- `dotnet build src/Weaviate.Client/` → 0 errors
- `dotnet build src/Weaviate.Client.Tests/` → 0 warnings, 0 errors
- `dotnet test --filter FullyQualifiedName~TestTokenize` → 16/16 passed against Weaviate 1.37.1
- `dotnet test --filter FullyQualifiedName~TestCollectionTextAnalyzer` against Weaviate 1.37.1

🤖 Generated with Claude Code