fix: tokenize doc + test alignment for 1.37.2 #335
mpartipilo merged 3 commits into feat/tokenize-endpoint
- TOKENIZE_API_USAGE.md: correct Tokenize.Text signature (adds stopwords param, fixes stopwordPresets type to IDictionary<string, IList<string>>), rewrite stopwords examples to use flat word lists, fix Property samples (DataType, PropertyTokenization, Use vs Get), drop fictional result fields (Tokenization / AnalyzerConfig / StopwordConfig) and fix the Result Shape table to match the actual record (Indexed + Query only).
- TestTokenize.Tokenization_Enum: pass stopwords: None explicitly so the test no longer depends on server defaults (1.37.2 auto-applies the EN preset for Word tokenization).
- CI: run on 1.37.2 instead of 1.37.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "from 1.37.2 onward" wording suggested the default-EN behavior was introduced at that version; I haven't verified that against earlier releases. The openapi spec just documents it as the current default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified locally against 1.37.2 with both dotnet test and a raw curl to /v1/tokenize: the EN-preset default strips "the" from result.Query only — result.Indexed always keeps it. The earlier comment said both lists were affected, which was wrong.
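The asymmetry described above (stopwords removed from `Query` only, `Indexed` left intact) can be illustrated with a small local model. This is only a sketch: `ModelTokenize` and the three-word stopword set are hypothetical stand-ins, it does not call the Weaviate client or server, and its lowercase-and-split rule is an assumed approximation of Word tokenization rather than the server's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local model of the behavior discussed in this PR.
// It is NOT the Weaviate client API and NOT the server's tokenizer.
static (IList<string> Indexed, IList<string> Query) ModelTokenize(
    string text, ISet<string> stopwords)
{
    // Assumed approximation of Word tokenization:
    // lowercase, treat every non-alphanumeric character as a separator.
    var normalized = new string(text.ToLowerInvariant()
        .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
        .ToArray());

    // Indexed keeps every token, stopwords included.
    var indexed = normalized
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .ToList();

    // Query is the same token list with stopwords filtered out.
    var query = indexed.Where(t => !stopwords.Contains(t)).ToList();

    return (indexed, query);
}

// Tiny stand-in for an English stopword preset (assumption, not the real list).
var english = new HashSet<string> { "the", "a", "an" };

var result = ModelTokenize("The quick brown fox", english);
Console.WriteLine(string.Join(",", result.Indexed)); // the,quick,brown,fox
Console.WriteLine(string.Join(",", result.Query));   // quick,brown,fox
```

Passing an empty stopword set to the model makes both lists identical, which mirrors why the test disables stopwords explicitly before asserting on `Query`.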
Pull request overview
Aligns the Weaviate C# client docs and integration tests with Weaviate server behavior as of v1.37.2, particularly around tokenize stopword defaults and the current SDK surface area.
Changes:
- Updated `docs/TOKENIZE_API_USAGE.md` to match the real `Tokenize.Text` signature/result shape and current schema/property APIs (`Collections.Use`, `PropertyTokenization`, `DataType = DataType.Text`, stopwords inputs).
- Made `TestTokenize.Tokenization_Enum` deterministic on Weaviate v1.37.2 by explicitly disabling stopwords for the enum-case assertions.
- Bumped the CI integration test matrix entry from Weaviate `1.37.1` to `1.37.2`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Weaviate.Client.Tests/Integration/TestTokenize.cs | Stabilizes the enum tokenization integration test by explicitly setting stopwords behavior. |
| docs/TOKENIZE_API_USAGE.md | Corrects and modernizes tokenize usage docs (signature, result fields, stopwords shapes, schema samples). |
| .github/workflows/main.yaml | Updates CI to test against Weaviate 1.37.2. |

    // the server defaults Word tokenization to the EN preset, which strips
    // "the" from result.Query and would break the assertion below.

The comment says the default EN stopword preset would strip "the" from both Indexed and Query. In this client/tests (and Weaviate’s stopword semantics), stopwords are typically removed from Query only while Indexed keeps them. Consider rewording to avoid implying Indexed changes as well, since that can mislead future maintenance/debugging of these assertions.
Suggested change:

    - // the server defaults Word tokenization to the EN preset, which strips
    - // "the" from result.Query and would break the assertion below.
    + // the server defaults Word tokenization to the EN preset, which removes
    + // "the" from result.Query while leaving result.Indexed unchanged.

Summary - Weaviate C# Client Coverage
- Weaviate.Client - 48.8%
- Weaviate.Client.Analyzers - 0%
- Weaviate.Client.VectorData - 50.3%
Summary
- `docs/TOKENIZE_API_USAGE.md`: correct the `Tokenize.Text` signature, replace the fictional `result.Tokenization` / `AnalyzerConfig` / `StopwordConfig` fields with the real `Indexed` + `Query`, rewrite stopwords examples to use flat `IList<string>` values (the `StopwordConfig`-as-value shape doesn't compile), fix `Property` samples (`DataType = DataType.Text`, `PropertyTokenization = ...`), and use `Collections.Use(...)` instead of the non-existent `Collections.Get(...)`.
- Make `TestTokenize.Tokenization_Enum` deterministic by passing `stopwords: { Preset = None }`; from 1.37.2 onward `Word` tokenization auto-applies the EN preset, which strips "the" from both `Indexed` and `Query`.

Test plan
- `dotnet build src/Weaviate.Client/` → 0 errors
- `dotnet test --filter Unit` → 820 / 0 / 2
- `dotnet test --filter "TestTokenize|TestCollectionTextAnalyzer"` against Weaviate 1.37.2 → 23 / 0 / 2 (the 2 skips are the version-gate negative tests, expected on a 1.37+ server)

🤖 Generated with Claude Code