@absurdfarce
Collaborator

The problem appeared to be that "vector" was added as a keyword but wasn't specified as an unreserved keyword. I added it to the parser's collection of native type names, which (a) also makes it an unreserved keyword and (b) seems more correct on its face anyway.

@absurdfarce
Collaborator Author

absurdfarce commented Jun 25, 2024

Possibly useful for testing purposes:

CREATE KEYSPACE test
  WITH REPLICATION = { 
   'class' : 'SimpleStrategy', 
   'replication_factor' : 1 
  };

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    vector vector<float, 3>
);

i,vector
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"

select vector from test.foo order by vector ann of [3.4, 7.8, 9.1] limit 1;

@absurdfarce absurdfarce linked an issue Jun 25, 2024 that may be closed by this pull request
| K_TIMEUUID
| K_DATE
| K_TIME
| K_VECTOR
Contributor

I find this a bit surprising. I haven't followed all the fuss around the introduction of this new type, but isn't it a complex type? As in: it can have parameters, e.g. vector<float, 3>. I don't think the current grammar would parse that.

Granted, type declarations are not very useful for DSBulk because it doesn't understand DDL statements. But they can still appear in SELECT or INSERT statements, notably when casting some value. Not sure if SELECT (vector<float, 3>) col1 FROM... makes sense, but someone might have a good usage for that.

Wouldn't it be better to make a little effort and properly create a rule for vectorType?

As for the problem of having the token VECTOR be considered an unreserved keyword: I see that in Cassandra's grammar it's listed under the basic_unreserved_keyword rule:

https://github.com/apache/cassandra/blob/487f59e66c2a5baaf06b7a4560898cd137802579/src/antlr/Parser.g#L2037

Maybe we can add it under basicUnreservedKeyword in DSBulk's grammar?
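
To make the difference between a bare keyword and a dedicated type rule concrete, here is a minimal sketch written as a hand-rolled Java matcher rather than as ANTLR. It is not DSBulk's or Cassandra's grammar: the class `VectorTypeSketch`, its nested `VectorType` holder, and the `parse` method are hypothetical names, and the sketch assumes the subtype is a simple identifier (nested parameterized subtypes are not handled).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of DSBulk or Cassandra.
final class VectorTypeSketch {

    /** Parsed form of a type such as vector<float, 3>. */
    static final class VectorType {
        final String subtype;
        final int dimension;

        VectorType(String subtype, int dimension) {
            this.subtype = subtype;
            this.dimension = dimension;
        }
    }

    // Assumed shape: keyword, '<', identifier subtype, ',', integer size, '>'.
    private static final Pattern VECTOR_TYPE = Pattern.compile(
        "\\s*vector\\s*<\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*,\\s*(\\d+)\\s*>\\s*",
        Pattern.CASE_INSENSITIVE);

    /** Accepts only the parameterized form; a bare "vector" is rejected. */
    static VectorType parse(String text) {
        Matcher m = VECTOR_TYPE.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a vector type: " + text);
        }
        String subtype = m.group(1).toLowerCase(Locale.ROOT);
        int dimension = Integer.parseInt(m.group(2));
        return new VectorType(subtype, dimension);
    }

    public static void main(String[] args) {
        VectorType t = parse("vector<float, 3>");
        System.out.println(t.subtype + " x " + t.dimension);
    }
}
```

A rule that only lists the keyword among native type names corresponds to matching the word `vector` alone, which leaves the `<float, 3>` suffix unparsed. The sketch requires both parameters, which is the behavior a dedicated rule would need.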

Collaborator Author

These aren't unreasonable points @adutra. I think it's probably worth a brief explanation as to how I got here but I strongly suspect it won't change the fundamental analysis very much.

In both the Astra and OSS Cassandra cases, "vector" is implemented as a custom type (largely to avoid introducing a new protocol version). Think very much along the lines of what was done for duration in protocol v4. vector is slightly different because it's qualified by a subtype and a size, which is something we haven't had to support for custom types to date, but otherwise it's very similar to duration.

All of that said I don't disagree with your fundamental point; I think we can do a better job of matching up to what Cassandra does. There is an additional complication in that Astra only supports vectors of floats while OSS Cassandra supports vectors of arbitrary subtypes... but again I don't think that changes our process too much. The grammar used by dsbulk should aim to support the broadest possible case; if we wind up sending vectors of strings to Astra it'll fail for other reasons.

Collaborator Author

So after looking at this a bit, I offer the following to reinforce @adutra's point above:

cqlsh>  select (vector<float,1>)[0.123] from system.local;

 (org.apache.cassandra.cql3.CQL3Type$Raw$RawVector@2c7ea098)[0.123]
--------------------------------------------------------------------
                                                      b'=\xfb\xe7m'

(1 rows)

Pay no attention to the byte display problem there; that's a known issue with the version of cqlsh + Python driver I'm using.

Worth noting that CAST operations are not a concern here.

cqlsh>  select CAST (broadcast_address as vector<float,1>) from system.local;
SyntaxException: line 1:34 no viable alternative at input 'vector' (select CAST (broadcast_address as [vector]...)

Rationale is that cast only works with native types... and vector is very explicitly not a native type.

@absurdfarce
Collaborator Author

Thanks for the review @adutra; your points are well-taken.

I think I have a good working impl incorporating your suggestions... please take another look when you have a sec!

@absurdfarce absurdfarce requested a review from adutra July 3, 2024 22:25
Contributor

@adutra adutra left a comment

Thanks for the thorough explanations @absurdfarce. And yes, the grammar is looking much nicer 👍


@ParameterizedTest
@MethodSource
void should_detect_unsupported_vector_selector(String query, boolean expected) {
Contributor

Nit: do we really need a parameterized test here? Are you anticipating more test cases?

Collaborator Author

Bah, good catch. I think this is a holdover from when I was originally planning on some other tests which would involve various permutations of quoted values. Obviously that doesn't apply at all here... I'll change it.

@absurdfarce absurdfarce requested a review from adutra July 8, 2024 16:40
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import javax.management.Query;
Collaborator Author

Mistakenly added during a refactoring; removing.

Development

Successfully merging this pull request may close these issues.

Parsing trouble when a column is called "vector"

3 participants