At first glance, CosineSimilarity looks like an innocent ExternalFunction that takes in two strings and computes the cosine similarity between their bags of words.
But this is not what it is doing.
It expects each input string to be of the form "<int>:<float> <int>:<float> <int>:<float> ..." (no literal angle brackets, entries separated by exactly one space), where each int is some id for a word and each float is some score (tf-idf maybe?).
It then builds two vectors with those values at the specified indexes and takes the cosine similarity of those vectors.
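
To make the described behavior concrete, here is a rough Python sketch of what the class appears to be doing. This is not the actual implementation: the names `parse_sparse_vector` and `sparse_cosine` are made up for illustration, and the handling of empty input and zero vectors is an assumption, since none of that is documented.

```python
import math


def parse_sparse_vector(text):
    """Parse "<int>:<float> <int>:<float> ..." into a {index: value} dict."""
    vector = {}
    for entry in text.split(" "):
        if not entry:
            continue  # assumption: ignore empty input rather than fail
        index, value = entry.split(":")
        vector[int(index)] = float(value)
    return vector


def sparse_cosine(a, b):
    """Cosine similarity between two strings in the sparse id:score format."""
    va = parse_sparse_vector(a)
    vb = parse_sparse_vector(b)
    # Only indexes present in both vectors contribute to the dot product.
    dot = sum(value * vb.get(index, 0.0) for index, value in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: zero vectors are treated as dissimilar
    return dot / (norm_a * norm_b)
```

With this sketch, `sparse_cosine("1:3.0 2:4.0", "1:3.0 2:4.0")` gives 1.0, while passing ordinary text such as `"the cat sat"` raises a `ValueError`, which is exactly the surprise for anyone expecting a bag-of-words function.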
There is no documentation or anything else describing this unintuitive behavior.
We need to rename this class and get an actual cosine similarity up.
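
For reference, something like the following is what I would expect a function with this name to do. It is only a sketch in Python, not a proposed implementation for the ExternalFunction itself; `bag_of_words_cosine` is a hypothetical name, and lowercasing plus whitespace tokenization are assumptions that the real version may want to change.

```python
import math
from collections import Counter


def bag_of_words_cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    # Assumption: tokens are lowercased and split on whitespace.
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    # Counter returns 0 for missing words, so unshared words add nothing.
    dot = sum(count * counts_b[word] for word, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty string is similar to nothing
    return dot / (norm_a * norm_b)
```

Here identical strings score approximately 1.0, strings with no words in common score 0.0, and `bag_of_words_cosine("a b", "a c")` comes out at approximately 0.5.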
We also need to do a full audit of all the textsim classes to make sure this ridiculousness doesn't happen again.