Conversation

@kevinludwig

Tokenizer is using the disect library to find match strings. disect uses a binary search to minimize the number of comparisons it has to make. That only works if the input sequence is sorted (or if the predicate function can determine whether the match is greater than or less than the current match). I'm sure there are a variety of cases which break because of this, but the one I found was a quoted string that isn't short. If you put a console.log inside the code, you'll see that the algorithm around disect cuts the match in half and keeps shortening the string to look for a match, ultimately (and incorrectly) determining that there is no match.
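For illustration, here is a minimal sketch of the failure mode (plain JavaScript, not the actual disect API; the rule, input, and function names are all hypothetical). The "does this prefix match" predicate is not monotone in prefix length for a rule like a quoted string, so a binary search can discard the half that contains the real match:

```js
// A quoted-string rule: only a *complete* quoted string matches.
const rule = /^"[^"]*"$/;
const input = '"a long quoted str"';

const matches = len => rule.test(input.slice(0, len));

// Classic bisection for "largest len where matches(len) is true".
// It assumes matches() is monotone (true up to a boundary, then false),
// which does not hold here.
function longestMatchBisect(n) {
  let lo = 0, hi = n;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (matches(mid)) lo = mid;   // keep searching above
    else hi = mid - 1;            // discards mid..hi -- wrong for this rule!
  }
  return lo;
}

console.log(matches(input.length));            // true  -- the full string is a token
console.log(longestMatchBisect(input.length)); // 0     -- bisection misses it
```

The first probe lands mid-string, where the prefix `"a long qu` has no closing quote and fails the rule, so the search throws away every longer prefix, including the full-length one that actually matches.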

I changed the code to do a for loop from the end of the string back to the start, looking for the longest string with a matching rule. This works, and all existing tests continue to pass.
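A sketch of that fix, under the same hypothetical names as above: walk prefix lengths from longest to shortest and return the first (i.e. longest) one a rule matches. This is linear in the input rather than logarithmic, but it is correct even for non-monotone rules:

```js
// Same hypothetical quoted-string rule as in the sketch above.
const rule = /^"[^"]*"$/;

function longestMatchLinear(input) {
  // Scan from the full length down; the first hit is the longest match.
  for (let len = input.length; len > 0; len--) {
    if (rule.test(input.slice(0, len))) return len;
  }
  return 0; // no prefix matched any rule
}

console.log(longestMatchLinear('"a long quoted str"')); // 19 (the full quoted string)
```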

@Floby
Owner

Floby commented Nov 26, 2014

Hello,
Just to tell you that I've seen this, and there's also #9, which looks interesting.
I should be reviewing these two pretty soon.

@farskipper

Hey, I also had problems due to bisection. It seems like bisection is definitely the right call for some tokenizing problems, but not for all of them. For that reason I made tokenizer2, which doesn't use bisection.
