Switch to multimap based nfd_map due to compile time issues #5799
ggerganov merged 3 commits into ggml-org:master
Compile times would likely be reduced by making unicode.h a proper .cpp file instead of a header full of static tables and functions, especially if you include the time to compile the tests. The Rust implementation that apage43 linked uses minimal perfect hash tables for faster lookup (every key is guaranteed to hash to a different slot, so collisions are impossible). If lookup is a bottleneck, it might be worth trying to implement something similar. (Somebody should make a flamegraph first to confirm; maybe I will if I have time.)
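To make the perfect-hash idea concrete, here is a minimal sketch of a collision-free lookup table built by brute-force seed search. It is not the linked Rust implementation and not code from this repository: `mix`, `PerfectMap`, and `build_perfect_map` are hypothetical names, the mixer is a splitmix64-style finalizer chosen for illustration, and the construction assumes all keys are distinct. Real generators use more scalable constructions; brute force is only practical for small tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical seeded mixer (splitmix64-style finalizer).
static uint64_t mix(uint32_t key, uint64_t seed) {
    uint64_t z = key + seed * 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

struct PerfectMap {
    uint64_t seed = 0;
    std::vector<uint32_t> keys;    // keys[slot] is the key that owns that slot
    std::vector<uint32_t> values;  // values[slot] is its payload

    // One hash, one comparison; no probing, because no two keys share a slot.
    bool find(uint32_t key, uint32_t & out) const {
        if (keys.empty()) return false;
        const size_t slot = mix(key, seed) % keys.size();
        if (keys[slot] != key) return false;  // key was not in the original set
        out = values[slot];
        return true;
    }
};

// Try seeds until every key lands in its own slot of an n-entry table.
// Assumes distinct keys; gives up after a fixed number of attempts.
static PerfectMap build_perfect_map(const std::vector<std::pair<uint32_t, uint32_t>> & entries) {
    PerfectMap m;
    const size_t n = entries.size();
    if (n == 0) return m;
    for (uint64_t seed = 0; seed < (1ull << 20); ++seed) {
        std::vector<bool> used(n, false);
        bool ok = true;
        for (const auto & e : entries) {
            const size_t slot = mix(e.first, seed) % n;
            if (used[slot]) { ok = false; break; }
            used[slot] = true;
        }
        if (!ok) continue;
        m.seed = seed;
        m.keys.resize(n);
        m.values.resize(n);
        for (const auto & e : entries) {
            const size_t slot = mix(e.first, seed) % n;
            m.keys[slot]   = e.first;
            m.values[slot] = e.second;
        }
        return m;
    }
    throw std::runtime_error("no collision-free seed found");
}
```

In a real build the seed search would run offline and only the finished arrays would be compiled in, which is what keeps both compile time and lookup cost low.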
Turns out we were spending an inordinate amount of time creating
ggerganov left a comment
It's still not back to the original level, but much better:
# before
real 0m5.506s
user 0m5.352s
sys 0m0.101s
# after
real 0m6.396s
user 0m6.216s
sys 0m0.109s
* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* dont construct new locale every time
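The last commit message suggests that a locale was being rebuilt on every call. A minimal sketch of the usual remedy, caching the locale in a function-local static, might look like the following. The names `make_locale`, `cached_locale`, and `to_lower_cached` are hypothetical, and the locale name `en_US.UTF-8` is an assumption for illustration, not necessarily what the PR uses.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: try a named locale and fall back to the classic "C"
// locale if the named one is not installed on this system.
static std::locale make_locale() {
    try {
        return std::locale("en_US.UTF-8");  // assumed name, for illustration
    } catch (const std::runtime_error &) {
        return std::locale::classic();
    }
}

// Constructed once on first use; initialization of function-local statics
// is thread-safe since C++11.
static const std::locale & cached_locale() {
    static const std::locale loc = make_locale();
    return loc;
}

// Lowercase a wide character without constructing a new locale per call.
static wchar_t to_lower_cached(wchar_t c) {
    return std::tolower(c, cached_locale());
}
```

Constructing a named locale can involve parsing locale data, so doing it once rather than per character plausibly matters in a tokenizer's inner loop.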
Fixes issues with #5740. Yields the same tokenizer outcomes as before but brings compile time back to normal. Performance also seems to be unchanged. In testing, though, I noticed that #5740 itself appears to reduce performance substantially, so hopefully there is a better long-term solution here.
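As a rough illustration of the multimap approach described above, the sketch below stores one flat `(codepoint, decomposed codepoint)` pair per entry and expands a codepoint with `equal_range`. The two table rows and the function name `normalize_nfd` are illustrative assumptions, not the actual table or code from this PR. A flat list of integer pairs is presumably cheaper for the compiler to process than a map whose values are nested containers.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Illustrative subset only: each decomposed codepoint is its own entry.
// Entries with equivalent keys keep their insertion order (C++11 and later).
static const std::multimap<uint32_t, uint32_t> nfd_map = {
    {0x00C5, 0x0041}, {0x00C5, 0x030A},  // A with ring above -> 'A' + combining ring above
    {0x00E9, 0x0065}, {0x00E9, 0x0301},  // e with acute      -> 'e' + combining acute accent
};

// Hypothetical helper: replace each codepoint by its decomposition if the
// table has one, otherwise pass it through unchanged.
static std::vector<uint32_t> normalize_nfd(const std::vector<uint32_t> & cpts) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (uint32_t cp : cpts) {
        const auto range = nfd_map.equal_range(cp);
        if (range.first == range.second) {
            out.push_back(cp);  // no decomposition known
            continue;
        }
        for (auto it = range.first; it != range.second; ++it) {
            out.push_back(it->second);
        }
    }
    return out;
}
```

Each lookup stays O(log n) in the table size, the same order as a `std::map` lookup, which is consistent with the observation that performance was unchanged.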