-
Notifications
You must be signed in to change notification settings - Fork 118
Improve the connector hashing and extend condesc_t #1528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The second invocation of connector_list_equal(slot->clist, c) is not needed, because at this point we know that its result was "true" 2 lines before. So replace it by "true". This is only a very slight speedup, if any, because the compiler (in -O3) knows it is a "pure" function and thus returns its previous result.
Due to changes in include files I got a compiler warning that it shadows the global verbosity variable.
Keep condesc_t at 32 bytes after adding con_num, which will be used in connector hash and might later be used for radix sort.
Knuth's Method / Fibonacci Hashing
linas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the circular pointers seem unusual. Are you sure you meant that? Merging anyway.
| if (NULL == condesc) continue; | ||
| calculate_connector_info(condesc); | ||
| calculate_connector_info(&dict->contable.hdesc[n]); | ||
| sdesc[i++] = dict->contable.hdesc[n].desc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be same as condesc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condesc is dict->contable.hdesc[n].desc, not dict->contable.hdesc[n].
To be clearer, these lines can be changed to (not tested):
hdesc_t *hdesc = &dict->contable.hdesc[n];
if (NULL == hdesc->desc) continue;
calculate_connector_info(hdesc);
sdesc[i++] = hdesc->desc;
| { | ||
| lc_enc_t lc_letters; | ||
| lc_enc_t lc_mask; | ||
| hdesc_t *more; /* More information, for keeping small struct size. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So hdesc points at condesc, and condesc points at hdesc. Interesting ... the old code did not have this loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hdesc are the slots of the connector hash table. Each one that is used, has a desc pointer to a condesc_t element. Up to here it is a standard implementation of an open hash table.
The more elements of condesc_t need to point somewhere, and it was convenient to use the hdesc slots, which are already allocated (it requires less code). However, this is memory inefficient (and indirectly CPU inefficiently due to cache trashing) because most of the hdesc elements are empty (NULL desc).
The solution is to use a memory pool for the more elements. I will eventually send a cleanup PR for that (and the sort_condesc_by_uc_constring() code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Ah! OK. I think I understand. This wasn't obvious from casual perusal, so I thought I'd point it out.
This patch enumerates the connector descriptors and uses this number in the connector hash functions.
(I also plan to use it as an array index in a future parse speedup.)
It also extends the information obtained through Connector::desc without increasing its 32-bit size by adding a pointer to
condesc_tpointing to extra data. This will allow us to add more data per connector type, e.g., costs for multi-connectors (#1351, #1352) and a cost table per connection length (I think mentioned in #632, and a placeholder added in 2c1d979). The idea is that the data needed during parsing for easy_match is still directly available, while obtaining other data before and after the parsing may need an extra redirection.The hashing functions have also improved, and they should now keep the collisions low, hopefully without pathological cases (the current hash doesn't differentiate between connectors with more than 4 UP and 4 LC characters, and increasing
connector_hash_ttosize_tturned out to be too costly).I handled two types of collisions:
I've also included a few minor efficiency tweaks. I tested the speed on the corpus batches and the
big-dict(from #1479), and it seems there is a small speedup (but this patch is not about speedup).