Improve the connector hashing and extend condesc_t #1528

ampli · 2024-05-17T14:06:02Z

This patch enumerates the connector descriptors and uses this number in the connector hash functions.
(I also plan to use it as an array index in a future parse speedup.)

It also extends the information obtained through Connector::desc without increasing its 32-bit size by adding a pointer to condesc_t pointing to extra data. This will allow us to add more data per connector type, e.g., costs for multi-connectors (#1351, #1352) and a cost table per connection length (I think mentioned in #632, and a placeholder added in 2c1d979). The idea is that the data needed during parsing for easy_match is still directly available, while obtaining other data before and after the parsing may need an extra redirection.

The hashing functions have also improved, and they should now keep the collisions low, hopefully without pathological cases (the current hash doesn't differentiate between connectors with more than 4 UP and 4 LC characters, and increasing connector_hash_t to size_t turned out to be too costly).
I handled two types of collisions:

Different connector lists that map to the same 32-bit hash value. Enabling the existing per-connector feedback hash (with different multiplications constants) reduced them by 2+ orders of magnitude.
Extra collisions due to different hash values that hashed to the same slot. The added Fibonacci multiplication significantly reduced them.

I've also included a few minor efficiency tweaks. I tested the speed on the corpus batches and the big-dict (from #1479), and it seems there is a small speedup (but this patch is not about speedup).

The second invocation of connector_list_equal(slot->clist, c) is not needed, because at this point we know that its result was "true" 2 lines before. So replace it by "true". This is only a very slight speedup, if any, because the compiler (in -O3) knows it is a "pure" function and thus returns its previous result.

Due to changes in include files I got a compiler warning that it shadows the global verbosity variable.

Keep condesc_t at 32 bytes after adding con_num, which will be used in connector hash and might later be used for radix sort.

Knuth's Method / Fibonacci Hashing

linas

the circular pointers seem unusual. Are you sure you meant that? Merging anyway.

linas · 2024-05-20T22:28:59Z

link-grammar/connectors.c

 		if (NULL == condesc) continue;
-		calculate_connector_info(condesc);
+		calculate_connector_info(&dict->contable.hdesc[n]);
 		sdesc[i++] = dict->contable.hdesc[n].desc;


this seems to be same as condesc

condesc is dict->contable.hdesc[n].desc, not dict->contable.hdesc[n].
To be clearer, these lines can be changed to (not tested):

hdesc_t *hdesc = &dict->contable.hdesc[n]; if (NULL == hdesc->desc) continue; calculate_connector_info(hdesc); sdesc[i++] = hdesc->desc;

linas · 2024-05-20T22:31:54Z

link-grammar/connectors.h

+{
+	lc_enc_t lc_letters;
+	lc_enc_t lc_mask;
+	hdesc_t *more;       /* More information, for keeping small struct size. */


So hdesc points at condesc, and condesc points at hdesc. Interesting ... the old code did not have this loop.

hdesc are the slots of the connector hash table. Each one that is used, has a desc pointer to a condesc_t element. Up to here it is a standard implementation of an open hash table.
The more elements of condesc_t need to point somewhere, and it was convenient to use the hdesc slots, which are already allocated (it requires less code). However, this is memory inefficient (and indirectly CPU inefficiently due to cache trashing) because most of the hdesc elements are empty (NULL desc).

The solution is to use a memory pool for the more elements. I will eventually send a cleanup PR for that (and the sort_condesc_by_uc_constring() code).

Oh! Ah! OK. I think I understand. This wasn't obvious from casual perusal, so I thought I'd point it out.

ampli added 11 commits May 15, 2024 17:10

connectors.h: Move up the definition of UNLIMITED_LEN

1d2a857

connectors.h: Move up the definition of hdesc

77b1a5c

link-generator.c: use_prompt(): Rename parameter to verbosity_setting

d421b76

Due to changes in include files I got a compiler warning that it shadows the global verbosity variable.

condesc_t: Enumerate and move some fields to hdesc_t

2bebbdb

Keep condesc_t at 32 bytes after adding con_num, which will be used in connector hash and might later be used for radix sort.

connector_hash(): Use the connector descriptor enumeration

944a3f0

connector_list_hash(): Better hash

8840a37

eliminate_duplicate_disjuncts(): Add collision debug stats

8f35c47

eliminate_duplicate_disjuncts(), tracon-set.c: Use Fibonacci Hashing

d325e3e

Knuth's Method / Fibonacci Hashing

eliminate_duplicate_disjuncts: unsigned -> connector_hash_t

1b5d1b5

eliminate_duplicate_disjuncts: Pre-calculate hash constants

244e053

ampli changed the title ~~Improve the connector hashing and add extend condesc_t~~ Improve the connector hashing and extend condesc_t May 17, 2024

linas reviewed May 20, 2024

View reviewed changes

linas merged commit 52bf569 into opencog:master May 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve the connector hashing and extend condesc_t #1528

Improve the connector hashing and extend condesc_t #1528

Uh oh!

ampli commented May 17, 2024

Uh oh!

linas left a comment

Uh oh!

linas May 20, 2024

Uh oh!

ampli May 21, 2024

Uh oh!

linas May 20, 2024

Uh oh!

ampli May 21, 2024 •

edited

Loading

Uh oh!

linas May 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Improve the connector hashing and extend condesc_t #1528

Improve the connector hashing and extend condesc_t #1528

Uh oh!

Conversation

ampli commented May 17, 2024

Uh oh!

linas left a comment

Choose a reason for hiding this comment

Uh oh!

linas May 20, 2024

Choose a reason for hiding this comment

Uh oh!

ampli May 21, 2024

Choose a reason for hiding this comment

Uh oh!

linas May 20, 2024

Choose a reason for hiding this comment

Uh oh!

ampli May 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linas May 21, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ampli May 21, 2024 •

edited

Loading