The strings_pool is maintained as a sorted list of offsets. When create_shared_string is called, this list is searched using slice::binary_search_by.
If no match is found, the string is stored using create_byte_string and its offset is inserted into strings_pool using Vec::insert.
So the happy path is an O(log N) binary search.
The unhappy path is an O(N) vector insertion.
Every happy path requires a preceding unhappy path, so with N unique strings and M duplicates, the total cost of all calls to create_shared_string is O(N^2) + O(M log N), which is dominated by the O(N^2) term unless duplicates vastly outnumber unique strings.
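To make the cost of the unhappy path concrete, here is a minimal sketch that models the pool outside of flatbuffers. The SortedPool type, its storage vector (standing in for the buffer being built) and the shifted counter are hypothetical names for illustration, not part of the flatbuffers API. In the worst case, where every new string sorts before all existing ones, N insertions move N(N-1)/2 pool entries in total.

```rust
/// Hypothetical stand-in for the builder: `storage` plays the role of the
/// buffer, and `pool` holds indices into it, kept sorted by string content.
#[derive(Default)]
pub struct SortedPool {
    storage: Vec<String>,
    pool: Vec<usize>,
    /// Total number of pool entries moved by `Vec::insert` so far.
    pub shifted: usize,
}

impl SortedPool {
    /// Returns the storage index of `s`, storing it first if it is new.
    pub fn intern(&mut self, s: &str) -> usize {
        let storage = &self.storage;
        // Happy path: O(log N) comparisons against already stored strings.
        match self.pool.binary_search_by(|&i| storage[i].as_str().cmp(s)) {
            Ok(pos) => self.pool[pos],
            Err(pos) => {
                // Unhappy path: store the string, then insert its index at
                // the sorted position, which moves every later entry by one.
                let index = self.storage.len();
                self.storage.push(s.to_owned());
                self.shifted += self.pool.len() - pos;
                self.pool.insert(pos, index);
                index
            }
        }
    }
}
```

For example, interning 1,000 distinct strings in descending order leaves shifted at 499,500, and re-interning any of them afterwards leaves it unchanged.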
I have a stress test that writes a flatbuffer with ~500,000 unique strings; 80% of our serialization time is spent in create_shared_string.
I can reduce this to ~30% of our serialization time (total serialization time goes from 20.4s to 4.1s) by
- changing the type of strings_pool to HashMap<String, WIPOffset<&'fbb str>>, and
- changing the implementation of create_shared_string to
#[inline]
pub fn create_shared_string<'a: 'b, 'b>(&'a mut self, s: &'b str) -> WIPOffset<&'fbb str> {
    self.assert_not_nested(
        "create_shared_string can not be called when a table or vector is under construction",
    );
    match self.strings_pool.get(s) {
        // Already stored: WIPOffset is Copy, so copy the offset out of the map.
        Some(&address) => address,
        None => {
            // First occurrence: write the string, then remember its offset.
            let address = WIPOffset::new(self.create_byte_string(s.as_bytes()).value());
            self.strings_pool.insert(s.to_owned(), address);
            address
        }
    }
}
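The same idea can be tried in isolation, without the builder. The following is a minimal sketch under the assumption that a plain Vec<String> stands in for the buffer; HashPool, intern and unique_len are hypothetical names, not flatbuffers API. It looks the string up first and only allocates an owned key on a miss, so duplicate strings cost one hash lookup and no allocation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the builder: `storage` plays the role of the
/// buffer, and `pool` maps string content to its index in `storage`.
#[derive(Default)]
pub struct HashPool {
    storage: Vec<String>,
    pool: HashMap<String, usize>,
}

impl HashPool {
    /// Returns the storage index of `s`, storing it first if it is new.
    pub fn intern(&mut self, s: &str) -> usize {
        // Hit: expected O(1) lookup by &str, no allocation needed.
        if let Some(&index) = self.pool.get(s) {
            return index;
        }
        // Miss: store the string and remember its index under an owned key.
        let index = self.storage.len();
        self.storage.push(s.to_owned());
        self.pool.insert(s.to_owned(), index);
        index
    }

    /// Number of distinct strings stored so far.
    pub fn unique_len(&self) -> usize {
        self.storage.len()
    }
}
```

One trade-off of this design is memory: the map keeps its own owned copy of every unique string as a key, in addition to the bytes already written to the buffer.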