Skip to content
This repository was archived by the owner on Oct 30, 2021. It is now read-only.

Ensure Context is noncopyable#93

Closed
springmeyer wants to merge 7 commits intomasterfrom
move-only-context
Closed

Ensure Context is noncopyable#93
springmeyer wants to merge 7 commits intomasterfrom
move-only-context

Conversation

@springmeyer
Copy link
Contributor

@springmeyer springmeyer commented Apr 1, 2017

This fixes an odd case (still need to research more into the why here) where Context objects were still being copied rather than moved (even though they had a move constructor enabled).

By moving these large objects we should avoid many allocations and help speed up performance.

On OS X this leads to a significant speedup. On Linux the speedup is less but still noticable.

OSX shows:

Master

# coalesceMulti
ok 117 coalesceMulti @ 7.56ms
# coalesceMulti proximity
ok 118 coalesceMulti + proximity @ 7.44ms

This branch

ok 117 coalesceMulti @ 6.62ms
# coalesceMulti proximity
ok 118 coalesceMulti + proximity @ 6.46ms

@springmeyer
Copy link
Contributor Author

springmeyer commented Apr 5, 2017

Latest results with std::partial_sort on OS X:

TAP version 13
# coalesceSingle
ok 1 coalesceSingle @ 0.42ms
# coalesceSingle proximity
ok 2 coalesceSingle + proximity @ 0.52ms
# coalesceMulti
ok 3 coalesceMulti @ 4.84ms
# coalesceMulti proximity
ok 4 coalesceMulti + proximity @ 4.78ms

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 78.903% when pulling 83b555d on move-only-context into 93652b2 on master.

@springmeyer
Copy link
Contributor Author

springmeyer commented Apr 5, 2017

Because the partial sort speeds things up so much I took a look at profiling results with 45c91ed and after partial sort (83b555d) on OS X.

In both cases, for one thread sampled, 22% of the thread is busy in coalesceMulti (This 22% makes sense since the test is run serially). But the speedup with partial sort is obvious because it moves the sort from being where the largest time taken to only 4th most expensive.

Before:

  • 7.2% is taken in std::sort of ContextSortByRelev
  • 5.3% is taken in deallocation of memory (free)
  • 3.8% is taken in allocation of memory (malloc/new)
  • 2.2% is taken in the std::sort in __getmatching
  • Everything else is less than 1%

After:

  • 5.9% is taken in deallocation of memory (free)
  • 5.1% is taken in allocation of memory (malloc/new)
  • 3.8% is taken in the std::sort in __getmatching
  • 1.3% is taken in std::partial_sort of ContextSortByRelev
  • Everything else is less than 1%

@apendleton
Copy link
Contributor

Sidenote as far as next steps here: __getmatching is actually overloaded and has two implementations, one for rocksdb and one for the in-memory cache. We only actually use the former in production, and only the latter calls std::sort (the rocksdb version sorts lazily). We may want to adjust these benchmarks accordingly as I wonder if the performance characteristics will be different... I expect that the rocksdb implementation may well be slower because it doesn't hold everything in ram, though, so it might make benchmarks a little noisier.

@springmeyer
Copy link
Contributor Author

springmeyer commented Apr 5, 2017

Thanks @apendleton - per chat I've love help on adjusting the benchmark to test the right thing. Feel free to commit to this branch or another with any fixes there.

The other next steps here I see are:

  • Figure out if the partial_sort can be done on a fixed, predictable N. Currently we depend on filtering to a max of 40 after sorting and so the partial_sort needs to run on more than 40 values. This is because we further filter after sorting here and here. If we could dedupe features before sorting that would allow the sort to be much faster.
  • Test in carmen and see if the perf gains above also translate to production before spending too much more time here.

@springmeyer springmeyer changed the title Ensure Context is noncopyable Ensure Context is noncopyable + partial_sort Apr 5, 2017
@springmeyer
Copy link
Contributor Author

Noting that while OS X results are impressive, results on linux/travis are not. So I need to see if travis is lying (very possible) or perhaps libstdc++ (defalt standard library on linux) is not as optimized for partial_sort as libc++ (OSX):

master:

# coalesceSingle
ok 115 coalesceSingle @ 0.48ms
# coalesceSingle proximity
ok 116 coalesceSingle + proximity @ 0.68ms
# coalesceMulti
ok 117 coalesceMulti @ 7.56ms
# coalesceMulti proximity
ok 118 coalesceMulti + proximity @ 7.44ms

45c91ed (this branch, before partial_sort)

# coalesceSingle
ok 115 coalesceSingle @ 0.5ms
# coalesceSingle proximity
ok 116 coalesceSingle + proximity @ 0.68ms
# coalesceMulti
ok 117 coalesceMulti @ 7.22ms
# coalesceMulti proximity
ok 118 coalesceMulti + proximity @ 7.52ms

83b555d (this branch, after partial_sort)

# coalesceSingle
ok 115 coalesceSingle @ 0.82ms
# coalesceSingle proximity
ok 116 coalesceSingle + proximity @ 0.84ms
# coalesceMulti
ok 117 coalesceMulti @ 5.68ms
# coalesceMulti proximity
ok 118 coalesceMulti + proximity @ 7.44ms

@mourner
Copy link
Member

mourner commented Apr 6, 2017

If we could dedupe features before sorting that would allow the sort to be much faster.

Yes! Also note that for deduping, the existing code in coalesceFinalize uses std::map. Since we're deduping by a uint id, using std::unordered_set should be much faster since insertion/lookup is constant time on average.

@springmeyer
Copy link
Contributor Author

Update on my thinking on this PR:

  • The non-copyable fixes are 👍. While they may not show huge perf gains in local testing they should reduce allocations and therefore should help most under high load
  • The partial_sort approach feels very promising and getting close but not quite right. The next actions at Ensure Context is noncopyable #93 (comment) still stand. Ideally the partial sort could be done on a fixed N. So, we should - as @mourner - mentions revisit how deduping works.

I'm going to pause here since revisit how deduping works is above my ability to focus on the code right now. I would be happy to spirit guide someone who wants to pick this up.

@springmeyer springmeyer changed the title Ensure Context is noncopyable + partial_sort Ensure Context is noncopyable May 23, 2017
@springmeyer
Copy link
Contributor Author

I've removed the experimental/not-quite-yet-ideal partial_sort code from this PR in 57ec8fc.

This means that I think this is ready to be tested in production since the remaining changes are clear improvements.

  • Avoiding double move of std::unique_ptr<rocksdb::DB> (and a compiler warning on pessimizing move)
  • Deleting the automatic-compiler-generated constructor, copy constructor, and assignment constructor of the Context class. This ensures that the Context is truly "noncopyable" and therefore only movable. This will save on memory allocations on linux where otherwise the object was being copied rather than moved.
  • Avoiding reserving too much memory when calling context.reserve and there were less than 40 contexts to pick from.

/cc @KaiBot3000 @aarthykc - would you be interested in running this out? My hope is that this will help reduce memory usage in production and therefore might also help performance. The steps would be:

  • Tweak the version in package.json to something unique for testing like 0.18.0-moveonly
  • Publish C++ binaries: `git commit -am "[publish binary]"
  • Test in carmen and if things look good
  • Stage, and if things look good
  • Run out an official release of carmen-cache by incrementing the version to 0.19.0 and then:
    • Publishing binaries again
    • Then running npm publish
    • Updating carmen to 0.19.0 of carmen cache

If #96 is looking good, this could also be merged into master and tested alongside that as it is deployed.

@springmeyer
Copy link
Contributor Author

Note, after #116 lands I think it would be a good time to revisit this and get it landed. Basically making sure that the Context is noncopyable to ensure it is moved by the compiler in all cases:

+    Context() = delete;
+    Context(Context const& c) = delete;
+    Context& operator=(Context const& c) = delete;

Also as a future idea/ticket: would be great to apply learnings on sorting optimizations from vtquery. Whoever wants to pick this up reach out to @mapsam for tips on sorting optimizations in C++.

@apendleton
Copy link
Contributor

We made Context noncopyable in #116 . @springmeyer closing as I think ultimately that's all we wanted to carry over from here, but if there was other stuff from the earlier iterations of this PR that you wanted us to do as well, feel free to reopen.

@springmeyer
Copy link
Contributor Author

Sounds good @apendleton - that should help. I've ticketed #120 to log the idea of using partial_sort in the future.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants