@schneems schneems commented Nov 4, 2021

## Perf difference

```
           user     system      total        real
Before:  0.230749   0.005489   0.236238 (  0.237043)
After:   0.197075   0.005009   0.202084 (  0.202950)
```

## Profile code

To generate profiler output, run:

```
$ DEBUG_PERF=1 bundle exec rspec spec/integration/dead_end_spec.rb
```

See the readme for more details. You can run it against the commit before this change and against this change to see the difference.

Before sha: 948ee5c
After sha: b9ff7bd


## How I found the issue


Using `qcachegrind` on Mac with output generated by `RubyProf::CallTreePrinter`, I saw:

![](https://www.dropbox.com/s/xian4mbsgvi8xr7/Screen%20Shot%202021-11-03%20at%203.00.05%20PM.png?raw=1)

A lot of time was being spent in `CodeFrontier#<<`, specifically in the `sort!` call and `CodeBlock#<=>`.
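
For reference, here is a minimal sketch of how callgrind-compatible output can be produced with ruby-prof so qcachegrind can open it. The output path and the placeholder workload are illustrative; the real wiring in this repo lives behind the `DEBUG_PERF` flag described in the readme.

```
require "ruby-prof"
require "fileutils"

# Profile a block of work.
result = RubyProf::Profile.profile do
  # Placeholder workload; the real harness profiles the dead_end
  # integration spec instead.
  1_000.times { (1..100).to_a.shuffle.sort }
end

# Write callgrind.out.* files that qcachegrind can open.
FileUtils.mkdir_p("tmp/profile")
RubyProf::CallTreePrinter.new(result).print(path: "tmp/profile")
```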

## The fix

Because we control insertion into the array, we know that it is always sorted. We can leverage this to insert new elements at the right location instead of appending them and then re-sorting the whole array just to place one element.

I experimented with iterating from front to back and in reverse. In my test case, reverse took 7,373 comparisons while forwards took 3,130. Both represent large savings.
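
As a rough sketch of the forward-scan idea (illustrative only, not the exact code in this PR), the new element is inserted before the first existing element that compares greater than or equal to it:

```
# Keep the array sorted by inserting each new element at its correct
# position instead of appending and re-sorting the whole array.
def add_sorted(array, element)
  index = array.index { |existing| (existing <=> element) >= 0 }
  if index
    array.insert(index, element) # insert before the first >= element
  else
    array << element             # largest so far; append at the end
  end
end

frontier = []
[5, 1, 4, 2, 3].each { |n| add_sorted(frontier, n) }
frontier # => [1, 2, 3, 4, 5]
```

A binary search (for example `Array#bsearch_index`) would cut the comparison count further, but even a linear scan avoids re-sorting the whole array on every insert.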

After changing this logic:

![](https://www.dropbox.com/s/oba4mmrmlndvkac/Screen%20Shot%202021-11-03%20at%202.58.40%20PM.png?raw=1)

You can see `sort!` no longer shows up. Instead, `reject!` is now the hotspot (though it only takes up 8.4% of the time, versus the 54% previously spent in `sort!`).

Strangely, such a large improvement in the profile only yielded about a 1.15x overall speedup. It's still worth it, but not in line with what I expected from the tools.
@schneems schneems merged commit 050a6be into main Nov 4, 2021
@schneems schneems deleted the schneems/insertion-sort branch November 4, 2021 02:17
schneems added a commit to puma/puma that referenced this pull request Aug 31, 2025
… of `4e60f73` which is puma-7.0.0.pre1 (also tagged at https://github.com/puma/puma/releases/tag/v7.0.0.pre1).

This is intended to help debug #3715. The idea is that we could modify puma to output profiling data to understand where time is being spent. I'm unsure if this is the correct approach, but it shows AN idea of how to profile what's going on inside of puma (instead of just guessing). Long term it would be cool to make this flexible and toggleable, similar to https://github.com/ruby/syntax_suggest/blob/8c13ff82656514f9481304da9b9d0bbd6923cceb/spec/spec_helper.rb#L72, but for now let's hardcode everything.
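
As a sketch of what the toggleable version might eventually look like (the method name and guard are hypothetical; this commit hardcodes everything):

```
# Hypothetical opt-in wrapper: only profile when the env var is set,
# so nothing changes in the normal case.
def maybe_profile(out_dir: ENV["PUMA_RUBY_PROF_DIR"])
  return yield unless out_dir

  require "ruby-prof" # lazy require so ruby-prof stays optional
  require "fileutils"

  result = RubyProf::Profile.profile { yield }
  FileUtils.mkdir_p(out_dir)
  RubyProf::CallTreePrinter.new(result).print(path: out_dir)
  result
end

# maybe_profile { server.run } # `server.run` is a placeholder for the real work
```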

Ruby-prof is a massive shotgun for performance work, i.e. don't ever use it in production. It WILL slow your program to a crawl. But the data can be very valuable: if you skim performance PRs on syntax_suggest you'll see I used it to great effect, for example ruby/syntax_suggest#104 (though normally I just look at the HTML output and scan for outliers). It's much harder here because it will capture not just what's happening in our codebase, but anything in rack and the app, which SHOULD take up the bulk of the time (in an ideal world).

You can use it RIGHT NOW:

```
$ PUMA_RUBY_PROF_DIR=tmp bundle exec ruby -w 1 -Ilib bin/puma -w 1 test/rackup/hello.ru
```

When the server exits it will write profiling data to `tmp/<timestamp>`, along with a symlink `tmp/last` pointing to the most recent profile run. Inside there will be N directories for N workers, each with separate profile data.

Another thing to be mindful of is that profiling is "on" even when the server is doing no work. That means any time spent before you start sending requests, and any time spent after requests are done but before you've killed the server, still counts against you. That might skew our results, i.e. if we spend 5s before sending data, then send data for 5s, and it takes us 5s to kill the server ... it might look like the server spent FOREVER waiting to accept a new connection, but that is really just the server idling. I'm not sure how to account for that skew, just be mindful it's there. Suggestions welcome.

It also wouldn't catch things like "one worker is being greedy with accept calls and starving others." Thankfully, the perf bug was reproducible with threads=1 workers=1, so we don't have to care about weird distributed emergent behavior.

Also, no need to ONLY use ruby-prof, you can use any profiler you want. If you feel like another one might be good, go for it (I found a lot of un-needed allocations with memory_profiler in Rails), and there's also vernier and some other sampling profilers. I.e. don't assume this is THE profiler; it's just one I've had success with in the past.

I'm thinking that the next step would be to validate that this output is useful in isolation and/or modify the TechEmpower benchmark to:

- Use the ruby-prof-instrumented puma code
- Make the profile output folder show up on the local machine (the tests run in a docker container, so we need to mount a local directory and set PUMA_RUBY_PROF_DIR to the correct location).

Then once that's set:

- Profile, try to find where things are slow, and make changes
- Re-run TechEmpower with the attempted fix (and without ruby-prof) and see if it improves the results of Puma 7.0.0.pre1 versus Puma 6.6.1

This is overly verbose because I'm pretty busy with my family this weekend, and I'm flying to Amsterdam on Tuesday (Monday is a holiday). I'm hoping people can take this suggestion and run with it if they have time. I'll try to poke at the problem on the plane but ... limited wifi etc. Don't assume I'll figure this out; many eyes make short work of bugs. The more people hacking at this, the faster it will go.