@schneems schneems commented Nov 4, 2021

## Perf difference

```
           user     system      total        real
Before:  0.230749   0.005489   0.236238 (  0.237043)
After:   0.197075   0.005009   0.202084 (  0.202950)
```

## Profile code

To generate profiler output, run:

```
$ DEBUG_PERF=1 bundle exec rspec spec/integration/dead_end_spec.rb
```

See the readme for more details. You can run it against the commit before this change and against this change to see the difference.

Before sha: 948ee5c
After sha: b9ff7bd


## How I found the issue


Using `qcachegrind` on Mac with output generated by `RubyProf::CallTreePrinter`, I saw:

![](https://www.dropbox.com/s/xian4mbsgvi8xr7/Screen%20Shot%202021-11-03%20at%203.00.05%20PM.png?raw=1)

A lot of time was being spent in `CodeFrontier#<<`, specifically in the `sort!` call and `CodeBlock#<=>`.
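
For reference, here is a minimal sketch of how callgrind-compatible output can be produced with ruby-prof so qcachegrind can open it. The output path and the placeholder workload are illustrative; the real wiring in this repo lives behind the `DEBUG_PERF` flag described in the readme.

```
require "ruby-prof"
require "fileutils"

# Profile a block of work.
result = RubyProf::Profile.profile do
  # Placeholder workload; the real harness profiles the dead_end
  # integration spec instead.
  1_000.times { (1..100).to_a.shuffle.sort }
end

# Write callgrind.out.* files that qcachegrind can open.
FileUtils.mkdir_p("tmp/profile")
RubyProf::CallTreePrinter.new(result).print(path: "tmp/profile")
```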

## The fix

Because we control insertion into the array, we know that it is always sorted. We can leverage this to insert new elements at the right location instead of appending them and then re-sorting the whole array just to place one element.

I experimented with iterating from front to back and in reverse. In my test case, reverse took 7,373 comparisons while forwards took 3,130. Both represent large savings.
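
As a rough sketch of the forward-scan idea (illustrative only, not the exact code in this PR), the new element is inserted before the first existing element that compares greater than or equal to it:

```
# Keep the array sorted by inserting each new element at its correct
# position instead of appending and re-sorting the whole array.
def add_sorted(array, element)
  index = array.index { |existing| (existing <=> element) >= 0 }
  if index
    array.insert(index, element) # insert before the first >= element
  else
    array << element             # largest so far; append at the end
  end
end

frontier = []
[5, 1, 4, 2, 3].each { |n| add_sorted(frontier, n) }
frontier # => [1, 2, 3, 4, 5]
```

A binary search (for example `Array#bsearch_index`) would cut the comparison count further, but even a linear scan avoids re-sorting the whole array on every insert.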

After changing this logic:

![](https://www.dropbox.com/s/oba4mmrmlndvkac/Screen%20Shot%202021-11-03%20at%202.58.40%20PM.png?raw=1)

You can see `sort!` no longer shows up. Instead, `reject!` is now the hotspot (though it only takes up 8.4% of the time, versus the 54% previously spent in `sort!`).

Strangely, such a large improvement in the profile only yielded about a 1.15x overall speedup. It's still worth it, but not in line with what I expected from the tools.
@schneems schneems merged commit 050a6be into main Nov 4, 2021
@schneems schneems deleted the schneems/insertion-sort branch November 4, 2021 02:17
schneems added a commit to puma/puma that referenced this pull request Aug 31, 2025
… of `4e60f73` which is puma-7.0.0.pre1 (also tagged at https://github.com/puma/puma/releases/tag/v7.0.0.pre1).

This is intended to help debug #3715. The idea is that we could modify puma to output profiling data to understand where time is being spent. I'm unsure if this is the correct approach, but it shows AN idea of how to profile what's going on inside of puma (instead of just guessing). Long term it would be cool to make this flexible and toggleable, similar to https://github.com/ruby/syntax_suggest/blob/8c13ff82656514f9481304da9b9d0bbd6923cceb/spec/spec_helper.rb#L72, but for now let's hardcode everything.
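
As a sketch of what the toggleable version might eventually look like (the method name and guard are hypothetical; this commit hardcodes everything):

```
# Hypothetical opt-in wrapper: only profile when the env var is set,
# so nothing changes in the normal case.
def maybe_profile(out_dir: ENV["PUMA_RUBY_PROF_DIR"])
  return yield unless out_dir

  require "ruby-prof" # lazy require so ruby-prof stays optional
  require "fileutils"

  result = RubyProf::Profile.profile { yield }
  FileUtils.mkdir_p(out_dir)
  RubyProf::CallTreePrinter.new(result).print(path: out_dir)
  result
end

# maybe_profile { server.run } # `server.run` is a placeholder for the real work
```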

Ruby-prof is a massive shotgun for performance work, i.e. don't ever use it in production. It WILL slow your program to a crawl. But the data can be very valuable: if you skim performance PRs on syntax_suggest you'll see I used it to great effect, for example ruby/syntax_suggest#104 (though normally I just look at the HTML output and scan for outliers). It's much harder here because it will capture not just what's happening in our codebase, but anything in rack and the app, which SHOULD take up the bulk of the time (in an ideal world).

You can use it RIGHT NOW:

```
$ PUMA_RUBY_PROF_DIR=tmp bundle exec ruby -w 1 -Ilib bin/puma -w 1 test/rackup/hello.ru
```

When the server exits it will write profiling data to `tmp/<timestamp>`, along with a symlink `tmp/last` pointing to the most recent profile run. Inside there will be N directories for N workers, each with separate profile data.

Another thing to be mindful of is that profiling is "on" even when the server is doing no work. That means any time spent before you start sending requests, and any time spent after requests are done but before you've killed the server, still counts against you. That might skew our results, i.e. if we spend 5s before sending data, then send data for 5s, and it takes us 5s to kill the server ... it might look like the server spent FOREVER waiting to accept a new connection, but that is really just the server idling. I'm not sure how to account for that skew, just be mindful it's there. Suggestions welcome.

It also wouldn't catch things like "one worker is being greedy with accept calls and starving others." Thankfully, the perf bug was reproducible with threads=1 workers=1, so we don't have to care about weird distributed emergent behavior.

Also, no need to ONLY use ruby-prof, you can use any profiler you want. If you feel like another one might be good, go for it (I found a lot of un-needed allocations with memory_profiler in Rails), and there's also vernier and some other sampling profilers. I.e. don't assume this is THE profiler; it's just one I've had success with in the past.

I'm thinking that the next step would be to validate that this output is useful in isolation and/or modify the TechEmpower benchmark to:

- Use the ruby-prof-instrumented puma code
- Make the profile output folder show up on the local machine (the tests run in a docker container, so we need to mount a local directory and set PUMA_RUBY_PROF_DIR to the correct location).

Then once that's set:

- Profile, try to find where things are slow, and make changes
- Re-run TechEmpower with the attempted fix (and without ruby-prof) and see if it improves the results of Puma 7.0.0.pre1 versus Puma 6.6.1

This is overly verbose because I'm pretty busy with my family this weekend, and I'm flying to Amsterdam on Tuesday (Monday is a holiday). I'm hoping people can take this suggestion and run with it if they have time. I'll try to poke at the problem on the plane but ... limited wifi etc. Don't assume I'll figure this out; many eyes make short work of bugs. The more people hacking at this, the faster it will go.