
feat: nixos-bench reports ghc stats #890

Merged

brprice merged 1 commit into main from brprice/nixos-bench-ghc-stats on Mar 2, 2023

Conversation

brprice (Contributor) commented Feb 27, 2023

We explicitly whitelist a small selection of the reported statistics, to make the results easier to read.

brprice (Contributor, Author) commented Feb 27, 2023

GHC spits out quite a lot of stats, from which I have selected the few entries that seemed most useful. I'd appreciate feedback on whether this selection is sensible.

The raw data from GHC looks like this:
'/nix/store/nrihhkxm7r6gclc9bzs69nzs5s4kynpn-primer-service-exe-primer-service-0.7.2.0/bin/primer-service' 'serve' 'vUNKNOWN' '--port' '8081' +RTS '-N' '-t/tmp/ghc-stats' '--machine-readable' 
 [("bytes allocated", "407616")
 ,("num_GCs", "4")
 ,("average_bytes_used", "107088")
 ,("max_bytes_used", "115392")
 ,("num_byte_usage_samples", "4")
 ,("peak_megabytes_allocated", "6")
 ,("init_cpu_seconds", "0.005759")
 ,("init_wall_seconds", "0.053852")
 ,("mut_cpu_seconds", "0.048533")
 ,("mut_wall_seconds", "20.303372")
 ,("GC_cpu_seconds", "0.224454")
 ,("GC_wall_seconds", "1.306344")
 ,("exit_cpu_seconds", "0.002124")
 ,("exit_wall_seconds", "0.005371")
 ,("total_cpu_seconds", "0.280879")
 ,("total_wall_seconds", "21.668947")
 ,("major_gcs", "4")
 ,("allocated_bytes", "407616")
 ,("max_live_bytes", "115392")
 ,("max_large_objects_bytes", "90184")
 ,("max_compact_bytes", "0")
 ,("max_slop_bytes", "28888")
 ,("max_mem_in_use_bytes", "6291456")
 ,("cumulative_live_bytes", "428352")
 ,("copied_bytes", "93440")
 ,("par_copied_bytes", "0")
 ,("cumulative_par_max_copied_bytes", "0")
 ,("cumulative_par_balanced_copied_bytes", "0")
 ,("fragmentation_bytes", "0")
 ,("alloc_rate", "8398741")
 ,("productivity_cpu_percent", "0.172823")
 ,("productivity_wall_percent", "0.936980")
 ,("bound_task_count", "1")
 ,("sparks_count", "0")
 ,("sparks_converted", "0")
 ,("sparks_overflowed", "0")
 ,("sparks_dud ", "0")
 ,("sparks_gcd", "0")
 ,("sparks_fizzled", "0")
 ,("work_balance", "0.000000")
 ,("n_capabilities", "1")
 ,("task_count", "5")
 ,("peak_worker_count", "4")
 ,("worker_count", "4")
 ,("gc_alloc_block_sync_spin", "0")
 ,("gc_alloc_block_sync_yield", "0")
 ,("gc_alloc_block_sync_spin", "0")
 ,("waitForGcThreads_spin", "0")
 ,("waitForGcThreads_yield", "0")
 ,("whitehole_gc_spin", "0")
 ,("whitehole_lockClosure_spin", "0")
 ,("whitehole_lockClosure_yield", "0")
 ,("whitehole_executeMessage_spin", "0")
 ,("whitehole_threadPaused_spin", "0")
 ,("any_work", "0")
 ,("scav_find_work", "0")
 ,("max_n_todo_overflow", "0")
 ,("gen_0_collections", "0")
 ,("gen_0_par_collections", "0")
 ,("gen_0_cpu_seconds", "0.000000")
 ,("gen_0_wall_seconds", "0.000000")
 ,("gen_0_max_pause_seconds", "0.000000")
 ,("gen_0_avg_pause_seconds", "0.000000")
 ,("gen_0_sync_spin", "0")
 ,("gen_0_sync_yield", "0")
 ,("gen_1_collections", "4")
 ,("gen_1_par_collections", "0")
 ,("gen_1_cpu_seconds", "0.224454")
 ,("gen_1_wall_seconds", "1.306344")
 ,("gen_1_max_pause_seconds", "1.300824")
 ,("gen_1_avg_pause_seconds", "0.326586")
 ,("gen_1_sync_spin", "0")
 ,("gen_1_sync_yield", "0")
 ]
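
For illustration: the first line of the stats file echoes the command invocation (hence the tail -n+2 in the test below), and the rest is a Haskell-show list of (name, value) string pairs, which also happens to be valid Python literal syntax. A minimal sketch of the whitelisting; the selection of keys here is purely illustrative, not the final list:

import ast

# Purely illustrative selection; the actual whitelist is up for discussion.
WHITELIST = {
    "max_live_bytes",
    "max_mem_in_use_bytes",
    "mut_cpu_seconds",
    "GC_cpu_seconds",
    "major_gcs",
}

with open("/tmp/ghc-stats") as f:
    f.readline()  # skip the first line, which echoes the command invocation
    # The remainder parses as a Python literal: a list of (name, value) pairs.
    stats = dict(ast.literal_eval(f.read()))

for key in sorted(WHITELIST & stats.keys()):
    print(key, stats[key])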

brprice force-pushed the brprice/nixos-bench-ghc-stats branch from fc4c297 to 2eca1d8 on February 27, 2023 15:28
brprice requested a review from a team on February 27, 2023 15:28
dhess (Member) commented Feb 27, 2023

Using https://hackage.haskell.org/package/base-4.17.0.0/docs/GHC-Stats.html as a reference, and assuming they mean what I think they mean, I suggest at least:

"num_gcs"
"max_mem_in_use_bytes"
"GC_cpu_seconds"

Open questions:

  • It appears there's no way to get the average bytes in use; is that right?
  • I don't think we particularly care about the live bytes numbers, am I wrong? What we really care about is how much memory the process is taking, so that we can appropriately size our VM instances. Does GHC look at available memory and use that to decide how often to GC? (I seem to remember a discussion a while ago about how we could use some flags to the RTS to say, "try to stay under this amount of allocated memory"; might be a good time to turn that on, at least for these tests. See the note after this list.)
  • "gen_1_max_pause_seconds" and "gen_1_avg_pause_seconds": I'm guessing these refer to the nonmoving GC "post-mark pause phase." Is this the amount of time that program execution is paused during a GC?

dhess (Member) left a comment

Thanks!

primer.fail("journalctl -b -u primer | grep -q ERROR")
primer.succeed("systemctl stop primer")
ghcstats = primer.succeed("tail -n+2 /tmp/ghc-stats")
for (k,v) in ast.literal_eval(ghcstats):
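    ...  # loop body not shown in this hunk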

dhess (Member) commented

I wouldn't want to do this in production code, but it's a pretty nice hack for a benchmark test.

brprice (Contributor, Author) commented

Yeah, if we want to get these stats in production code, we should use https://downloads.haskell.org/ghc/latest/docs/libraries/base-4.17.0.0/GHC-Stats.html to expose stats from a running service.

Actually, come to think of it, don't we get live stats from our dev deployment? How does that work?

dhess (Member) commented

Yes, we use a Prometheus scraper to monitor live instances.

brprice (Contributor, Author) commented

Ah, that's interesting. I could instead make a request to /metrics and scrape that, rather than killing the server; I might look into that. Those metrics look like this:

# HELP http_request_duration_seconds The HTTP request latencies in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.005"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.01"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.025"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.05"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.1"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.25"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="1.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="2.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="5.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="10.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="+Inf"} 1
http_request_duration_seconds_sum{handler="app",method="GET",status_code="200"} 5.9314e-5
http_request_duration_seconds_count{handler="app",method="GET",status_code="200"} 1
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.005"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.01"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.025"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.05"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.1"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.25"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="1.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="2.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="5.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="10.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="+Inf"} 3
http_request_duration_seconds_sum{handler="prometheus",method="GET",status_code="200"} 2.91299e-4
http_request_duration_seconds_count{handler="prometheus",method="GET",status_code="200"} 3
# HELP ghc_gcs_total Total number of GCs
# TYPE ghc_gcs_total counter
ghc_gcs_total 17
# HELP ghc_major_gcs_total Total number of major (oldest generation) GCs
# TYPE ghc_major_gcs_total counter
ghc_major_gcs_total 16
# HELP ghc_allocated_bytes_total Total bytes allocated
# TYPE ghc_allocated_bytes_total counter
ghc_allocated_bytes_total 9171656
# HELP ghc_max_live_bytes Maximum live data (including large objects + compact regions)
# TYPE ghc_max_live_bytes gauge
ghc_max_live_bytes 390168
# HELP ghc_max_large_objects_bytes Maximum live data in large objects
# TYPE ghc_max_large_objects_bytes gauge
ghc_max_large_objects_bytes 116808
# HELP ghc_max_compact_bytes Maximum live data in compact regions
# TYPE ghc_max_compact_bytes gauge
ghc_max_compact_bytes 0
# HELP ghc_max_slop_bytes Maximum slop
# TYPE ghc_max_slop_bytes gauge
ghc_max_slop_bytes 56528
# HELP ghc_max_mem_in_use_bytes Maximum memory in use by the RTS
# TYPE ghc_max_mem_in_use_bytes gauge
ghc_max_mem_in_use_bytes 23068672
# HELP ghc_cumulative_live_bytes_total Sum of live bytes across all major GCs. Divided by major_gcs gives the average live data over the lifetime of the program.
# TYPE ghc_cumulative_live_bytes_total counter
ghc_cumulative_live_bytes_total 5366288
# HELP ghc_copied_bytes_total Sum of copied_bytes across all GCs
# TYPE ghc_copied_bytes_total counter
ghc_copied_bytes_total 3967168
# HELP ghc_par_copied_bytes_total Sum of copied_bytes across all parallel GCs
# TYPE ghc_par_copied_bytes_total counter
ghc_par_copied_bytes_total 3967168
# HELP ghc_cumulative_par_max_copied_bytes_total Sum of par_max_copied_bytes across all parallel GCs
# TYPE ghc_cumulative_par_max_copied_bytes_total counter
ghc_cumulative_par_max_copied_bytes_total 1819704
# HELP ghc_mutator_cpu_seconds_total Total CPU time used by the mutator
# TYPE ghc_mutator_cpu_seconds_total counter
ghc_mutator_cpu_seconds_total 0.043725594
# HELP ghc_mutator_elapsed_seconds_total Total elapsed time used by the mutator
# TYPE ghc_mutator_elapsed_seconds_total counter
ghc_mutator_elapsed_seconds_total 238.947842380
# HELP ghc_gc_cpu_seconds_total Total CPU time used by the GC
# TYPE ghc_gc_cpu_seconds_total counter
ghc_gc_cpu_seconds_total 0.092954949
# HELP ghc_gc_elapsed_seconds_total Total elapsed time used by the GC
# TYPE ghc_gc_elapsed_seconds_total counter
ghc_gc_elapsed_seconds_total 0.056046031
# HELP ghc_cpu_seconds_total Total CPU time (at the previous GC)
# TYPE ghc_cpu_seconds_total counter
ghc_cpu_seconds_total 0.136680543
# HELP ghc_elapsed_seconds_total Total elapsed time (at the previous GC)
# TYPE ghc_elapsed_seconds_total counter
ghc_elapsed_seconds_total 239.003888411
# HELP ghc_gcdetails_gen The generation number of this GC
# TYPE ghc_gcdetails_gen histogram
ghc_gcdetails_gen 1
# HELP ghc_gcdetails_threads Number of threads used in this GC
# TYPE ghc_gcdetails_threads gauge
ghc_gcdetails_threads 4
# HELP ghc_gcdetails_allocated_bytes Number of bytes allocated since the previous GC
# TYPE ghc_gcdetails_allocated_bytes gauge
ghc_gcdetails_allocated_bytes 1712
# HELP ghc_gcdetails_live_bytes Total amount of live data in the heap (including large + compact data)
# TYPE ghc_gcdetails_live_bytes gauge
ghc_gcdetails_live_bytes 382216
# HELP ghc_gcdetails_large_objects_bytes Total amount of live data in large objects
# TYPE ghc_gcdetails_large_objects_bytes gauge
ghc_gcdetails_large_objects_bytes 111264
# HELP ghc_gcdetails_compact_bytes Total amount of live data in compact regions
# TYPE ghc_gcdetails_compact_bytes gauge
ghc_gcdetails_compact_bytes 0
# HELP ghc_gcdetails_slop_bytes Total amount of slop (wasted memory)
# TYPE ghc_gcdetails_slop_bytes gauge
ghc_gcdetails_slop_bytes 47864
# HELP ghc_gcdetails_mem_in_use_bytes Total amount of memory in use by the RTS
# TYPE ghc_gcdetails_mem_in_use_bytes counter
ghc_gcdetails_mem_in_use_bytes 23068672
# HELP ghc_gcdetails_copied_bytes Total amount of data copied during this GC
# TYPE ghc_gcdetails_copied_bytes gauge
ghc_gcdetails_copied_bytes 272112
# HELP ghc_gcdetails_par_max_copied_bytes In parallel GC, the max amount of data copied by any one thread
# TYPE ghc_gcdetails_par_max_copied_bytes gauge
ghc_gcdetails_par_max_copied_bytes 90112
# HELP ghc_gcdetails_sync_elapsed_seconds The time elapsed during synchronisation before GC
# TYPE ghc_gcdetails_sync_elapsed_seconds gauge
ghc_gcdetails_sync_elapsed_seconds 0.000127675
# HELP ghc_gcdetails_cpu_seconds The CPU time used during GC itself
# TYPE ghc_gcdetails_cpu_seconds gauge
ghc_gcdetails_cpu_seconds 0.006119019
# HELP ghc_gcdetails_elapsed_seconds The time elapsed during GC itself
# TYPE ghc_gcdetails_elapsed_seconds gauge
ghc_gcdetails_elapsed_seconds 0.002675058
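
A sketch of how the test could scrape these instead of stopping the service, assuming curl is available in the test VM and that the service listens on port 8081 as in the invocation earlier; the metric names are taken from the dump above:

import re

metrics = primer.succeed("curl -fsS http://localhost:8081/metrics")

# Illustrative selection of RTS metrics, named as in the dump above.
wanted = {"ghc_gcs_total", "ghc_max_mem_in_use_bytes", "ghc_gc_cpu_seconds_total"}

for line in metrics.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines and HELP/TYPE comments
    name, _, value = line.partition(" ")
    name = re.sub(r"\{.*\}$", "", name)  # drop any {label="..."} suffix
    if name in wanted:
        print(name, float(value))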

dhess (Member) commented Feb 28, 2023

I suppose it would also be a good test to ensure that the Prometheus endpoint is working, but I'm satisfied with the current approach.

On second thought, it is probably just as easy to extract the Prometheus metrics and parse those, and this approach would also work with the Docker image, if we want to switch to using that rather than a bespoke NixOS service. So sure, let's go for it.

brprice (Contributor, Author) commented

I have made this change in 1b9ee56 (a separate commit to aid review), which I'll squash before merging.

brprice force-pushed the brprice/nixos-bench-ghc-stats branch from 1b9ee56 to 01b0171 on March 2, 2023 11:13
brprice enabled auto-merge on March 2, 2023 11:13
brprice added this pull request to the merge queue on Mar 2, 2023
Merged via the queue into main with commit a9fba0a on Mar 2, 2023
brprice deleted the brprice/nixos-bench-ghc-stats branch on March 2, 2023 11:30