
feat: nixos-bench reports ghc stats #890

Merged

brprice merged 1 commit into main from brprice/nixos-bench-ghc-stats on Mar 2, 2023

Conversation

brprice (Contributor) commented Feb 27, 2023

We explicitly whitelist a small selection of the reported statistics, to make the results easier to read.

brprice (Contributor, Author) commented Feb 27, 2023

GHC spits out quite a lot of stats, from which I have selected the few entries that seemed most useful. I'd appreciate feedback on whether this selection is sensible.

The raw data from GHC looks like this:
'/nix/store/nrihhkxm7r6gclc9bzs69nzs5s4kynpn-primer-service-exe-primer-service-0.7.2.0/bin/primer-service' 'serve' 'vUNKNOWN' '--port' '8081' +RTS '-N' '-t/tmp/ghc-stats' '--machine-readable' 
 [("bytes allocated", "407616")
 ,("num_GCs", "4")
 ,("average_bytes_used", "107088")
 ,("max_bytes_used", "115392")
 ,("num_byte_usage_samples", "4")
 ,("peak_megabytes_allocated", "6")
 ,("init_cpu_seconds", "0.005759")
 ,("init_wall_seconds", "0.053852")
 ,("mut_cpu_seconds", "0.048533")
 ,("mut_wall_seconds", "20.303372")
 ,("GC_cpu_seconds", "0.224454")
 ,("GC_wall_seconds", "1.306344")
 ,("exit_cpu_seconds", "0.002124")
 ,("exit_wall_seconds", "0.005371")
 ,("total_cpu_seconds", "0.280879")
 ,("total_wall_seconds", "21.668947")
 ,("major_gcs", "4")
 ,("allocated_bytes", "407616")
 ,("max_live_bytes", "115392")
 ,("max_large_objects_bytes", "90184")
 ,("max_compact_bytes", "0")
 ,("max_slop_bytes", "28888")
 ,("max_mem_in_use_bytes", "6291456")
 ,("cumulative_live_bytes", "428352")
 ,("copied_bytes", "93440")
 ,("par_copied_bytes", "0")
 ,("cumulative_par_max_copied_bytes", "0")
 ,("cumulative_par_balanced_copied_bytes", "0")
 ,("fragmentation_bytes", "0")
 ,("alloc_rate", "8398741")
 ,("productivity_cpu_percent", "0.172823")
 ,("productivity_wall_percent", "0.936980")
 ,("bound_task_count", "1")
 ,("sparks_count", "0")
 ,("sparks_converted", "0")
 ,("sparks_overflowed", "0")
 ,("sparks_dud ", "0")
 ,("sparks_gcd", "0")
 ,("sparks_fizzled", "0")
 ,("work_balance", "0.000000")
 ,("n_capabilities", "1")
 ,("task_count", "5")
 ,("peak_worker_count", "4")
 ,("worker_count", "4")
 ,("gc_alloc_block_sync_spin", "0")
 ,("gc_alloc_block_sync_yield", "0")
 ,("gc_alloc_block_sync_spin", "0")
 ,("waitForGcThreads_spin", "0")
 ,("waitForGcThreads_yield", "0")
 ,("whitehole_gc_spin", "0")
 ,("whitehole_lockClosure_spin", "0")
 ,("whitehole_lockClosure_yield", "0")
 ,("whitehole_executeMessage_spin", "0")
 ,("whitehole_threadPaused_spin", "0")
 ,("any_work", "0")
 ,("scav_find_work", "0")
 ,("max_n_todo_overflow", "0")
 ,("gen_0_collections", "0")
 ,("gen_0_par_collections", "0")
 ,("gen_0_cpu_seconds", "0.000000")
 ,("gen_0_wall_seconds", "0.000000")
 ,("gen_0_max_pause_seconds", "0.000000")
 ,("gen_0_avg_pause_seconds", "0.000000")
 ,("gen_0_sync_spin", "0")
 ,("gen_0_sync_yield", "0")
 ,("gen_1_collections", "4")
 ,("gen_1_par_collections", "0")
 ,("gen_1_cpu_seconds", "0.224454")
 ,("gen_1_wall_seconds", "1.306344")
 ,("gen_1_max_pause_seconds", "1.300824")
 ,("gen_1_avg_pause_seconds", "0.326586")
 ,("gen_1_sync_spin", "0")
 ,("gen_1_sync_yield", "0")
 ]
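
For illustration: the first line of the stats file echoes the command invocation (hence the tail -n+2 in the test below), and the rest is a Haskell-show list of (name, value) string pairs, which also happens to be valid Python literal syntax. A minimal sketch of the whitelisting; the selection of keys here is purely illustrative, not the final list:

import ast

# Purely illustrative selection; the actual whitelist is up for discussion.
WHITELIST = {
    "max_live_bytes",
    "max_mem_in_use_bytes",
    "mut_cpu_seconds",
    "GC_cpu_seconds",
    "major_gcs",
}

with open("/tmp/ghc-stats") as f:
    f.readline()  # skip the first line, which echoes the command invocation
    # The remainder parses as a Python literal: a list of (name, value) pairs.
    stats = dict(ast.literal_eval(f.read()))

for key in sorted(WHITELIST & stats.keys()):
    print(key, stats[key])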

brprice force-pushed the brprice/nixos-bench-ghc-stats branch from fc4c297 to 2eca1d8 on February 27, 2023 15:28
brprice requested a review from a team on February 27, 2023 15:28
dhess (Member) commented Feb 27, 2023

Using https://hackage.haskell.org/package/base-4.17.0.0/docs/GHC-Stats.html as a reference, and assuming they mean what I think they mean, I suggest at least:

"num_gcs"
"max_mem_in_use_bytes"
"GC_cpu_seconds"

Open questions:

  • It appears there's no way to get the average bytes in use; is that right?
  • I don't think we particularly care about the live bytes numbers, am I wrong? What we really care about is how much memory the process is taking, so that we can appropriately size our VM instances. Does GHC look at available memory and use that to decide how often to GC? (I seem to remember a discussion a while ago about how we could use some flags to the RTS to say, "try to stay under this amount of allocated memory"; might be a good time to turn that on, at least for these tests. See the note after this list.)
  • "gen_1_max_pause_seconds" and "gen_1_avg_pause_seconds": I'm guessing these refer to the nonmoving GC "post-mark pause phase." Is this the amount of time that program execution is paused during a GC?

dhess (Member) left a comment

Thanks!

primer.fail("journalctl -b -u primer | grep -q ERROR")
primer.succeed("systemctl stop primer")
ghcstats = primer.succeed("tail -n+2 /tmp/ghc-stats")
for (k,v) in ast.literal_eval(ghcstats):
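    ...  # loop body not shown in this hunk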

dhess (Member) commented

I wouldn't want to do this in production code, but it's a pretty nice hack for a benchmark test.

brprice (Contributor, Author) commented

Yeah, if we want to get these stats in production code, we should use https://downloads.haskell.org/ghc/latest/docs/libraries/base-4.17.0.0/GHC-Stats.html to expose stats from a running service.

Actually, come to think of it, don't we get live stats from our dev deployment? How does that work?

dhess (Member) commented

Yes, we use a Prometheus scraper to monitor live instances.

brprice (Contributor, Author) commented

Ah, that's interesting. I could instead make a request to /metrics and scrape that, rather than killing the server; I might look into that. Those metrics look like this:

# HELP http_request_duration_seconds The HTTP request latencies in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.005"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.01"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.025"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.05"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.1"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.25"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="1.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="2.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="5.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="10.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="+Inf"} 1
http_request_duration_seconds_sum{handler="app",method="GET",status_code="200"} 5.9314e-5
http_request_duration_seconds_count{handler="app",method="GET",status_code="200"} 1
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.005"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.01"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.025"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.05"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.1"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.25"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="1.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="2.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="5.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="10.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="+Inf"} 3
http_request_duration_seconds_sum{handler="prometheus",method="GET",status_code="200"} 2.91299e-4
http_request_duration_seconds_count{handler="prometheus",method="GET",status_code="200"} 3
# HELP ghc_gcs_total Total number of GCs
# TYPE ghc_gcs_total counter
ghc_gcs_total 17
# HELP ghc_major_gcs_total Total number of major (oldest generation) GCs
# TYPE ghc_major_gcs_total counter
ghc_major_gcs_total 16
# HELP ghc_allocated_bytes_total Total bytes allocated
# TYPE ghc_allocated_bytes_total counter
ghc_allocated_bytes_total 9171656
# HELP ghc_max_live_bytes Maximum live data (including large objects + compact regions)
# TYPE ghc_max_live_bytes gauge
ghc_max_live_bytes 390168
# HELP ghc_max_large_objects_bytes Maximum live data in large objects
# TYPE ghc_max_large_objects_bytes gauge
ghc_max_large_objects_bytes 116808
# HELP ghc_max_compact_bytes Maximum live data in compact regions
# TYPE ghc_max_compact_bytes gauge
ghc_max_compact_bytes 0
# HELP ghc_max_slop_bytes Maximum slop
# TYPE ghc_max_slop_bytes gauge
ghc_max_slop_bytes 56528
# HELP ghc_max_mem_in_use_bytes Maximum memory in use by the RTS
# TYPE ghc_max_mem_in_use_bytes gauge
ghc_max_mem_in_use_bytes 23068672
# HELP ghc_cumulative_live_bytes_total Sum of live bytes across all major GCs. Divided by major_gcs gives the average live data over the lifetime of the program.
# TYPE ghc_cumulative_live_bytes_total counter
ghc_cumulative_live_bytes_total 5366288
# HELP ghc_copied_bytes_total Sum of copied_bytes across all GCs
# TYPE ghc_copied_bytes_total counter
ghc_copied_bytes_total 3967168
# HELP ghc_par_copied_bytes_total Sum of copied_bytes across all parallel GCs
# TYPE ghc_par_copied_bytes_total counter
ghc_par_copied_bytes_total 3967168
# HELP ghc_cumulative_par_max_copied_bytes_total Sum of par_max_copied_bytes across all parallel GCs
# TYPE ghc_cumulative_par_max_copied_bytes_total counter
ghc_cumulative_par_max_copied_bytes_total 1819704
# HELP ghc_mutator_cpu_seconds_total Total CPU time used by the mutator
# TYPE ghc_mutator_cpu_seconds_total counter
ghc_mutator_cpu_seconds_total 0.043725594
# HELP ghc_mutator_elapsed_seconds_total Total elapsed time used by the mutator
# TYPE ghc_mutator_elapsed_seconds_total counter
ghc_mutator_elapsed_seconds_total 238.947842380
# HELP ghc_gc_cpu_seconds_total Total CPU time used by the GC
# TYPE ghc_gc_cpu_seconds_total counter
ghc_gc_cpu_seconds_total 0.092954949
# HELP ghc_gc_elapsed_seconds_total Total elapsed time used by the GC
# TYPE ghc_gc_elapsed_seconds_total counter
ghc_gc_elapsed_seconds_total 0.056046031
# HELP ghc_cpu_seconds_total Total CPU time (at the previous GC)
# TYPE ghc_cpu_seconds_total counter
ghc_cpu_seconds_total 0.136680543
# HELP ghc_elapsed_seconds_total Total elapsed time (at the previous GC)
# TYPE ghc_elapsed_seconds_total counter
ghc_elapsed_seconds_total 239.003888411
# HELP ghc_gcdetails_gen The generation number of this GC
# TYPE ghc_gcdetails_gen histogram
ghc_gcdetails_gen 1
# HELP ghc_gcdetails_threads Number of threads used in this GC
# TYPE ghc_gcdetails_threads gauge
ghc_gcdetails_threads 4
# HELP ghc_gcdetails_allocated_bytes Number of bytes allocated since the previous GC
# TYPE ghc_gcdetails_allocated_bytes gauge
ghc_gcdetails_allocated_bytes 1712
# HELP ghc_gcdetails_live_bytes Total amount of live data in the heap (including large + compact data)
# TYPE ghc_gcdetails_live_bytes gauge
ghc_gcdetails_live_bytes 382216
# HELP ghc_gcdetails_large_objects_bytes Total amount of live data in large objects
# TYPE ghc_gcdetails_large_objects_bytes gauge
ghc_gcdetails_large_objects_bytes 111264
# HELP ghc_gcdetails_compact_bytes Total amount of live data in compact regions
# TYPE ghc_gcdetails_compact_bytes gauge
ghc_gcdetails_compact_bytes 0
# HELP ghc_gcdetails_slop_bytes Total amount of slop (wasted memory)
# TYPE ghc_gcdetails_slop_bytes gauge
ghc_gcdetails_slop_bytes 47864
# HELP ghc_gcdetails_mem_in_use_bytes Total amount of memory in use by the RTS
# TYPE ghc_gcdetails_mem_in_use_bytes counter
ghc_gcdetails_mem_in_use_bytes 23068672
# HELP ghc_gcdetails_copied_bytes Total amount of data copied during this GC
# TYPE ghc_gcdetails_copied_bytes gauge
ghc_gcdetails_copied_bytes 272112
# HELP ghc_gcdetails_par_max_copied_bytes In parallel GC, the max amount of data copied by any one thread
# TYPE ghc_gcdetails_par_max_copied_bytes gauge
ghc_gcdetails_par_max_copied_bytes 90112
# HELP ghc_gcdetails_sync_elapsed_seconds The time elapsed during synchronisation before GC
# TYPE ghc_gcdetails_sync_elapsed_seconds gauge
ghc_gcdetails_sync_elapsed_seconds 0.000127675
# HELP ghc_gcdetails_cpu_seconds The CPU time used during GC itself
# TYPE ghc_gcdetails_cpu_seconds gauge
ghc_gcdetails_cpu_seconds 0.006119019
# HELP ghc_gcdetails_elapsed_seconds The time elapsed during GC itself
# TYPE ghc_gcdetails_elapsed_seconds gauge
ghc_gcdetails_elapsed_seconds 0.002675058
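
A sketch of how the test could scrape these instead of stopping the service, assuming curl is available in the test VM and that the service listens on port 8081 as in the invocation earlier; the metric names are taken from the dump above:

import re

metrics = primer.succeed("curl -fsS http://localhost:8081/metrics")

# Illustrative selection of RTS metrics, named as in the dump above.
wanted = {"ghc_gcs_total", "ghc_max_mem_in_use_bytes", "ghc_gc_cpu_seconds_total"}

for line in metrics.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines and HELP/TYPE comments
    name, _, value = line.partition(" ")
    name = re.sub(r"\{.*\}$", "", name)  # drop any {label="..."} suffix
    if name in wanted:
        print(name, float(value))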

dhess (Member) commented Feb 28, 2023

I suppose it would also be a good test to ensure that the Prometheus endpoint is working, but I'm satisfied with the current approach.

On second thought, it is probably just as easy to extract the Prometheus metrics and parse those, and this approach would also work with the Docker image, if we want to switch to using that rather than a bespoke NixOS service. So sure, let's go for it.

brprice (Contributor, Author) commented

I have made this change in 1b9ee56 (a separate commit to aid review), which I'll squash before merging.

brprice force-pushed the brprice/nixos-bench-ghc-stats branch from 1b9ee56 to 01b0171 on March 2, 2023 11:13
brprice enabled auto-merge on March 2, 2023 11:13
brprice added this pull request to the merge queue on Mar 2, 2023
Merged via the queue into main with commit a9fba0a on Mar 2, 2023
brprice deleted the brprice/nixos-bench-ghc-stats branch on March 2, 2023 11:30