Conversation
GHC spits out quite a lot of stats, from which I have selected a few entries that seemed most useful. I'd appreciate feedback on whether this selection is optimal. The raw GHC output is collapsed in the original comment.
fc4c297 to 2eca1d8
Using https://hackage.haskell.org/package/base-4.17.0.0/docs/GHC-Stats.html as a reference, and assuming they mean what I think they mean, I suggest at least:

Open questions:
|
```python
primer.fail("journalctl -b -u primer | grep -q ERROR")
primer.succeed("systemctl stop primer")
ghcstats = primer.succeed("tail -n+2 /tmp/ghc-stats")
for (k,v) in ast.literal_eval(ghcstats):
```
I wouldn't want to do this in production code, but it's a pretty nice hack for a benchmark test.
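For context: with `+RTS -t/tmp/ghc-stats --machine-readable` (hence the `tail -n+2`), GHC writes a one-line header followed by a list of (name, value) string pairs whose syntax happens to be valid Python literal syntax as well, so dropping the header and handing the rest to `ast.literal_eval` is all the parsing needed. A minimal sketch, with made-up sample values:

```python
import ast

# Shape of GHC's "+RTS -t<file> --machine-readable" output: a header line,
# then (name, value) string pairs. The values below are hypothetical.
raw = """primer-service +RTS -t/tmp/ghc-stats --machine-readable
 [("bytes allocated", "9171656")
 ,("num_GCs", "17")
 ,("max_bytes_used", "390168")
 ]
"""

# Drop the header line (what `tail -n+2` does in the test), then let
# Python's literal parser handle the Haskell-shaped list of pairs.
stats = ast.literal_eval(raw.split("\n", 1)[1])
for (k, v) in stats:
    print(k, int(v))
```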
Yeah, if we want to get these stats in production code, we should use https://downloads.haskell.org/ghc/latest/docs/libraries/base-4.17.0.0/GHC-Stats.html to expose stats from a running service.
Actually, come to think of it, don't we get live stats from our dev deployment? How does that work?
Yes, we use a Prometheus scraper to monitor live instances.
Ah, that's interesting. I could instead make a request to /metrics and scrape that, rather than killing the server. I might look into that. Those metrics look like this:
# HELP http_request_duration_seconds The HTTP request latencies in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.005"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.01"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.025"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.05"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.1"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.25"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="0.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="1.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="2.5"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="5.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="10.0"} 1
http_request_duration_seconds_bucket{handler="app",method="GET",status_code="200",le="+Inf"} 1
http_request_duration_seconds_sum{handler="app",method="GET",status_code="200"} 5.9314e-5
http_request_duration_seconds_count{handler="app",method="GET",status_code="200"} 1
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.005"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.01"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.025"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.05"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.1"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.25"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="0.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="1.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="2.5"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="5.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="10.0"} 3
http_request_duration_seconds_bucket{handler="prometheus",method="GET",status_code="200",le="+Inf"} 3
http_request_duration_seconds_sum{handler="prometheus",method="GET",status_code="200"} 2.91299e-4
http_request_duration_seconds_count{handler="prometheus",method="GET",status_code="200"} 3
# HELP ghc_gcs_total Total number of GCs
# TYPE ghc_gcs_total counter
ghc_gcs_total 17
# HELP ghc_major_gcs_total Total number of major (oldest generation) GCs
# TYPE ghc_major_gcs_total counter
ghc_major_gcs_total 16
# HELP ghc_allocated_bytes_total Total bytes allocated
# TYPE ghc_allocated_bytes_total counter
ghc_allocated_bytes_total 9171656
# HELP ghc_max_live_bytes Maximum live data (including large objects + compact regions)
# TYPE ghc_max_live_bytes gauge
ghc_max_live_bytes 390168
# HELP ghc_max_large_objects_bytes Maximum live data in large objects
# TYPE ghc_max_large_objects_bytes gauge
ghc_max_large_objects_bytes 116808
# HELP ghc_max_compact_bytes Maximum live data in compact regions
# TYPE ghc_max_compact_bytes gauge
ghc_max_compact_bytes 0
# HELP ghc_max_slop_bytes Maximum slop
# TYPE ghc_max_slop_bytes gauge
ghc_max_slop_bytes 56528
# HELP ghc_max_mem_in_use_bytes Maximum memory in use by the RTS
# TYPE ghc_max_mem_in_use_bytes gauge
ghc_max_mem_in_use_bytes 23068672
# HELP ghc_cumulative_live_bytes_total Sum of live bytes across all major GCs. Divided by major_gcs gives the average live data over the lifetime of the program.
# TYPE ghc_cumulative_live_bytes_total counter
ghc_cumulative_live_bytes_total 5366288
# HELP ghc_copied_bytes_total Sum of copied_bytes across all GCs
# TYPE ghc_copied_bytes_total counter
ghc_copied_bytes_total 3967168
# HELP ghc_par_copied_bytes_total Sum of copied_bytes across all parallel GCs
# TYPE ghc_par_copied_bytes_total counter
ghc_par_copied_bytes_total 3967168
# HELP ghc_cumulative_par_max_copied_bytes_total Sum of par_max_copied_bytes across all parallel GCs
# TYPE ghc_cumulative_par_max_copied_bytes_total counter
ghc_cumulative_par_max_copied_bytes_total 1819704
# HELP ghc_mutator_cpu_seconds_total Total CPU time used by the mutator
# TYPE ghc_mutator_cpu_seconds_total counter
ghc_mutator_cpu_seconds_total 0.043725594
# HELP ghc_mutator_elapsed_seconds_total Total elapsed time used by the mutator
# TYPE ghc_mutator_elapsed_seconds_total counter
ghc_mutator_elapsed_seconds_total 238.947842380
# HELP ghc_gc_cpu_seconds_total Total CPU time used by the GC
# TYPE ghc_gc_cpu_seconds_total counter
ghc_gc_cpu_seconds_total 0.092954949
# HELP ghc_gc_elapsed_seconds_total Total elapsed time used by the GC
# TYPE ghc_gc_elapsed_seconds_total counter
ghc_gc_elapsed_seconds_total 0.056046031
# HELP ghc_cpu_seconds_total Total CPU time (at the previous GC)
# TYPE ghc_cpu_seconds_total counter
ghc_cpu_seconds_total 0.136680543
# HELP ghc_elapsed_seconds_total Total elapsed time (at the previous GC)
# TYPE ghc_elapsed_seconds_total counter
ghc_elapsed_seconds_total 239.003888411
# HELP ghc_gcdetails_gen The generation number of this GC
# TYPE ghc_gcdetails_gen histogram
ghc_gcdetails_gen 1
# HELP ghc_gcdetails_threads Number of threads used in this GC
# TYPE ghc_gcdetails_threads gauge
ghc_gcdetails_threads 4
# HELP ghc_gcdetails_allocated_bytes Number of bytes allocated since the previous GC
# TYPE ghc_gcdetails_allocated_bytes gauge
ghc_gcdetails_allocated_bytes 1712
# HELP ghc_gcdetails_live_bytes Total amount of live data in the heap (including large + compact data)
# TYPE ghc_gcdetails_live_bytes gauge
ghc_gcdetails_live_bytes 382216
# HELP ghc_gcdetails_large_objects_bytes Total amount of live data in large objects
# TYPE ghc_gcdetails_large_objects_bytes gauge
ghc_gcdetails_large_objects_bytes 111264
# HELP ghc_gcdetails_compact_bytes Total amount of live data in compact regions
# TYPE ghc_gcdetails_compact_bytes gauge
ghc_gcdetails_compact_bytes 0
# HELP ghc_gcdetails_slop_bytes Total amount of slop (wasted memory)
# TYPE ghc_gcdetails_slop_bytes gauge
ghc_gcdetails_slop_bytes 47864
# HELP ghc_gcdetails_mem_in_use_bytes Total amount of memory in use by the RTS
# TYPE ghc_gcdetails_mem_in_use_bytes counter
ghc_gcdetails_mem_in_use_bytes 23068672
# HELP ghc_gcdetails_copied_bytes Total amount of data copied during this GC
# TYPE ghc_gcdetails_copied_bytes gauge
ghc_gcdetails_copied_bytes 272112
# HELP ghc_gcdetails_par_max_copied_bytes In parallel GC, the max amount of data copied by any one thread
# TYPE ghc_gcdetails_par_max_copied_bytes gauge
ghc_gcdetails_par_max_copied_bytes 90112
# HELP ghc_gcdetails_sync_elapsed_seconds The time elapsed during synchronisation before GC
# TYPE ghc_gcdetails_sync_elapsed_seconds gauge
ghc_gcdetails_sync_elapsed_seconds 0.000127675
# HELP ghc_gcdetails_cpu_seconds The CPU time used during GC itself
# TYPE ghc_gcdetails_cpu_seconds gauge
ghc_gcdetails_cpu_seconds 0.006119019
# HELP ghc_gcdetails_elapsed_seconds The time elapsed during GC itself
# TYPE ghc_gcdetails_elapsed_seconds gauge
ghc_gcdetails_elapsed_seconds 0.002675058
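For reference, a minimal sketch of that scraping approach, assuming the service serves the text format above on /metrics (the host and port are hypothetical, and this ignores Prometheus features the dump doesn't use, such as trailing timestamps):

```python
import urllib.request

# Hypothetical endpoint; the real host/port depend on the deployment.
METRICS_URL = "http://localhost:8081/metrics"

def scrape_ghc_metrics(url=METRICS_URL):
    """Fetch the Prometheus text exposition and keep the ghc_* samples."""
    body = urllib.request.urlopen(url).read().decode("utf-8")
    metrics = {}
    for line in body.splitlines():
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        if name.startswith("ghc_"):  # the RTS metrics carry no labels
            metrics[name] = float(value)
    return metrics

# e.g. scrape_ghc_metrics()["ghc_allocated_bytes_total"] == 9171656.0
```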
I suppose it would also be a good test to ensure that the Prometheus endpoint is working, but I'm satisfied with the current approach.
On second thought, it is probably just as easy to extract and parse the Prometheus metrics. That approach would also work with the Docker image, if we ever switch to that rather than a bespoke NixOS service. So sure, let's go for it.
I have made this change in 1b9ee56 (kept as a separate commit to aid review), which I'll squash before merging.
We explicitly whitelist a small selection of the reported statistics, to make the results easier to read.
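As a sketch of the shape of that whitelisting (the metric names come from the dump above, but the actual selection used in the test may differ):

```python
# Illustrative whitelist; the selection actually used lives in the test.
WHITELIST = {
    "ghc_gcs_total",
    "ghc_major_gcs_total",
    "ghc_allocated_bytes_total",
    "ghc_max_live_bytes",
    "ghc_mutator_cpu_seconds_total",
    "ghc_gc_cpu_seconds_total",
}

def report(metrics):
    """Print only the whitelisted metrics, in a stable order."""
    for name in sorted(WHITELIST):
        if name in metrics:
            print(f"{name} = {metrics[name]}")
```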
1b9ee56 to 01b0171