Rework of -profile=gc calculations#1846
Rework of -profile=gc calculations#1846mihails-strasuns-sociomantic wants to merge 9 commits intodlang:masterfrom
Conversation
e429767 to
b1b02d8
Compare
|
(going to abuse auto-tester a bit) |
68324f6 to
91b7fb9
Compare
leandro-lucarella-sociomantic
left a comment
There was a problem hiding this comment.
Besides the small comments and the non-in-depth review of the code generating bits. LGTM. It is much more sensible than the previous implementation.
| @@ -529,7 +529,7 @@ class ConservativeGC : GC | |||
| } | |||
|
|
|||
| gcx.log_malloc(p, size); | |||
There was a problem hiding this comment.
In some distant future it might be good to gcx.log_malloc alloc_size too...
| extern (C) void* _d_arrayliteralTX(const TypeInfo ti, size_t length); | ||
| extern (C) void* _d_assocarrayliteralTX(const TypeInfo_AssociativeArray ti, void[] keys, void[] vals); | ||
| mixin(generateTraceWrappers()); | ||
| //pragma(msg, generateTraceWrappers()); |
There was a problem hiding this comment.
I guess commented pragmas should be gone.
| accumulate(file, line, funcname, "closure", sz); | ||
| return _d_allocmemory(sz); | ||
| void foo(int x, double y) { } | ||
| static assert (Arguments!foo == "x, y, "); |
There was a problem hiding this comment.
OK, I got completely lost with all the code generation. I will let someone else more used to black magic to review it.
|
Anyone interested in having a look at this? Maybe @MartinNowak ? |
|
FYI: current state is "works like a charm for me but I haven't found time to rework how test suite runs so that it can pass on all platforms with more extensive tests added" |
|
I think this is quite important, the current profiling information given by the GC is pretty much garbage, and giving wrong information could be more dangerous than not giving any information at all. So any help here is very appreciated. Also, besides the currently not ideal testing situation and failure on some platforms, it would be nice to know if the general solution looks good for other druntime maintainers. ping @dlang/team-druntime |
|
High level design seems to reasonable to me. |
|
I just had a look at the results for FreeBSD 32 and it's just an ordering issue. I guess something is wrong (or incomplete) with b64ac1f. According to the auto-tester, the BSD 32 results are: And the one in the test case: Neither seem to be sorted by "name" for the 16 "allocated" values. |
|
Just for the records, the correct order should be: |
|
Having a quick look, But then I saw a pattern, so I think we have 2 problems here:
|
|
OK, checking:
|
|
Instead of relying on the rather inaccurate callbacks by the compiler (e.g. anything allocated in the precompiled runtime will not be detected), I recently used a proxy GC that just forwards the interface calls to the actual GC. It records allocation information in malloc/calloc/qalloc calls (including the call stack which can then be used to determine the call location). This also avoids modifing the GC interface and implementation. |
But can you really know exactly how much you allocated at that level of abstraction? How would you do that, just querying the GC to know what's the real capacity of the reserved block? |
I guess you are referring to the difference between requested allocation size, returned size and actually used size, e.g. in array code in rt.lifetime. With intercepting the GC interface, you know the requested and allocated memory precisely (using qalloc), but can only estimate how much of it is in actual use. The array interface functions yield some other information, but without knowing the exact implementation in rt.lifetime, calculating memory usage is quite some guesswork. For example, how do you interpret setlength? Will it allocate new memory, extend existing memory without allocation, or will it do nothing? How much memory will a new allocation actually use? |
|
For our purpose it does not matter how much exactly is used - only impact it makes on total memory usage by GC. With current implementation Uncoincidentally, |
|
I kind of like the idea of using proxy GC instead of compiler support but don't want to lose type information current approach provides. Having to investigate backtrace for each logged allocation is a lot of work. |
The TypeInfo parameter is supposed to give you that information, though it might not always be passed correctly now, e.g. when extending a block. The precise GC fixes that. |
7047357 to
9ec72d9
Compare
|
Huh, green |
9ec72d9 to
46aedf2
Compare
|
Ping. Any review/requests for this one? It is good to go from my side in its basic form. |
Intermediate step for generating tracegc handlers automatically intended to simplify reviewing by reducing size of each individual diff.
Replaces imprecise/wrong calculation based exclusively on checking TypeInfo - now the data is retrieved directly from the GC and represents real amount of bytes allocated. Fix issue https://issues.dlang.org/show_bug.cgi?id=17294 Fix issue https://issues.dlang.org/show_bug.cgi?id=16280 Fix issue https://issues.dlang.org/show_bug.cgi?id=15481
Ordering by name as last measure means order of lines in the generate trace will be reproducible, allowing for easy test cases.
Moves code snippet which refers to various language features that allocate to actual test to ensure better functionality coverage.
f439ea0 to
e1a6984
Compare
Causes closure allocation despite being marked as scope
|
Right now biggest performance problem comes from how crazy expensive calling |
leandro-lucarella-sociomantic
left a comment
There was a problem hiding this comment.
LGTM despite the few comments.
| extern (C) void profilegc_setlogfilename(string name) | ||
| { | ||
| logfilename = name; | ||
| logfilename = name ~ "\0"; |
There was a problem hiding this comment.
This looks unrelated to this commit.
| { | ||
| qsort(counts.ptr, counts.length, Result.sizeof, &Result.qsort_cmp); | ||
|
|
||
| FILE* fp = logfilename.length == 0 ? stdout : fopen((logfilename ~ '\0').ptr, "w"); |
There was a problem hiding this comment.
Oh, right, is because of this. I would maybe make the commit message a bit more verbose that this commit is about avoid using language features that indirectly use the GC, not about removing some explicit recursive usage.
| { | ||
| if (sz == 0) | ||
| return; | ||
|
|
There was a problem hiding this comment.
This is just an optimization, right? Also some more verbose commit message could help.
Maybe do that in a separate PR to avoid this one to get stalled? Specially because that change has to be done very carefully, as one mistake might trigger wrong results and it's very hard to track/find out (speaking from experience). |
|
ping @DmitryOlshansky @MartinNowak @andralex @rainers @wilzbach Can you merge or give us some feedback if anything needs to be changed? |
|
I would like to expand this PR much further to get a lot of other stats that are already kept by the GG (at least when profiling is enabled). Since keeping some stats might be a bit expensive, in particular timing stats (we need to overcome not having Conceptually, I think New (and this PR introduces Ideally we should be able to enable, disable and reset individual stats (and reset to a specific value rather than zero might be useful in some cases), so this can be done in a "only pay for what you need" style. But if it's too complicated, I guess we can also have a couple of categories (like "allocation" or "collection" or "counters" or "timing") or levels of detail (like "shallow" to only get the very basic stuff, like total collection time, and "deep" to get more fine-grained stats like "markTime" and other intermediate timings). It might be useful to discuss this briefly at DConf to reduce the back and forth. |
andralex
left a comment
There was a problem hiding this comment.
@leandro-lucarella-sociomantic LMK when this is ready - we can pull it now and expand it later.
A simple way to enable/disable collection of specific stats is to write -1 or some invalid value to the values that are not of interest, i.e. if numReallocs is set to -1 the reallocations won't be counted. This is of course of interest mostly for the more costly stats.
| __gshared long lockTime; | ||
|
|
||
| // thread-local counter | ||
| size_t bytesAllocated; |
There was a problem hiding this comment.
I think ulong is better here; size_t is for sizes of objects in memory, whereas bytesAllocated is a tally. It could go over 4 GB on a 32-bit system.
There was a problem hiding this comment.
I just copy&pasted what the GC is already currently gathering (when some profiling options are included). But yeah, we could review how these are being stored too (there is also the inconsistency of some timing being long and some Duration).
|
This has been superseded by #2607 |
Another attempt for #1806, now using
thread-local counter inside GC to circumvent sync issues.