DO NOT MERGE Add mechanism to install memory error handler for finding segfaults in compiler#11574
DO NOT MERGE Add mechanism to install memory error handler for finding segfaults in compiler#11574schveiguy wants to merge 1 commit intodlang:masterfrom
Conversation
src/dmd/mars.d
Outdated
| { | ||
| import core.runtime; | ||
| import core.memory; | ||
| // install the memory error handler when the environment variable is set |
There was a problem hiding this comment.
src/dmd/mars.d:908:16: error: module memoryerror is in file 'etc/linux/memoryerror.d' which cannot be read
908 | import etc.linux.memoryerror;
| ^
import path[0] = /usr/lib/gcc/x86_64-linux-gnu/9/include/d
import path[1] = /home/runner/dmd/src
As LDC or GDC might be host compilers too, I would simply use version (DigitalMars).
There was a problem hiding this comment.
If we want to merge this to test further, I will. For now, just trying to see if I can duplicate the problem on this PR.
There was a problem hiding this comment.
- AFAICT the problem appears only with
-lowmem - this should fix the problem Make -lowmem optional and prefer building d_do_test with the host compiler #11519
I still believe there's merit of keeping the memory error registration enabled as it will likely help with other hard-to-replicate bugs that appear on the CI machines.
ea039e7 to
e3130ec
Compare
|
ok @wilzbach I saw DMD master failed with the core dump. I have deprecated like 10 jobs to try and get it to happen on this PR, but we aren't going to catch it unless it's done for every test. We can revert this if people find objection to it, but I think it's a reasonable thing to include for testing. I fixed the LDC/GDC issue. |
|
Thanks for your pull request, @schveiguy! Bugzilla referencesYour PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog. Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + dmd#11574" |
|
I fear this will need a little explaining. |
No, it just allows us to see the segfault. You can read up more in the documentation of the memory error handler. |
|
@UplinkCoder if we are lucky, the segfault will print a stack trace. |
MoonlightSentinel
left a comment
There was a problem hiding this comment.
LGTM but please update the commit title & description
e3130ec to
7017bc0
Compare
|
@MoonlightSentinel done (I think this is what you meant?) |
|
Got a hit! Anyone know what this stack trace means? Unfortunately no line numbers (I think because this is the release build). |
dmd is build with |
|
Ugh, it's not a null pointer, so this is a memory corruption or dangling pointer issue. |
|
I suppose the next step is to instrument that function |
|
Well, I think #11519 (comment) should have fixed this particular issue, so I am not sure whether you'll be able to replicate it. However, it looks like the GC (-lowmem) is cleaning up memory that is still in used and that will be hard to debug remotely. AFAICT we would be more effective by trying to replicate it locally. |
|
Note that the auto-tester prefers the host dmd now that #11519 was merged. So you'll probably need to revert this change in this/another PR to replicate the failure. Also DMD needs to be build with debug information to actually make use the stack trace. @wilzbach Maybe debug information should be enabled by default on the CI's? (esp. as a complement to the pending |
Yeah, that's probably true. The other possibility is that there is some stack data that is being returned. However, from experience with memory errors, I suspect that reproducing it locally is going to be really challenging. And even if you do reproduce it locally, even in a debugger, it may be too late to see what the cause is. Using the host compiler means we likely don't hit this error any more. So I guess that's a solution. But I can rebase this and leave the PR open (It should keep testing and maybe we hit this bug again). I will try to see if using the debug build can cause the failure. |
Debug information should already always be present on the CIs... |
All CIs do rebases automatically. You would need to add a revert of the host compiler change PR. |
The auto-tester builds with |
7017bc0 to
7286cf7
Compare
|
No clue ATM but you could keep the release build and manually enable debug information instead: diff --git a/src/posix.mak b/src/posix.mak
index 2c309b00c..6840e83ba 100644
--- a/src/posix.mak
+++ b/src/posix.mak
@@ -99,7 +99,7 @@ else
HOST_DMD_RUN=$(HOST_DMD) -conf=$(dir $(HOST_DMD))dmd.conf
endif
-RUN_BUILD = $(GENERATED)/build HOST_DMD="$(HOST_DMD)" CXX="$(HOST_CXX)" OS=$(OS) BUILD=$(BUILD) MODEL=$(MODEL) AUTO_BOOTSTRAP="$(AUTO_BOOTSTRAP)" DOCDIR="$(DOCDIR)" STDDOC="$(STDDOC)" DOC_OUTPUT_DIR="$(DOC_OUTPUT_DIR)" MAKE="$(MAKE)" --called-from-make
+RUN_BUILD = $(GENERATED)/build HOST_DMD="$(HOST_DMD)" CXX="$(HOST_CXX)" OS=$(OS) BUILD=$(BUILD) MODEL=$(MODEL) AUTO_BOOTSTRAP="$(AUTO_BOOTSTRAP)" DOCDIR="$(DOCDIR)" STDDOC="$(STDDOC)" DOC_OUTPUT_DIR="$(DOC_OUTPUT_DIR)" MAKE="$(MAKE)" DFLAGS="$(DFLAGS) -g" --called-from-make
######## Begin build targets |
76d6a7f to
367f578
Compare
|
OK, I did that. You think this will work? I suppose actually debug info is more important than debug build -- changing to debug build might affect whether the segfault happens. |
|
Yes, the last few FreeBSD runs segfaulted using the May need to overwrite |
|
By "still work" I meant, will it now print line numbers ;) |
9416911 to
ed2f553
Compare
|
EDIT: To many tabs |
|
And I'm waiting for my latest change to be tested, which I hope works... |
ed2f553 to
38d61da
Compare
|
OK, I think this is back to actually working how it should. Now we just need another coredump... |
38d61da to
f287673
Compare
|
Rebased it again (removed |
|
@MoonlightSentinel thanks! |
|
Another hit, this time with line numbers, and on 32-bit linux. |
|
For future reference, this was the code at the time: https://github.com/dlang/dmd/tree/4d2ee79c7c4d644875993600af88d340b030d437/src/dmd |
|
TypeFunction is a final class, so no virtual calls. It really does look like a GC issue. |
|
How about merging the memory handler s.t. we can use it for the CI? (now that we know the details of that crash). If so, I would like to see some minor changes:
Maybe move the registration into a dedicated function- |
|
Yeah, ok. I still want to find the original crash. But I think this would require things like memory sentinels (which I'm sure don't work out of the box here). I'll create a separate PR. |
|
Another Linux 32 failure happened. |
f287673 to
76da384
Compare
76da384 to
34dcd3f
Compare
|
@schveiguy What is the status of this PR? I see that #11646 was merged, so is this PR still needed? |
|
The point of this was to be able to get information from the only environment where DMD seems to crash -- the auto tester. This has been green for a while, and looking at the freebsd boxes (where the crash mostly happened), they have pretty good records at the moment. I'll close this, we can always do it again if it starts becoming an issue. |
|
@schveiguy Thanks for the prompt response! |
This is simply a test. It should run the specified test with the memory error signal handler enabled.Now going to try and instrument all Linux tests to catch the segfault error. After we find it we can move the d_do_test back to a normal run.