Skip to content

Comments

fix Issue 21114 - core.exception.AssertError@std/socket.d(1004): Asse…#7579

Closed
WalterBright wants to merge 1 commit intodlang:masterfrom
WalterBright:fix21114
Closed

fix Issue 21114 - core.exception.AssertError@std/socket.d(1004): Asse…#7579
WalterBright wants to merge 1 commit intodlang:masterfrom
WalterBright:fix21114

Conversation

@WalterBright
Copy link
Member

…rtion failure

Using asserts to detect environmental errors is just wrong.

@dlang-bot
Copy link
Contributor

Thanks for your pull request, @WalterBright!

Bugzilla references

Auto-close Bugzilla Severity Description
21114 normal core.exception.AssertError@std/socket.d(1004): Assertion failure

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#7579"

@Geod24
Copy link
Member

Geod24 commented Aug 5, 2020

Using asserts to detect environmental errors is just wrong.

unittest that do IO are just wrong

@CyberShadow
Copy link
Member

This is already in a softUnittest block.

@CyberShadow
Copy link
Member

What does this PR fix? The error is informational and does not cause the test suite to fail.

Copy link
Member

@CyberShadow CyberShadow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this is supposed to fix, but if the problem is harmless but "scary" messages in the CI logs, I think a better solution would be to disable softUnittest blocks when running in CI, but leave them on when users run tests individually.

@WalterBright
Copy link
Member Author

What does this PR fix? The error is informational and does not cause the test suite to fail.

Because when I get a thousand line log file that displays in a 20 line window with the last line that says "failed somewhere in the last thousand lines", and I scroll back in that miserably tiny window, I run into these seg fault traces with "don't mind me, I'm not important" and I wonder if they are the actual failure or not. I'm really tired of useless, misleading log files.

Assert failures for PROGRAMMING BUGS and NOT ENVIRONMENTAL ISSUES.

You can change it another way if you prefer, but the only reason a benign stack trace should ever appear in a log file is if the test is testing the stack trace. A stack trace is not informational at all for environmental errors, and has no place in the log file.

@CyberShadow
Copy link
Member

Assert failures for PROGRAMMING BUGS and NOT ENVIRONMENTAL ISSUES.

The assert is correctly testing the module's code.. it's just that, unfortunately, there is no easy way to do this test without relying on the environment (in this case, access to the public DNS being set up and working). In principle, it is no different than std.file using real files on disk to test its code, except that generally we can expect the host OS's filesystem to work as expected.

You can change it another way if you prefer, but the only reason a benign stack trace should ever appear in a log file is if the test is testing the stack trace. A stack trace is not informational at all for environmental errors, and has no place in the log file.

I don't think there is a general solution to this problem. There will always be a class of tests where we can't reliably programmatically know if the test is failing because there is a problem with the environment that the test is running on, or if someone introduced a regression.

Instead of getting angry and typing in all caps, why not put that energy into adjusting this change as I suggested?

An alternative would be to hide the stack trace (i.e. write .message instead of .toString()), if that's the issue that you find is in need of improvement.

@WalterBright
Copy link
Member Author

I tried to link to that hated log file, but now that environmental "informational" error didn't happen. Please, all random heisenbugs need to be removed from the test suite. There are so many tests being run now that have heisenbugs its becoming a miracle to get a PR to pass them all. Rerunning the test suite and it just fails again with another heisenbug in another test.

The current heisenbug is here: https://app.circleci.com/pipelines/github/dlang/dmd/9002/workflows/994d8911-0c38-40e5-a3e2-3615caccd480/jobs/36578

Patiently scrolling back in that miserable window eventually leads one to this seg fault:

 ... runnable/test18534.d           -O  -fPIC ()
test_results/runner(_D4core7runtime18runModuleUnitTestsUZ19unittestSegvHandlerUNbiPSQCk3sys5posix6signal9siginfo_tPvZv+0x55)[0x55918596f371]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f73875638a0]
test_results/runner(_ZN12TypeFunction9isnothrowEb+0x1a)[0x559185834f2a]
test_results/runner(_ZN12TypeFunction10syntaxCopyEv+0xa2)[0x559185832f02]
test_results/runner(_ZN22DsymbolSemanticVisitor23funcDeclarationSemanticEP15FuncDeclaration+0x3dd)[0x5591857d4601]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP15FuncDeclaration+0x1d)[0x5591857d7021]
test_results/runner(_ZN15FuncDeclaration6acceptEP7Visitor+0x22)[0x55918580f64a]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_ZN22DsymbolSemanticVisitor14attribSemanticEP17AttribDeclaration+0xab)[0x5591857cfce7]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP17AttribDeclaration+0x1d)[0x5591857cfd4d]
test_results/runner(_ZN16ParseTimeVisitorI10ASTCodegenE5visitEP15LinkDeclaration+0x1f)[0x5591857f5ca7]
test_results/runner(_ZN15LinkDeclaration6acceptEP7Visitor+0x22)[0x5591857df946]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_ZN22DsymbolSemanticVisitor14attribSemanticEP17AttribDeclaration+0xab)[0x5591857cfce7]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP17AttribDeclaration+0x1d)[0x5591857cfd4d]
test_results/runner(_ZN16ParseTimeVisitorI10ASTCodegenE5visitEP15ProtDeclaration+0x1f)[0x5591857f5d5b]
test_results/runner(_ZN15ProtDeclaration6acceptEP7Visitor+0x22)[0x5591857e0176]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_D3dmd10dsymbolsem22DsymbolSemanticVisitor5visitMRCQBx7dmodule6ModuleZ__T9__lambda2TCQDf7dsymbol7DsymbolZQBgMFQBaZv+0x21)[0x5591857d1731]
test_results/runner(_D3dmd7dsymbol14foreachDsymbolFPSQBf4root5array__T5ArrayTCQCeQCd7DsymbolZQxMDFQvZvZv+0x78)[0x559185783b04]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP6Module+0x6c)[0x5591857d16c0]
test_results/runner(_ZN6Module6acceptEP7Visitor+0x22)[0x559185777ef6]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP6Import+0x42b)[0x5591857cf3af]
test_results/runner(_ZN6Import6acceptEP7Visitor+0x1f)[0x5591858494cb]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_D3dmd10dsymbolsem22DsymbolSemanticVisitor5visitMRCQBx7dmodule6ModuleZ__T9__lambda2TCQDf7dsymbol7DsymbolZQBgMFQBaZv+0x21)[0x5591857d1731]
test_results/runner(_D3dmd7dsymbol14foreachDsymbolFPSQBf4root5array__T5ArrayTCQCeQCd7DsymbolZQxMDFQvZvZv+0x78)[0x559185783b04]
test_results/runner(_ZN22DsymbolSemanticVisitor5visitEP6Module+0x6c)[0x5591857d16c0]
test_results/runner(_ZN6Module6acceptEP7Visitor+0x22)[0x559185777ef6]
test_results/runner(_Z15dsymbolSemanticP7DsymbolP5Scope+0x42)[0x5591857cbe96]
test_results/runner(_D3dmd8frontend12fullSemanticFCQBd7dmodule6ModuleZv+0x2f)[0x559185731677]
test_results/runner(_D7support8compilesFAyaQdZxSQBa17CompilationResult+0xad)[0x5591856b17c9]
test_results/runner(_D10interfaces27check_implementations_2086117__unittest_L32_C5FZv+0x43)[0x5591856b0cdf]
test_results/runner(_D6runner14unitTestRunnerFZS4core7runtime14UnitTestResult+0x16a)[0x5591856df312]
test_results/runner(runModuleUnitTests+0xaf)[0x55918596f0df]
test_results/runner(_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv+0x26)[0x5591859452ce]
test_results/runner(_D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv+0x31)[0x559185945249]
test_results/runner(_d_run_main2+0x2ca)[0x5591859451ae]
test_results/runner(_d_run_main+0x10b)[0x559185944edb]
test_results/runner(main+0x22)[0x5591856e585e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f73869c3b97]
test_results/runner(_start+0x2a)[0x5591856b0ada]
 ... runnable/test19891.d            -fPIC ()

which makes me want to scream because the log file gives me NO FRACKIN IDEA which file it was trying to compile when it seg faulted because it runs the tests in multiple threads and randomly interleaves the results.

@CyberShadow
Copy link
Member

Well, that doesn't have anything to do with std.socket, but it seems that disabling concurrency is the obvious solution there. It would mean tests would take several times longer to run, but the first real failure will stop the test suite, so it should always be at the bottom of the log.

@WalterBright
Copy link
Member Author

Instead of getting angry and typing in all caps, why not put that energy into adjusting this change as I suggested?

A test where the results are ignored is COMPLETELY USELESS and should be deleted, as this PR does.

I'm angry because I have about 5 PRs none of which pass the test suite because of random heisenbugs that appear and disappear.

@CyberShadow
Copy link
Member

A test where the results are ignored is COMPLETELY USELESS and should be deleted, as this PR does.

No, it's definitely useful for people working on std.socket directly.

If someone changes the name resolution code, they would have no way of knowing that they broke it unless they tested it themselves.

Currently, the failure will be immediately obvious to anyone running the module unittests directly.

It would be reasonable to disable the tests in CI however, for the reasons you mention, which is exactly what I suggested above.

@WalterBright
Copy link
Member Author

Well, that doesn't have anything to do with std.socket,

It's all part of the same problem we have with tolerating heisenbugs and useless log files in the test suite.

but it seems that disabling concurrency is the obvious solution there. It would mean tests would take several times longer to run, but the first real failure will stop the test suite, so it should always be at the bottom of the log.

Even better would be to fix the test runner to not interleave the results, or at least when a process fails it adds to the stack trace what file it was trying to work on.

@CyberShadow
Copy link
Member

Even better would be to fix the test runner to not interleave the results, or at least when a process fails it adds to the stack trace what file it was trying to work on.

This sounds doable, but only if the test runner (the program starting multiple threads/processes) is under our control (i.e. a D program and not e.g. GNU Make).

I think a better solution would be to disable softUnittest blocks when running in CI, but leave them on when users run tests individually.

@wilzbach @Geod24 Is there some environment variable or other mechanism that we can detect that the unittests are executing under CI, no matter what CI it is (CircleCI, Brad's auto-tester, etc.)?

@WalterBright
Copy link
Member Author

If someone changes the name resolution code, they would have no way of knowing that they broke it unless they tested it themselves.

Or they could do a better job of writing the test so it only asserts if no environmental errors were present. I.e. check the error codes returned by the network functions that are called.

@CyberShadow
Copy link
Member

Or they could do a better job of writing the test so it only asserts if no environmental errors were present. I.e. check the error codes returned by the network functions that are called.

I don't think this is possible in general, and actually this particular failure looks like one instance of that. The assertion failure is probably due to the return value of toHostNameString being null, which isn't associated with any particular error code.

@wilzbach
Copy link
Contributor

wilzbach commented Aug 5, 2020

@wilzbach @Geod24 Is there some environment variable or other mechanism that we can detect that the unittests are executing under CI, no matter what CI it is (CircleCI, Brad's auto-tester, etc.)?

No, they all use different set-ups and environment variables. We would need to set an environment variable for all CIs in question. Though it could be set in the Phobos Makefile which would allow users to locally redefine the variable.
Something like D_PHOBOS_SOFTFAIL_SHOW_STACK ?= 0.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 5, 2020

OK, then I suggest changing the message and making the stack trace opt-in. I will submit a PR in a few minutes.

@WalterBright
Copy link
Member Author

I don't think this is possible in general,

Network errors can't be detected separately from programming bugs? That should not be the case.

and actually this particular failure looks like one instance of that. The assertion failure is probably due to the return value of toHostNameString being null, which isn't associated with any particular error code.

toHostNameString is implemented in socket.d, so it's probably ignoring the error code, which would be a buggy design of socket.d.

@CyberShadow
Copy link
Member

CyberShadow commented Aug 5, 2020

toHostNameString is implemented in socket.d, so it's probably ignoring the error code, which would be a buggy design of socket.d.

I would need access to a machine on which the failure is reproducible, but what's probably happening is that the configured DNS server for that machine, or its upstream server if it is resolving recursively, is unable to resolve PTR requests, and is returning an empty response. The implementation is definitely checking the error code (see Address.toHostString).

@WalterBright
Copy link
Member Author

I looked at toHostString and it is definitely returning null on error, although it fails to document this. It isn't clear if it may return null for other reasons. It also throws for other undocumented reasons. What its argument numeric is for is undocumented.

But it's at least private so an attempt can be made to make it robust.

@CyberShadow
Copy link
Member

I looked at toHostString and it is definitely returning null on error, although it fails to document this. It isn't clear if it may return null for other reasons. It also throws for other undocumented reasons. What its argument numeric is for is undocumented.

The documentation is in the public functions which call toHostString, which is implementation detail.

@WalterBright
Copy link
Member Author

The documentation is in the public functions which call toHostString

Partially true. toAddrString omits mention of the null return.

In any case, I'll amend the PR to check for a null return, and not assert if it is null.

@WalterBright
Copy link
Member Author

Eh, I take that back. toAddrString never returns null.

@CyberShadow
Copy link
Member

I will submit a PR in a few minutes.

#7580

In any case, I'll amend the PR to check for a null return, and not assert if it is null.

I think that will weaken the test, and #7580 is a better (more general) improvement.

@CyberShadow
Copy link
Member

Partially true. toAddrString omits mention of the null return.

toAddrString never returns null.

@WalterBright
Copy link
Member Author

Closing in favor of #7580

Thank you, @CyberShadow

@WalterBright WalterBright deleted the fix21114 branch August 5, 2020 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants