Interrupt agent within pytest fixture if it crashes during testing#401
Merged
BrianJKoopman merged 6 commits intomainfrom Aug 29, 2024
Merged
Interrupt agent within pytest fixture if it crashes during testing#401BrianJKoopman merged 6 commits intomainfrom
BrianJKoopman merged 6 commits intomainfrom
Conversation
Without this the output from the agent can get stuck in the buffer and hidden from the user when raise_subprocess() is called, returning just an empty string.
This returns stdout/stderr as strings rather than byte-strings, making the formatting nicer when we print.
Note this does not yet print output from stdout/stderr, but does successfully abort tests if the agent hangs.
6 tasks
Member
Author
|
@jlashner Alright, so I tried replacing the I think this is probably fine, so I'll update and push a commit with this change. |
This better handles the case of agents that run into errors on shutdown and hang.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This adds an interrupt to the agent in the agent runner pytest fixture in the event that the agent crashes and hangs during testing. It will also display the
stdout/stderrfrom the agent when this happens.To do this we also refactor most of the contents of
create_agent_runner_fixture()into theAgentRunnerclass so we can keep track of some things with some class attributes.I tried to present a clean git history. I tested at each commit here the functionality of the "early exit" path and once this timeout was added, the timeout against the test that I knew was hanging as described below. I also ran all ocs and socs tests from each
mainbranch to make sure I wasn't breaking anything.(I've worked through this a couple times now, it was kind of tricky to debug and make sure I wasn't breaking anything. A few things I want to mention that were subtle:
stdout/stderrprinting didn't really seem to work for me without the first commit here disabling output buffering. So thatpython -uis important.proc.stdout.read()method withinraise_subprocess()to getstdout, but would run into an I/O error trying to read from a closed file descriptor. There's a warning belowPopen.stdinabout strictly usingcommunicate()to interact with the process -- for a different reason, but still, one of the iterations of my work on this I restructured things to get rid of the existingstdout.read()calls. I might still do this, but in a future PR.)Motivation and Context
Resolves #398.
For some more context, we were seeing hanging agents cause all testing to hang. In GitHub Actions this means eventually hitting the 6 hour timeout. This is difficult to debug, especially since the agent logs aren't visible. This is a step in the direction of making agent logs visible, while hopefully preventing more of these agent hangs.
How Has This Been Tested?
Through lots of manual testing.
Testing the "Early Exit" Functionality
I wanted to make sure I preserved the existing functionality. I tested the "early exit" path by adding an invalid parameter to the
ocs/tests/default.yamlfor theFakeDataAgentcalled--please-crash. This causes the agent to crash immediately on startup, and we see a test failure and the agent output. Here's an example:Testing the Timeout on a Crashed Agent
I also tested in this socs branch, which is just a checkout of the working branch at the time that #398 was logged. This produces a crash in the agent on several tests, both of which we should discuss. The first is a crash that this PR can handle, here's example output:
We see the agent output, and the extra Exception that we raised related to the agent timeout. This is the functionality this PR is trying to add. Great!
Issues with Errors in Deferreds
However, there's another situation in this branch that I don't know how to handle currently, and that's when there is an unhandled error in a deferred within the task being tested. This happens in the
test_hwp_rotation_get_statetest. If we run:then we see the test start:
However, in the agent logs to disk we see (starting at the crash -- the rest is just normal agent startup and command execution for this task):
This has to do with the
Deferredfromaction.deferred:https://github.com/simonsobs/socs/blob/cf89f1f32405f350d6f1bf557c80602363cc22b9/socs/agents/hwp_pid/agent.py#L315-L317
I can add an
Errbackand this always exits. (It very rarely exits cleanly without, just to mention.) I just don't know how to exit the agent in this event at the moment. Open to suggestions.Agent Not Terminating
I don't really know how to reproduce the behavior of the agent not exiting on SIGINT, so I didn't test this. I didn't really touch that code though, so probably ok? Let me know if you know how to test this.
Types of changes
Checklist: