Interrupt agent within pytest fixture if it crashes during testing by BrianJKoopman · Pull Request #401 · simonsobs/ocs

BrianJKoopman · 2024-08-22T22:04:36Z

Description

This adds an interrupt to the agent in the agent runner pytest fixture in the event that the agent crashes and hangs during testing. It will also display the stdout/stderr from the agent when this happens.

To do this we also refactor most of the contents of create_agent_runner_fixture() into the AgentRunner class so we can keep track of some things with some class attributes.

I tried to present a clean git history. I tested at each commit here the functionality of the "early exit" path and once this timeout was added, the timeout against the test that I knew was hanging as described below. I also ran all ocs and socs tests from each main branch to make sure I wasn't breaking anything.

(I've worked through this a couple times now, it was kind of tricky to debug and make sure I wasn't breaking anything. A few things I want to mention that were subtle:

Existing stdout/stderr printing didn't really seem to work for me without the first commit here disabling output buffering. So that python -u is important.
I tried using the existing proc.stdout.read() method within raise_subprocess() to get stdout, but would run into an I/O error trying to read from a closed file descriptor. There's a warning below Popen.stdin about strictly using communicate() to interact with the process -- for a different reason, but still, one of the iterations of my work on this I restructured things to get rid of the existing stdout.read() calls. I might still do this, but in a future PR.)

Motivation and Context

Resolves #398.

For some more context, we were seeing hanging agents cause all testing to hang. In GitHub Actions this means eventually hitting the 6 hour timeout. This is difficult to debug, especially since the agent logs aren't visible. This is a step in the direction of making agent logs visible, while hopefully preventing more of these agent hangs.

How Has This Been Tested?

Through lots of manual testing.

Testing the "Early Exit" Functionality

I wanted to make sure I preserved the existing functionality. I tested the "early exit" path by adding an invalid parameter to the ocs/tests/default.yaml for the FakeDataAgent called --please-crash. This causes the agent to crash immediately on startup, and we see a test failure and the agent output. Here's an example:

$ python3 -m pytest --cov --cov-report term integration/test_fake_data_agent_integration.py -k 'heartbeat'
===================================================================== test session starts =====================================================================
platform linux -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/koopman/git/ocs/tests
configfile: pytest.ini
plugins: cov-5.0.0, docker-3.1.1, twisted-1.14.2
collected 4 items / 3 deselected / 1 selected

integration/test_fake_data_agent_integration.py E                                                                                                       [100%]

=========================================================================== ERRORS ============================================================================
____________________________________________________ ERROR at setup of test_fake_data_agent_set_heartbeat _____________________________________________________

cov = <coverage.control.Coverage object at 0x7665de1dbf10>

    @pytest.fixture()
    def run_agent(cov):
        runner = _AgentRunner(agent_path, agent_name, args)
>       runner.run(timeout=timeout)

../ocs/testing.py:131:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../ocs/testing.py:77: in run
    self._raise_subprocess(f"Agent failed to startup, cmd: {self.cmd}")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ocs.testing._AgentRunner object at 0x7665dc606e50>
msg = "Agent failed to startup, cmd: ['python', '-u', '-m', 'coverage', 'run', '--rcfile=./.coveragerc', '../ocs/agents/fake_data/agent.py', '--site-file', './default.yaml']"

    def _raise_subprocess(self, msg):
        stdout, stderr = self.proc.stdout.read(), self.proc.stderr.read()
        print(f'Here is stdout from {self.agent_name}:\n{stdout}')
        print(f'Here is stderr from {self.agent_name}:\n{stderr}')
>       raise RuntimeError(msg)
E       RuntimeError: Agent failed to startup, cmd: ['python', '-u', '-m', 'coverage', 'run', '--rcfile=./.coveragerc', '../ocs/agents/fake_data/agent.py', '--site-file', './default.yaml']

../ocs/testing.py:87: RuntimeError
-------------------------------------------------------------------- Captured stdout setup --------------------------------------------------------------------
Crossbar server not online yet, waiting 5 seconds.
Crossbar server online.
Here is stdout from fake_data:
2024-08-22T17:40:33-0400 usage: agent.py [-h] [--mode {idle,acq}] [--num-channels NUM_CHANNELS]
2024-08-22T17:40:33-0400                 [--sample-rate SAMPLE_RATE] [--frame-length FRAME_LENGTH]
2024-08-22T17:40:33-0400                 [--site SITE] [--site-file SITE_FILE] [--site-host SITE_HOST]
2024-08-22T17:40:33-0400                 [--site-hub SITE_HUB] [--site-http SITE_HTTP]
2024-08-22T17:40:33-0400                 [--site-realm SITE_REALM] [--instance-id INSTANCE_ID]
2024-08-22T17:40:33-0400                 [--address-root ADDRESS_ROOT]
2024-08-22T17:40:33-0400                 [--registry-address REGISTRY_ADDRESS] [--log-dir LOG_DIR]
2024-08-22T17:40:33-0400                 [--working-dir WORKING_DIR]
2024-08-22T17:40:33-0400                 [--crossbar-timeout CROSSBAR_TIMEOUT]
2024-08-22T17:40:33-0400 agent.py: error: unrecognized arguments: --please-crash

Here is stderr from fake_data:


---------- coverage: platform linux, python 3.11.9-final-0 -----------
Name                                                             Stmts   Miss  Cover
------------------------------------------------------------------------------------
/home/koopman/git/ocs/ocs/__init__.py                                4      0   100%
/home/koopman/git/ocs/ocs/agent_cli.py                              87     87     0%
/home/koopman/git/ocs/ocs/agents/aggregator/agent.py                70     70     0%
/home/koopman/git/ocs/ocs/agents/aggregator/drivers.py             284    284     0%
/home/koopman/git/ocs/ocs/agents/barebones/agent.py                 74     74     0%
/home/koopman/git/ocs/ocs/agents/fake_data/agent.py                161    115    29%
/home/koopman/git/ocs/ocs/agents/host_manager/agent.py             322    322     0%
/home/koopman/git/ocs/ocs/agents/host_manager/drivers.py           177    177     0%
/home/koopman/git/ocs/ocs/agents/influxdb_publisher/agent.py        67     67     0%
/home/koopman/git/ocs/ocs/agents/influxdb_publisher/drivers.py     104    104     0%
/home/koopman/git/ocs/ocs/agents/ocs_plugin_standard.py              5      5     0%
/home/koopman/git/ocs/ocs/agents/registry/agent.py                 110    110     0%
/home/koopman/git/ocs/ocs/base.py                                   17      0   100%
/home/koopman/git/ocs/ocs/checkdata.py                             123    123     0%
/home/koopman/git/ocs/ocs/client_cli.py                            149    149     0%
/home/koopman/git/ocs/ocs/client_http.py                            38     38     0%
/home/koopman/git/ocs/ocs/client_t.py                               64     64     0%
/home/koopman/git/ocs/ocs/matched_client.py                          5      5     0%
/home/koopman/git/ocs/ocs/ocs_agent.py                             586    586     0%
/home/koopman/git/ocs/ocs/ocs_client.py                             91     79    13%
/home/koopman/git/ocs/ocs/ocs_feed.py                              128    128     0%
/home/koopman/git/ocs/ocs/ocs_systemd.py                            74     74     0%
/home/koopman/git/ocs/ocs/ocs_twisted.py                            69     69     0%
/home/koopman/git/ocs/ocs/ocsbow.py                                687    687     0%
/home/koopman/git/ocs/ocs/plugin.py                                  2      2     0%
/home/koopman/git/ocs/ocs/rename.py                                 49     49     0%
/home/koopman/git/ocs/ocs/site_config.py                           284    236    17%
/home/koopman/git/ocs/ocs/testing.py                                93     34    63%
------------------------------------------------------------------------------------
TOTAL                                                             3924   3738     5%

7 empty files skipped.

=================================================================== short test summary info ===================================================================
ERROR integration/test_fake_data_agent_integration.py::test_fake_data_agent_set_heartbeat - RuntimeError: Agent failed to startup, cmd: ['python', '-u', '-m', 'coverage', 'run', '--rcfile=./.coveragerc', '../ocs/agents/fake_data/agent.py', '--sit...
=============================================================== 3 deselected, 1 error in 17.45s ===============================================================

Testing the Timeout on a Crashed Agent

I also tested in this socs branch, which is just a checkout of the working branch at the time that #398 was logged. This produces a crash in the agent on several tests, both of which we should discuss. The first is a crash that this PR can handle, here's example output:

$ python -m pytest --cov --cov-report html integration/test_hwp_pid_agent_integration.py -k 'direction'
===================================================================== test session starts =====================================================================
platform linux -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/koopman/git/socs/tests
configfile: pytest.ini
plugins: cov-5.0.0, dependency-0.6.0, docker-3.1.1, twisted-1.14.2, order-1.2.1
collected 8 items / 7 deselected / 1 selected

integration/test_hwp_pid_agent_integration.py FE                                                                                                        [100%]

=========================================================================== ERRORS ============================================================================
____________________________________________________ ERROR at teardown of test_hwp_rotation_set_direction _____________________________________________________

cov = <coverage.control.Coverage object at 0x739e2cab8e10>

    @pytest.fixture()
    def run_agent(cov):
        runner = _AgentRunner(agent_path, agent_name, args)
        runner.run(timeout=timeout)

        yield

>       runner.shutdown()

../../ocs/ocs/testing.py:135:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ocs.testing._AgentRunner object at 0x739e2c5138d0>

    def shutdown(self):
        """Shutdown the agent process.

        If the agent does not respond to a ``SIGINT`` then output is printed,
        and an exception raised.

        """
        # avoid sending SIGINT twice
        if not self._timedout:
            self.proc.send_signal(signal.SIGINT)
        self._timer.cancel()

        try:
            self.proc.communicate(timeout=SIGINT_TIMEOUT)
        except subprocess.TimeoutExpired:
            self._raise_subprocess('Agent did not terminate within '
                                   f'{SIGINT_TIMEOUT} seconds on SIGINT.')

        if self._timedout:
            stdout, stderr = self.proc.communicate(timeout=SIGINT_TIMEOUT)
            print(f'Here is stdout from {self.agent_name}:\n{stdout}')
            print(f'Here is stderr from {self.agent_name}:\n{stderr}')
>           raise RuntimeError('Agent timed out.')
E           RuntimeError: Agent timed out.

../../ocs/ocs/testing.py:111: RuntimeError
-------------------------------------------------------------------- Captured stdout setup --------------------------------------------------------------------
Crossbar server not online yet, waiting 5 seconds.
Crossbar server online.
/home/koopman/git/socs/tests
-------------------------------------------------------------------- Captured stderr setup --------------------------------------------------------------------
HWP: Starting update thread
PID: Device emulator waiting for tcp client connection
PID: Client connection made from ('127.0.0.1', 56186)
PID: msg='*W02400000'
PID: Setting direction: reverse
PID: response='asdfl'
PID: msg='*X01'
PID: response='X010.000'
--------------------------------------------------------------------- Captured log setup ----------------------------------------------------------------------
INFO     HWP:hwp_emulator.py:103 Starting update thread
INFO     PID:device_emulator.py:256 Device emulator waiting for tcp client connection
INFO     PID:device_emulator.py:258 Client connection made from ('127.0.0.1', 56186)
DEBUG    PID:device_emulator.py:282 msg='*W02400000'
INFO     PID:hwp_emulator.py:178 Setting direction: reverse
DEBUG    PID:device_emulator.py:293 response='asdfl'
DEBUG    PID:device_emulator.py:282 msg='*X01'
DEBUG    PID:device_emulator.py:293 response='X010.000'
-------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------
PID: msg='*R01'
PID: response='R010000'
PID: msg='*R02'
PID: response='0'
---------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------
DEBUG    PID:device_emulator.py:282 msg='*R01'
DEBUG    PID:device_emulator.py:293 response='R010000'
DEBUG    PID:device_emulator.py:282 msg='*R02'
DEBUG    PID:device_emulator.py:293 response='0'
------------------------------------------------------------------ Captured stdout teardown -------------------------------------------------------------------
Here is stdout from hwp_pid_agent:
2024-08-22T17:44:04-0400 Using OCS version 0.11.1+7.gdc886b4.dirty
2024-08-22T17:44:04-0400 Cannot use relative log_dir without explicit working_dir.
2024-08-22T17:44:04-0400 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ observatory.hwp-pid
2024-08-22T17:44:04-0400 log_file is apparently ./logs//observatory.hwp-pid.log
2024-08-22T17:44:04-0400 --mode agrument is deprecated.
2024-08-22T17:44:04-0400 transport connected
2024-08-22T17:44:04-0400 session joined: {'authextra': {'x_cb_node': 'f536ddd40539-1',
               'x_cb_peer': 'tcp4:172.19.0.1:40074',
               'x_cb_pid': 22,
               'x_cb_worker': 'worker001'},
 'authid': 'T65S-WJMN-7LRK-M5GT-STYM-7T7T',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'cbor.batched',
 'session': 7723584651398491,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 160161,
               'own_tid': 160161,
               'peer': 'tcp4:127.0.0.1:18001',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-08-22T17:44:04-0400 startup-op: launching main
2024-08-22T17:44:04-0400 start called for main
2024-08-22T17:44:04-0400 main:0 Status is now "starting".
2024-08-22T17:44:04-0400 main:0 Status is now "running".
2024-08-22T17:44:04-0400 Forward
2024-08-22T17:44:05-0400 asdfl
2024-08-22T17:44:05-0400 Connected to PID controller
2024-08-22T17:44:05-0400 Finding CHWP Frequency
2024-08-22T17:44:05-0400 ['X010.000']
2024-08-22T17:44:05-0400 DecodedResponse(msg_type='measure', msg='Current frequency = 0.0', measure=0.0)
2024-08-22T17:44:05-0400 Finding target CHWP Frequency
2024-08-22T17:44:06-0400 ['R010000']
2024-08-22T17:44:06-0400 DecodedResponse(msg_type='read', msg='Setpoint = 0.0', measure=0.0)
2024-08-22T17:44:06-0400 Finding CHWP Direction
2024-08-22T17:44:06-0400 ['0']
2024-08-22T17:44:06-0400 0
2024-08-22T17:44:06-0400 main:0 CRASH: [Failure instance: Traceback: <class 'AttributeError'>: 'str' object has no attribute 'msg_type'
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:1045:_bootstrap_inner
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:982:run
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_threadworker.py:49:work
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:269:inContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:285:<lambda>
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:117:callWithContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:82:callWithContext
/home/koopman/git/ocs/ocs/ocs_agent.py:984:_running_wrapper
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:174:main
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:121:_get_data_and_publish
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:35:get_pid_state
/home/koopman/git/socs/socs/agents/hwp_pid/drivers/pid_controller.py:307:get_direction
]
2024-08-22T17:44:06-0400 main:0 Status is now "done".
2024-08-22T17:44:24-0400 caught SIGINT!
2024-08-22T17:44:24-0400 Stopping all running sessions
2024-08-22T17:44:24-0400 Stopping session main
2024-08-22T17:44:27-0400 stopping reactor
2024-08-22T17:44:27-0400 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
2024-08-22T17:44:27-0400 Stopping all running sessions
2024-08-22T17:44:27-0400 Stopping session main
2024-08-22T17:44:30-0400 transport disconnected
2024-08-22T17:44:30-0400 waiting for reconnection
2024-08-22T17:44:30-0400 waiting at least 9.999995708465576 more seconds before giving up
2024-08-22T17:44:30-0400 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7c59f6d5cc50> in 1.559190937396653 seconds.
2024-08-22T17:44:30-0400 Main loop terminated.

Here is stderr from hwp_pid_agent:

------------------------------------------------------------------ Captured stderr teardown -------------------------------------------------------------------
PID: Client disconnected
PID: Shutting down
-------------------------------------------------------------------- Captured log teardown --------------------------------------------------------------------
INFO     PID:device_emulator.py:264 Client disconnected
INFO     PID:device_emulator.py:272 Shutting down
========================================================================== FAILURES ===========================================================================
_______________________________________________________________ test_hwp_rotation_set_direction _______________________________________________________________

wait_for_crossbar = None, hwp_emu = <socs.testing.hwp_emulator.HWPEmulator object at 0x739e2c798210>, run_agent = None, client = OCSClient('hwp-pid')

    @pytest.mark.integtest
    def test_hwp_rotation_set_direction(wait_for_crossbar, hwp_emu, run_agent, client):
>       wait_for_main(client)

integration/test_hwp_pid_agent_integration.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
integration/test_hwp_pid_agent_integration.py:24: in wait_for_main
    data = client.main.status().session['data']
../../ocs/ocs/ocs_client.py:29: in status
    return OCSReply(*client.request('status', name))
../../ocs/ocs/client_http.py:100: in request
    return self.call(self.agent_addr + '.ops', action, op_name, params, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ocs.client_http.ControlClient object at 0x739e2c98efd0>, procedure = 'observatory.hwp-pid.ops', args = ('status', 'main', {}), kwargs = {}
params = '{"procedure": "observatory.hwp-pid.ops", "args": ["status", "main", {}], "kwargs": {}}', r = <Response [200]>
decoded = {'args': ['no callee registered for procedure <observatory.hwp-pid.ops>'], 'error': 'wamp.error.no_such_procedure', 'kwargs': {}}

    def call(self, procedure, *args, **kwargs):
        # curl -H "Content-Type: application/json"
        #     -d '{"procedure": "observatory.acu1",
        #          "args": ["get_tasks"]}'
        #     http://127.0.0.1:8001/call
        params = json.dumps({'procedure': procedure,
                             'args': args, 'kwargs': kwargs})
        try:
            r = requests.post(self.call_url, data=params)
        except requests.exceptions.ConnectionError:
            raise ControlClientError([0, 0, 0, 0, 'client_http.error.connection_error',
                                      ['Failed to connect to %s' % self.call_url], {}])
        if r.status_code != 200:
            raise ControlClientError([0, 0, 0, 0, 'client_http.error.request_error',
                                      ['Server replied with code %i' % r.status_code], {}])
        decoded = r.json()
        if 'error' in decoded:
            # Return errors in the same way wampy does, roughly.
>           raise ControlClientError([0, 0, 0, 0, decoded['error'], decoded['args'], decoded['kwargs']])
E           ocs.client_http.ControlClientError: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <observatory.hwp-pid.ops>'], {}]

../../ocs/ocs/client_http.py:41: ControlClientError
-------------------------------------------------------------------- Captured stdout setup --------------------------------------------------------------------
Crossbar server not online yet, waiting 5 seconds.
Crossbar server online.
/home/koopman/git/socs/tests
-------------------------------------------------------------------- Captured stderr setup --------------------------------------------------------------------
HWP: Starting update thread
PID: Device emulator waiting for tcp client connection
PID: Client connection made from ('127.0.0.1', 56186)
PID: msg='*W02400000'
PID: Setting direction: reverse
PID: response='asdfl'
PID: msg='*X01'
PID: response='X010.000'
--------------------------------------------------------------------- Captured log setup ----------------------------------------------------------------------
INFO     HWP:hwp_emulator.py:103 Starting update thread
INFO     PID:device_emulator.py:256 Device emulator waiting for tcp client connection
INFO     PID:device_emulator.py:258 Client connection made from ('127.0.0.1', 56186)
DEBUG    PID:device_emulator.py:282 msg='*W02400000'
INFO     PID:hwp_emulator.py:178 Setting direction: reverse
DEBUG    PID:device_emulator.py:293 response='asdfl'
DEBUG    PID:device_emulator.py:282 msg='*X01'
DEBUG    PID:device_emulator.py:293 response='X010.000'
-------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------
PID: msg='*R01'
PID: response='R010000'
PID: msg='*R02'
PID: response='0'
---------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------
DEBUG    PID:device_emulator.py:282 msg='*R01'
DEBUG    PID:device_emulator.py:293 response='R010000'
DEBUG    PID:device_emulator.py:282 msg='*R02'
DEBUG    PID:device_emulator.py:293 response='0'

---------- coverage: platform linux, python 3.11.9-final-0 -----------
Coverage HTML written to dir htmlcov

=================================================================== short test summary info ===================================================================
FAILED integration/test_hwp_pid_agent_integration.py::test_hwp_rotation_set_direction - ocs.client_http.ControlClientError: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <observatory.hwp-pid.ops>'], {}]
ERROR integration/test_hwp_pid_agent_integration.py::test_hwp_rotation_set_direction - RuntimeError: Agent timed out.
========================================================== 1 failed, 7 deselected, 1 error in 35.31s ==========================================================

We see the agent output, and the extra Exception that we raised related to the agent timeout. This is the functionality this PR is trying to add. Great!

Issues with Errors in Deferreds

However, there's another situation in this branch that I don't know how to handle currently, and that's when there is an unhandled error in a deferred within the task being tested. This happens in the test_hwp_rotation_get_state test. If we run:

$ python -m pytest --cov --cov-report html integration/test_hwp_pid_agent_integration.py -k 'get_state'

then we see the test start:

===================================================================== test session starts =====================================================================
platform linux -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/koopman/git/socs/tests
configfile: pytest.ini
plugins: cov-5.0.0, dependency-0.6.0, docker-3.1.1, twisted-1.14.2, order-1.2.1
collected 8 items / 7 deselected / 1 selected

integration/test_hwp_pid_agent_integration.py

However, in the agent logs to disk we see (starting at the crash -- the rest is just normal agent startup and command execution for this task):

2024-08-22T21-47-32.226948 main:0 CRASH: [Failure instance: Traceback: <class 'AttributeError'>: 'str' object has no attribute 'msg_type'
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:1045:_bootstrap_inner
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:982:run
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_threadworker.py:49:work
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:269:inContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:285:<lambda>
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:117:callWithContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:82:callWithContext
/home/koopman/git/ocs/ocs/ocs_agent.py:984:_running_wrapper
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:174:main
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:121:_get_data_and_publish
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:35:get_pid_state
/home/koopman/git/socs/socs/agents/hwp_pid/drivers/pid_controller.py:307:get_direction
]
2024-08-22T21-47-32.227409 main:0 Status is now "done".
2024-08-22T21-47-49.758156 caught SIGINT!
2024-08-22T21-47-49.758791 Stopping all running sessions
2024-08-22T21-47-49.759159 Stopping session main
2024-08-22T21-47-49.759645 Stopping session get_state
2024-08-22T21-47-49.760110 abort called for get_state
2024-08-22T21-47-49.761047 Unhandled error in Deferred:
2024-08-22T21-47-49.761390

This has to do with the Deferred from action.deferred:
https://github.com/simonsobs/socs/blob/cf89f1f32405f350d6f1bf557c80602363cc22b9/socs/agents/hwp_pid/agent.py#L315-L317

I can add an Errback and this always exits. (It very rarely exits cleanly without, just to mention.) I just don't know how to exit the agent in this event at the moment. Open to suggestions.

Agent Not Terminating

I don't really know how to reproduce the behavior of the agent not exiting on SIGINT, so I didn't test this. I didn't really touch that code though, so probably ok? Let me know if you know how to test this.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

Without this the output from the agent can get stuck in the buffer and hidden from the user when raise_subprocess() is called, returning just an empty string.

This returns stdout/stderr as strings rather than byte-strings, making the formatting nicer when we print.

Note this does not yet print output from stdout/stderr, but does successfully abort tests if the agent hangs.

BrianJKoopman · 2024-08-28T20:53:22Z

@jlashner Alright, so I tried replacing the SIGINT in the interrupt timer with a SIGKILL and it does handle the case with the Deferreds. In the typical case it'll also work, it just won't be a graceful shutdown of the agent, cutting off the usual:

2024-08-22T17:44:24-0400 caught SIGINT!
2024-08-22T17:44:24-0400 Stopping all running sessions
2024-08-22T17:44:24-0400 Stopping session main
2024-08-22T17:44:27-0400 stopping reactor
2024-08-22T17:44:27-0400 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
2024-08-22T17:44:27-0400 Stopping all running sessions
2024-08-22T17:44:27-0400 Stopping session main
2024-08-22T17:44:30-0400 transport disconnected
2024-08-22T17:44:30-0400 waiting for reconnection
2024-08-22T17:44:30-0400 waiting at least 9.999995708465576 more seconds before giving up
2024-08-22T17:44:30-0400 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7c59f6d5cc50> in 1.559190937396653 seconds.
2024-08-22T17:44:30-0400 Main loop terminated.

I think this is probably fine, so I'll update and push a commit with this change.

This better handles the case of agents that run into errors on shutdown and hang.

BrianJKoopman added 5 commits August 21, 2024 18:55

Disable output buffering in agent subprocess

b945b6f

Without this the output from the agent can get stuck in the buffer and hidden from the user when raise_subprocess() is called, returning just an empty string.

Run subprocess in text mode

2819767

This returns stdout/stderr as strings rather than byte-strings, making the formatting nicer when we print.

Move subprocess management to new class

4925a7e

Interrupt agent after timeout expires

e5ec512

Note this does not yet print output from stdout/stderr, but does successfully abort tests if the agent hangs.

Print stdout/stderr if agent times out

dc886b4

BrianJKoopman requested review from jlashner and mhasself August 22, 2024 22:04

BrianJKoopman mentioned this pull request Aug 27, 2024

Added ability to query more temperatures in agent acq simonsobs/socs#730

Open

6 tasks

Send a SIGKILL if agent timesout instead of SIGINT

5dd79e2

This better handles the case of agents that run into errors on shutdown and hang.

BrianJKoopman merged commit fe03421 into main Aug 29, 2024

BrianJKoopman deleted the koopman/agent-fixture-timeout-class branch August 29, 2024 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interrupt agent within pytest fixture if it crashes during testing#401

Interrupt agent within pytest fixture if it crashes during testing#401
BrianJKoopman merged 6 commits intomainfrom
koopman/agent-fixture-timeout-class

BrianJKoopman commented Aug 22, 2024

Uh oh!

BrianJKoopman commented Aug 28, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

BrianJKoopman commented Aug 22, 2024

Description

Motivation and Context

How Has This Been Tested?

Testing the "Early Exit" Functionality

Testing the Timeout on a Crashed Agent

Issues with Errors in Deferreds

Agent Not Terminating

Types of changes

Checklist:

Uh oh!

BrianJKoopman commented Aug 28, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant