Conversation

@sjperkins sjperkins commented Mar 24, 2022

@sjperkins sjperkins marked this pull request as draft March 24, 2022 09:02
@sjperkins
Member Author

This PR is based on top of #5954 and should be rebased when #5954 is merged.

TODO:

  • At present, assert_story's monotonically increasing timestamp requirement doesn't work well with cluster-wide stories.
  • The Client currently requests stories for keys from the Scheduler, which in turn requests stories from the Workers. It may be possible for the Client to request stories from the Workers directly.
  • Comm errors when requesting stories are not gracefully handled.


github-actions bot commented Mar 24, 2022

Unit Test Results

  • 12 files ±0, 12 suites ±0 — 6h 19m 36s ⏱️ (-16m 38s)
  • 2 673 tests +4: 2 591 passed +4, 81 skipped -1, 1 failed +1
  • 15 942 runs +24: 15 077 passed +22, 859 skipped -4, 6 failed +6

For more details on these failures, see this check.

Results for commit 2888b26. ± Comparison against base commit 6dd928b.

♻️ This comment has been updated with latest results.


sjperkins commented Mar 24, 2022

From #5872

The failure behaviour of timed-out workers should be configurable (raise, ignore). TBD: should there be a sentinel if ignored?

and #5987 (comment)

Comm errors when requesting stories are not gracefully handled.

asyncio.gather(..., return_exceptions=True) offers a way forward.
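A minimal, self-contained sketch of that approach; the worker addresses and the `fetch_story` helper here are hypothetical stand-ins for the scheduler's per-worker `send_recv` calls:

```python
import asyncio

async def fetch_story(addr):
    # Stand-in for a per-worker story request; one worker's comm fails.
    if addr == "bad":
        raise OSError(f"comm to {addr} failed")
    return [("key", "compute", addr)]

async def main():
    # return_exceptions=True converts raised exceptions into result values,
    # so a single failing worker does not abort the whole gather.
    return await asyncio.gather(
        *(fetch_story(a) for a in ["w1", "bad", "w2"]),
        return_exceptions=True,
    )

results = asyncio.run(main())
```

The caller can then inspect `results` and decide whether to raise, drop, or report the per-worker exceptions.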

@sjperkins sjperkins self-assigned this Mar 24, 2022

```python
async def get_story(self, *args, **kw):
    await self.unblock_worker.wait()
    return super().get_story(*args, **kw)
```

Suggested change:

```diff
-    return super().get_story(*args, **kw)
+    return await super().get_story(*args, **kw)
```

```python
        return await task
    except Exception:
        if on_error == "raise":
            task.cancel()
```

@graingert graingert Mar 24, 2022


The task will always be done at this point, because of the `return await task` above.
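A small sketch of why that is (hypothetical coroutine, not the PR's code): once `await task` has returned, the task has finished, so a subsequent `task.cancel()` has no effect and returns `False`:

```python
import asyncio

async def work():
    return 42

async def main():
    task = asyncio.ensure_future(work())
    result = await task        # the task is guaranteed done after this line
    cancelled = task.cancel()  # cancelling an already-finished task is a no-op
    return result, task.done(), cancelled

result, done, cancelled = asyncio.run(main())
```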

Comment on lines 7499 to 7519
```python
        bits = [
            (ws, await self.rpc.connect(ws.address)) for ws in self.workers.values()
        ]
        tasks = []

        for _, worker_comm in bits:
            coro = send_recv(comm=worker_comm, reply=True, op="get_story", keys=keys)
            tasks.append(asyncio.ensure_future(coro))

        try:
            worker_stories = await asyncio.gather(
                *tasks, return_exceptions=on_error == "ignore"
            )
        except Exception:
            for task in tasks:
                task.cancel()

            raise
        finally:
            for worker_state, worker_comm in bits:
                self.rpc.reuse(worker_state.address, worker_comm)
```

@graingert graingert Mar 24, 2022


I think reproducing the pattern from

```python
        async def send_message(addr):
            try:
                comm = await self.rpc.connect(addr)
                comm.name = "Scheduler Broadcast"
                try:
                    resp = await send_recv(
                        comm, close=True, serializers=serializers, **msg
                    )
                finally:
                    self.rpc.reuse(addr, comm)
                return resp
            except Exception as e:
                logger.error(f"broadcast to {addr} failed: {e.__class__.__name__}: {e}")
                if on_error == "raise":
                    raise
                elif on_error == "return":
                    return e
                elif on_error == "return_pickle":
                    return dumps(e, protocol=4)
                elif on_error == "ignore":
                    return ERROR
                else:
                    raise ValueError(
                        "on_error must be 'raise', 'return', 'return_pickle', "
                        f"or 'ignore'; got {on_error!r}"
                    )

        results = await All(
            [send_message(address) for address in addresses if address is not None]
        )
        return {k: v for k, v in zip(workers, results) if v is not ERROR}
```

would be cleaner than using gather/return_exceptions/cancel, since those interact in slightly surprising ways.
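A self-contained sketch of that per-coroutine error-handling shape; the failing address, the `send_message` body, and the `ERROR` sentinel are simplified stand-ins, and `asyncio.gather` substitutes for distributed's `All`:

```python
import asyncio

ERROR = object()  # sentinel marking an ignored failure

async def send_message(addr, on_error="ignore"):
    # Stand-in for the rpc.connect/send_recv round trip above.
    try:
        if addr == "bad":
            raise OSError(f"send to {addr} failed")
        return {"addr": addr, "status": "OK"}
    except Exception as e:
        # Each coroutine handles its own failure, so no cross-task
        # cancellation semantics come into play.
        if on_error == "raise":
            raise
        elif on_error == "return":
            return e
        elif on_error == "ignore":
            return ERROR
        raise ValueError(f"unknown on_error: {on_error!r}")

async def main():
    addrs = ["w1", "bad", "w2"]
    results = await asyncio.gather(*(send_message(a) for a in addrs))
    # Failed workers are simply dropped from the final mapping.
    return {a: r for a, r in zip(addrs, results) if r is not ERROR}

out = asyncio.run(main())
```

Because failures never escape `send_message`, the outer gather sees only ordinary results and no cleanup or cancellation logic is needed around it.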


fjetter commented May 17, 2022

Closed by #6161

@fjetter fjetter closed this May 17, 2022


Development

Successfully merging this pull request may close these issues.

Client.story - Support collecting cluster-wide story for a key or stimulus ID
