KPO async logging callback #60778
Conversation
SameerMesiah97 left a comment:
I left a few comments. Please review.
```python
if mode == ExecutionMode.ASYNC:
    return f(*args, mode=mode, **kwargs)
return asyncio.run(f(*args, mode=mode, **kwargs))
```
Are you confident that serializable_callback will not be called from within a running event loop? Since this is a helper, we cannot guarantee that it will not be invoked from within a pre-existing event loop, and that would result in a RuntimeError. I believe it would be better to not special-case ASYNC mode and let the wrapper return the function as-is regardless of ExecutionMode, without invoking asyncio.run(). Unless you have a strong reason to do it this way.
Well, the idea was to define a helper function that could be used in a callback like:
```python
class AsyncCallback(KubernetesPodOperatorCallback):
    @staticmethod
    @serializable_callback
    async def progress_callback(*, line: str, **kwargs) -> None:
        ...
```
so that the same callback could be used in the triggerer or in the operator, which invokes it when the triggerer hands back to the operator if there are remaining logs in the pod that haven't been processed. Ideally the callback should then be written in async form, since a sync callback would block in the triggerer, though that obviously depends on what the callback is doing.
Invocations of these callbacks are in the operator/triggerer, I don't know of anywhere else they are used.
So the invocation of @serializable_callback is not being guarded, and you are relying on the user correctly inferring the intent of the function? I agree with your motivation for introducing this helper, but we cannot guarantee that it will not be called within an event loop in the operator. Is there any reason why the operator cannot do something like this:
```python
asyncio.run(callback(...))
```
If that is not feasible, I believe that at the very minimum:
1. The possibility of encountering RuntimeError should be documented very clearly (docstring and comment).
2. The RuntimeError should be caught in a try/except block with a more informative error message.
Below is a suggested implementation if you are still intent on keeping 2 separate modes:
```python
def serializable_callback(f):
    """
    Convert an async callback so it can run in sync or async mode.

    In ASYNC mode (e.g. triggerer), the callback is expected to be awaited
    by the caller. In SYNC mode (e.g. operator fallback), the callback is
    executed via asyncio.run(); callers should ensure this is only used
    when no event loop is already running.
    """

    @wraps(f)
    def wrapper(*args, mode: str, **kwargs):
        if mode == ExecutionMode.ASYNC:
            return f(*args, mode=mode, **kwargs)
        # SYNC mode owns the event loop; calling this while a loop is already
        # running is a hard error and indicates a misclassified execution context.
        try:
            return asyncio.run(f(*args, mode=mode, **kwargs))
        except RuntimeError as e:
            raise RuntimeError(
                "Cannot call serializable_callback in SYNC mode while an event "
                "loop is running. Use ExecutionMode.ASYNC and await the callback "
                "instead."
            ) from e

    return wrapper
```
This will immediately inform the user of the reason for the RuntimeError and mitigate against further unsafe usage.
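To illustrate the failure mode being discussed, here is a self-contained sketch of the suggested wrapper. `ExecutionMode` is a minimal stand-in for the class referenced in the PR, and `progress_callback` is a dummy callback, both assumptions for illustration only:

```python
import asyncio
from functools import wraps

# Minimal stand-in for the ExecutionMode referenced in the PR (assumption).
class ExecutionMode:
    SYNC = "sync"
    ASYNC = "async"

def serializable_callback(f):
    @wraps(f)
    def wrapper(*args, mode: str, **kwargs):
        if mode == ExecutionMode.ASYNC:
            # ASYNC mode: hand the coroutine back for the caller to await.
            return f(*args, mode=mode, **kwargs)
        try:
            # SYNC mode: we own the event loop.
            return asyncio.run(f(*args, mode=mode, **kwargs))
        except RuntimeError as e:
            raise RuntimeError(
                "Cannot call serializable_callback in SYNC mode while an "
                "event loop is running; use ExecutionMode.ASYNC and await "
                "the callback instead."
            ) from e
    return wrapper

@serializable_callback
async def progress_callback(*, line: str, mode: str, **kwargs) -> None:
    pass  # dummy body for the demo

# SYNC mode outside any loop: asyncio.run() succeeds.
progress_callback(line="ok", mode=ExecutionMode.SYNC)

# SYNC mode inside a running loop: asyncio.run() fails, and the wrapper
# re-raises with the informative message.
async def misuse():
    progress_callback(line="boom", mode=ExecutionMode.SYNC)

try:
    asyncio.run(misuse())
except RuntimeError as e:
    print("caught:", e)
```

Running this shows the informative error surfacing immediately instead of the bare "asyncio.run() cannot be called from a running event loop".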
```python
    pod=pod,
)
if asyncio.iscoroutine(cb):
    await cb
```
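For reference, the dispatch pattern in the diff above can be reproduced in isolation: invoke a callback that may be sync or async, and await the result only if it is a coroutine. The names below are illustrative, not the operator's actual API:

```python
import asyncio

received = []

def sync_cb(line):
    received.append(("sync", line))

async def async_cb(line):
    received.append(("async", line))

async def emit_progress(callback, line):
    # A sync callback returns None; an async callback returns a coroutine
    # that still needs to be awaited.
    cb = callback(line)
    if asyncio.iscoroutine(cb):
        await cb

async def main():
    await emit_progress(sync_cb, "hello")
    await emit_progress(async_cb, "world")

asyncio.run(main())
print(received)  # [('sync', 'hello'), ('async', 'world')]
```

This is what lets the same call site accept either callback style without two code paths.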
Should we really be executing arbitrary user code inside the triggerer? Even though this is async, a long-running or blocking progress callback (for example calling an external API without a timeout) can still starve the triggerer’s event loop. That at least blocks the trigger executing it, and potentially other triggers handled by the same triggerer process. This feels like a fairly big design trade-off just to support progress logging, and I’m not sure it’s worth it.
The specific use case I have in mind is in support of the watcher pattern that is being implemented in astronomer cosmos for running DBT. astronomer/astronomer-cosmos#2207
With this you have a pod that's actually running DBT, and an Airflow task that is parsing Kubernetes logs to extract DBT events and create XCom variables that are consumed by sensors. I think the parsing of the logs and setting of XCom variables should be lightweight enough that it can be run from the triggerer. Implementing this from the triggerer was part of the original design of the deferred mode for KPO, but the implementation ran into issues so it was stripped back out.
I did wonder if I should check the callbacks and only pass the ones that actually implement progress_callback, to make everything a little lighter.
The key phrase here is “should be lightweight”. Can we really guarantee that? Reading the code, it doesn’t seem like there are any restrictions on what a callback could execute, so we are effectively trusting users not to do any heavy lifting in them.
One possible compromise would be to enforce a small global timeout for callbacks (e.g. a few seconds at most). However, even with that, this still sets a precedent for executing arbitrary user code in the triggerer.
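A sketch of that compromise, assuming a hypothetical `CALLBACK_TIMEOUT` setting and helper (neither exists in Airflow today): bound each user callback with `asyncio.wait_for` so a slow or blocked callback cannot starve the triggerer's event loop indefinitely.

```python
import asyncio

CALLBACK_TIMEOUT = 0.1  # hypothetical global limit, in seconds
results = []

async def run_callback_guarded(cb, **kwargs):
    try:
        # wait_for cancels the callback task once the timeout expires.
        await asyncio.wait_for(cb(**kwargs), timeout=CALLBACK_TIMEOUT)
    except asyncio.TimeoutError:
        # Record and move on; one misbehaving callback should not block others.
        results.append("timed out")

async def fast_cb(*, line):
    results.append(f"processed: {line}")

async def slow_cb(*, line):
    await asyncio.sleep(5)  # simulates a hung external API call

async def main():
    await run_callback_guarded(fast_cb, line="ok")
    await run_callback_guarded(slow_cb, line="never")

asyncio.run(main())
print(results)  # ['processed: ok', 'timed out']
```

Note this bounds how long a callback can run, but a callback doing blocking (non-async) work would still freeze the loop until cancellation can take effect, so it mitigates rather than eliminates the concern.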
I agree the motivation here is solid and I can see the value of the feature, but this crosses into a fundamental triggerer design decision, which I’m not comfortable approving or disapproving unilaterally.
@jscheffl I know you requested my review — I’d be interested in your thoughts on whether this is a precedent we’re happy to set for triggerers.
Left more comments. I think this needs some discussion.
When running the KubernetesPodOperator in deferred mode, we have implemented support for most callbacks; however, the progress_callback one was still missing. This PR implements that callback from the triggerer, along with a fallback to the operator if any logs were missing from what the triggerer reports.
When the callbacks were first added, the original committer noted in #35714 that:
> We could probably do something a lot more complicated to serialize/deserialize the callbacks, but as the callbacks need to be static anyway, I'm just putting their module path as a string and importing when de-serializing the trigger. There might be a need to do something more complicated, but I think this is better than nothing.
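The scheme described in that quote can be sketched in a few lines: serialize a static callback class as its `"module.QualName"` string and re-import it on deserialization. The helper names are illustrative, not the provider's actual API, and the sketch assumes the class is defined at module top level:

```python
import importlib

def serialize_callback(cls) -> str:
    # Record only the import path; the class itself never crosses the wire.
    return f"{cls.__module__}.{cls.__qualname__}"

def deserialize_callback(path: str):
    # Split on the last dot: everything before it is the module,
    # the rest is a top-level attribute of that module.
    module_name, _, name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, name)

# Round-trip, using a stdlib class as a stand-in for a callback class:
from collections import OrderedDict

path = serialize_callback(OrderedDict)
print(path)  # collections.OrderedDict
assert deserialize_callback(path) is OrderedDict
```

This is also why the callbacks need to be static: only the path is serialized, so any per-instance state would be lost on the triggerer side.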