Return 410 Gone for heartbeat when cleared TI exists in TIH #61631

Open
andreahlert wants to merge 6 commits into apache:main from andreahlert:fix-heartbeat-410-gone-cleared-ti

@andreahlert
Contributor

Closes: #53140

When a running Task Instance (TI) is cleared, prepare_db_for_next_try() archives the current try to the Task Instance History (TIH) table and assigns the TI a new UUID. Heartbeats from the old process then fail with a 404 "Not Found" because the old UUID no longer exists in the TI table. A generic 404 is misleading here: the TI did exist, it was just cleared.

This PR adds a TIH lookup in the NoResultFound handler of the heartbeat endpoint:

  • If the UUID is not in TI but is in TIH: return 410 Gone, indicating the TI was cleared/moved
  • If the UUID is not in TI and not in TIH: return 404 Not Found (unchanged behavior)

This gives the task SDK supervisor a more specific signal and matches HTTP semantics (410 means "the target resource is no longer available at the origin server and this condition is likely to be permanent").
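
As a rough illustration of the lookup order described above, here is a minimal, self-contained sketch. It is not the code from this PR: the function name `heartbeat_status` and the in-memory sets standing in for the TI and TIH tables are hypothetical, the 204 success status is an assumption, and the endpoint's other outcomes (such as 409 Conflict) are left out.

```python
from http import HTTPStatus
from uuid import UUID, uuid4


def heartbeat_status(
    ti_id: UUID,
    live_ti_ids: set[UUID],
    archived_ti_ids: set[UUID],
) -> HTTPStatus:
    """Pick a heartbeat response status for a task instance UUID.

    live_ti_ids stands in for the TI table and archived_ti_ids for the
    TIH table; both are hypothetical stand-ins, not Airflow models.
    """
    if ti_id in live_ti_ids:
        # Assumed success status for an accepted heartbeat.
        return HTTPStatus.NO_CONTENT
    if ti_id in archived_ti_ids:
        # Not in TI but present in TIH: the TI was cleared and archived.
        return HTTPStatus.GONE
    # Not in TI and not in TIH: the UUID was never known.
    return HTTPStatus.NOT_FOUND


# Example: one running TI, one cleared (archived) try, one unknown UUID.
running_id, cleared_id, unknown_id = uuid4(), uuid4(), uuid4()
live = {running_id}
archived = {cleared_id}
for candidate in (running_id, cleared_id, unknown_id):
    print(candidate, heartbeat_status(candidate, live, archived).name)
```

The point of the ordering is that the TIH lookup only happens on the miss path, so the common case (a live TI) pays no extra cost.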

Changes:

  • execution_api/routes/task_instances.py: Added TIH existence check in the heartbeat NoResultFound handler; added 410 to OpenAPI response schema
  • execution_time/supervisor.py: Added HTTPStatus.GONE to the set of status codes that the supervisor treats as "stop heartbeating" (alongside NOT_FOUND and CONFLICT)
  • test_task_instances.py: Added test using prepare_db_for_next_try() to properly simulate a cleared TI and verify the 410 response
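
The supervisor-side change can be pictured in the same way. This is a sketch under assumptions rather than the task SDK's actual code: `STOP_HEARTBEAT_STATUSES` and `should_stop_heartbeating` are hypothetical names, and only the three status codes come from the change list above.

```python
from http import HTTPStatus

# Statuses after which the supervisor would stop heartbeating, per the
# change list above. The constant name is hypothetical.
STOP_HEARTBEAT_STATUSES = frozenset(
    {HTTPStatus.NOT_FOUND, HTTPStatus.CONFLICT, HTTPStatus.GONE}
)


def should_stop_heartbeating(status_code: int) -> bool:
    """Return True if a heartbeat response means the TI is no longer ours."""
    # HTTPStatus is an IntEnum, so plain ints compare equal to its members.
    return status_code in STOP_HEARTBEAT_STATUSES


for code in (204, 404, 409, 410, 500):
    print(code, should_stop_heartbeating(code))
```

Treating 410 like 404 and 409 keeps the supervisor's behavior unchanged for cleared tasks; the new status mainly makes the reason visible in logs and responses.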

Previous attempt: #56443 (closed as stale). This PR addresses the review feedback from that PR by actually checking the TIH table rather than blindly changing 404 to 410.

@boring-cyborg boring-cyborg bot added the area:API and area:task-sdk labels Feb 8, 2026
@amoghrajesh
Contributor

It's an intermittent failure, restarting it and taking a look

Contributor

@amoghrajesh amoghrajesh left a comment

Small comments, otherwise LGTM thanks!

@andreahlert
Contributor Author

> Small comments, otherwise LGTM thanks!

Done! Thanks for reviewing!

@amoghrajesh
Contributor

@andreahlert can you check the failing tests?

@amoghrajesh
Contributor

Tests need fixing

@potiuk potiuk marked this pull request as draft March 10, 2026 23:54
@potiuk
Member

potiuk commented Mar 10, 2026

@andreahlert This PR has been converted to draft because it does not yet meet our Pull Request quality criteria.

Issues found:

  • Other failing CI checks: Failing: Basic tests / React UI tests. Run prek run --from-ref main locally to reproduce. See static checks docs.

Note: Your branch is 488 commits behind main. Some check failures may be caused by changes in the base branch rather than by your PR. Please rebase your branch and push again to get up-to-date CI results.

What to do next:

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • Maintainers will then proceed with a normal review.

Converting a PR to draft is not a rejection — it is an invitation to bring the PR up to the project's standards so that maintainer review time is spent productively. If you have questions, feel free to ask on the Airflow Slack.

andreahlert and others added 5 commits March 11, 2026 15:38
When a running task instance is cleared, its previous try is archived
to the Task Instance History table and the TI receives a new UUID.
Subsequent heartbeats from the old process get a 404 because the old
UUID no longer exists in the TI table.

This change improves the error handling by checking the TIH table when
a heartbeat TI is not found. If the UUID exists in TIH, return 410
Gone instead of 404 Not Found, giving the client a more specific
signal that the task was cleared rather than never existing.

- Server: check TIH on heartbeat NoResultFound, return 410 if found
- Supervisor: handle 410 Gone same as 404/409 (terminate process)
- Keep 404 for TIs that genuinely never existed

closes: apache#53140
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
… message

- Replace undefined ti_id_str with task_instance_id in TIH query and log
- Use task_instance_id (UUID) for TIH.task_instance_id comparison
- Set 410 Gone detail message to match test expectation
@andreahlert andreahlert force-pushed the fix-heartbeat-410-gone-cleared-ti branch from 0a744f6 to 0ef9c07 Compare March 11, 2026 19:06
@andreahlert andreahlert marked this pull request as ready for review March 11, 2026 20:58
@andreahlert andreahlert marked this pull request as draft March 12, 2026 03:26
@andreahlert andreahlert marked this pull request as ready for review March 12, 2026 03:26
@eladkal eladkal modified the milestones: Airflow 3.2.0, Airflow 3.1.9 Mar 12, 2026
@eladkal eladkal added the type:bug-fix and backport-to-v3-1-test labels Mar 12, 2026
@eladkal
Contributor

eladkal commented Mar 12, 2026

Tests are failing

@andreahlert
Contributor Author

andreahlert commented Mar 12, 2026

Rebased on main. The failure was a GitHub Actions infra issue; let's see if it goes green again.

@andreahlert
Contributor Author

Rebased and ready for review. @amoghrajesh @ashb @kaxil

Labels

  • area:API (Airflow's REST/HTTP API)
  • area:task-sdk
  • backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch)
  • type:bug-fix (Changelog: Bug Fixes)

Development

Successfully merging this pull request may close these issues.

Better error handling for running ti heartbeats after task is cleared

4 participants