Failed to fetch log file from worker. 404 Client Error: NOT FOUND #30743

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

If a retry of a task gets scheduled on a different worker host from a previous attempt, the logs from that previous attempt become unavailable from the webserver.

nkcairflow02.tradebot.com
*** Log file does not exist: /opt/tradebot/logs/airflow/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
*** Fetching from: http://nkcairflow02.tradebot.com:8793/log/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://nkcairflow02.tradebot.com:8793/log/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
For more information check: https://httpstatuses.com/404

From my observation, this is happening because Airflow looks for logs at the most recent hostname, not at the hostname that actually ran the historical attempt. In the example above, attempt 2 actually ran on nkcairflow06, but the webserver tries to fetch its logs from nkcairflow02 because that is where attempt 3 ran. You can watch this happen "live" with the example DAG below: each time the task retries on a different worker, the URL used to fetch the historical logs changes to use the most recent hostname.
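For illustration, here is a rough sketch (not the actual Airflow source) of how the worker log URL appears to be built, based on the "*** Fetching from:" message above. The key point is that the hostname comes from a single field on the task instance that is overwritten by each new attempt; the function name below is just made up for illustration:

# Rough sketch only -- not the actual Airflow implementation.
WORKER_LOG_SERVER_PORT = 8793  # [logging] worker_log_server_port, default 8793

def worker_log_url(ti, log_relative_path: str) -> str:
    # ti.hostname holds the host of the *latest* attempt, not the attempt
    # whose log is being requested -- which is the suspected root cause here.
    return f"http://{ti.hostname}:{WORKER_LOG_SERVER_PORT}/log/{log_relative_path}"

# For attempt=2 (which actually ran on nkcairflow06) this still produces
#   http://nkcairflow02.tradebot.com:8793/log/...attempt=2.log
# because attempt 3 ran on nkcairflow02 and overwrote ti.hostname.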

What you think should happen instead

Logs from all previous attempts should be available from the webserver (as long as the log files still exist on disk).

How to reproduce

#!/usr/bin/env python3
import datetime
import logging
import socket

from airflow.decorators import dag, task


logger = logging.getLogger(__name__)


@dag(
    schedule_interval='@daily',
    start_date=datetime.datetime(2023, 4, 19),
    default_args={
        'retries': 9,          # retry enough times for attempts to land on different workers
        'retry_delay': 10.0,   # seconds
    },
)
def test_logs():
    @task
    def log():
        # Record which worker this attempt ran on, then fail to force a retry.
        logger.info(f'Running from {socket.gethostname()}')
        raise RuntimeError('force fail')

    log()


test_logs()
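Note: you need a deployment with at least two workers (e.g. a CeleryExecutor setup like the one described under Deployment details) to see the problem, since with a single worker every attempt's logs end up on the same host.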

Operating System

CentOS Stream 8

Versions of Apache Airflow Providers

N/A

Deployment

Other

Deployment details

Airflow 2.5.1
Self-hosted
Postgres DB backend
CeleryExecutor

Anything else

Related issues:

Looks like #23178 attempted to fix the issue, but it never got merged.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
