
Double counting of network transfer cost #7003

Description

@fjetter

We're double counting the estimated network transfer cost in multiple places.

First, we calculate the estimated network cost of the dependencies a worker would need to fetch in _set_duration_estimate and store the result in WorkerState.processing, i.e. processing = compute + comm.
This estimate is also used to set the worker's occupancy.
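
Roughly, and ignoring the exponential averaging the scheduler actually does, the estimate behaves like the following sketch (the helper names and numbers here are illustrative, not the actual scheduler code):

# Illustrative sketch only -- not the real _set_duration_estimate. It shows how
# a per-task estimate that already folds in the transfer cost ends up in the
# worker's occupancy.
def estimated_comm_cost(nbytes_to_fetch: float, bandwidth: float) -> float:
    # Time to fetch the dependencies that are not yet on the worker.
    return nbytes_to_fetch / bandwidth

def set_duration_estimate(compute_estimate: float, nbytes_to_fetch: float,
                          bandwidth: float) -> float:
    # processing = compute + comm
    return compute_estimate + estimated_comm_cost(nbytes_to_fetch, bandwidth)

# occupancy is (roughly) the sum of these per-task estimates, so the comm
# cost is already baked into it.
processing = {
    "task-1": set_duration_estimate(1.0, 100e6, bandwidth=100e6),  # 1s compute + 1s comm
    "task-2": set_duration_estimate(0.5, 0, bandwidth=100e6),      # no transfer needed
}
occupancy = sum(processing.values())  # 2.5s, of which 1s is comm cost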

When making a scheduling decision, we typically use Scheduler.worker_objective, which calculates a start_time defined as

stack_time: float = ws.occupancy / ws.nthreads                # occupancy already includes comm cost
start_time: float = stack_time + comm_bytes / self.bandwidth  # comm cost added a second time

i.e.

start_time = ws.occupancy / ws.nthreads + comm_bytes / self.bandwidth
           = ws.occupancy / ws.nthreads + comm_cost

where occupancy ~ sum( ... TaskPrefix.duration_average + comm_cost ), so the comm cost enters start_time twice. Two things are wrong here (see the sketch below):

  1. comm cost should be constant and not scale with nthreads, yet the copy inside occupancy is divided by nthreads
  2. we should only account for comm_cost once, yet it appears both inside occupancy and as the explicit comm_bytes / self.bandwidth term
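
A minimal numeric sketch of the effect, using made-up numbers rather than the actual Scheduler.worker_objective implementation:

# Illustrative numbers only -- a sketch of the double counting in the
# worker_objective arithmetic, not the real scheduler code.
bandwidth = 100e6          # bytes/s
nthreads = 4
compute_estimate = 1.0     # s
comm_bytes = 200e6         # bytes the candidate worker would need to fetch

comm_cost = comm_bytes / bandwidth                   # 2.0 s

# occupancy already contains the comm cost (processing = compute + comm)
occupancy = compute_estimate + comm_cost             # 3.0 s

stack_time = occupancy / nthreads                    # 0.75 s -- comm cost wrongly scaled by nthreads
start_time = stack_time + comm_cost                  # 2.75 s -- and counted a second time

# What we presumably want instead: count comm_cost exactly once and do not
# divide it by nthreads (assumption about the intended behaviour).
intended = compute_estimate / nthreads + comm_cost   # 2.25 s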

A similar double counting is introduced on the work stealing side when calculating the cost_multiplier

compute_time = ws.processing[ts]  # occupancy-style estimate, i.e. already compute + comm
transfer_time = nbytes / self.scheduler.bandwidth + LATENCY
cost_multiplier = transfer_time / compute_time

# If we ignore latency for now, this yields something like

cost_multiplier ~ (NBytes / Bandwidth) / (duration_average + NBytes / Bandwidth)
                = NBytes / (Bandwidth * duration_average + NBytes)

i.e. for network-heavy tasks this ratio converges towards 1 instead of growing large, which is quite the opposite of what the multiplier is supposed to encode: moving a network-heavy task should look expensive.
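
A quick sketch of that saturation, again with made-up numbers and not the actual work stealing code; the "intended" column assumes the multiplier should compare transfer time against pure compute time:

# Illustrative sketch of how the double counting flattens the stealing
# cost_multiplier.
bandwidth = 100e6       # bytes/s
duration_average = 0.1  # s of pure compute
LATENCY = 0.0           # ignored here, as in the derivation above

for nbytes in (1e6, 100e6, 10e9):
    transfer_time = nbytes / bandwidth + LATENCY
    compute_time = duration_average + transfer_time  # processing already includes comm
    double_counted = transfer_time / compute_time    # saturates at 1 for large nbytes
    intended = transfer_time / duration_average      # keeps growing with nbytes
    print(f"{nbytes=:.0e}  double_counted={double_counted:.2f}  intended={intended:.1f}")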
