[Ballista] Load testing for Push-based task scheduling gets stuck #2005

@yahoNanJing

Description

Describe the bug

There are two bugs.

  • Losing the SchedulerServerEvent::JobSubmitted event means the job is never scheduled again.
  • Concurrent updates to ExecutorData are not synchronized.
  1. For the first bug:
    In SchedulerServerEventAction::offer_resources, every executor in the returned available_executors may have 0 available_task_slots. In this case, no tasks are scheduled for the job and no SchedulerServerEvent::JobSubmitted is resent to the channel. As a result, the job gets stuck.

  2. For the second bug:
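    The get_executor_data and save_executor_data operations are not performed atomically, so concurrent updates can race. For example, two callers may read the same ExecutorData and the later save then overwrites the earlier one.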
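
A minimal sketch of how both problems might be addressed, using only the Rust standard library. This is not the actual Ballista code: `ExecutorStore`, `reserve_slots`, the simplified `offer_resources` signature, and the one-field `ExecutorData` are hypothetical stand-ins for illustration. The idea is to (1) do the read-modify-write of the slot count under a single lock rather than as a separate get and save, and (2) put `JobSubmitted` back on the channel whenever no slot could be reserved.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

// Hypothetical, simplified stand-in for the real event type.
#[derive(Debug, PartialEq)]
enum SchedulerServerEvent {
    JobSubmitted(String),
}

// Hypothetical, simplified stand-in: only the slot counter is modelled.
#[derive(Debug, Clone)]
struct ExecutorData {
    available_task_slots: u32,
}

// Hypothetical store that owns all executor data behind one lock.
#[derive(Default)]
struct ExecutorStore {
    executors: Mutex<HashMap<String, ExecutorData>>,
}

impl ExecutorStore {
    /// Reserve up to `wanted` slots on one executor. The read and the
    /// write happen while the lock is held, so two callers cannot both
    /// observe the same slot count and overwrite each other's update.
    fn reserve_slots(&self, executor_id: &str, wanted: u32) -> u32 {
        let mut guard = self.executors.lock().unwrap();
        match guard.get_mut(executor_id) {
            Some(data) => {
                let taken = wanted.min(data.available_task_slots);
                data.available_task_slots -= taken;
                taken
            }
            None => 0,
        }
    }
}

/// Try to reserve slots for `pending` tasks of a job and return how many
/// were reserved. If none could be reserved, resend `JobSubmitted` so the
/// job is picked up again instead of being silently dropped.
fn offer_resources(
    store: &ExecutorStore,
    job_id: &str,
    pending: u32,
    tx: &Sender<SchedulerServerEvent>,
) -> u32 {
    // Snapshot the executor ids; the lock is released before reserving.
    let ids: Vec<String> = store.executors.lock().unwrap().keys().cloned().collect();
    let mut scheduled = 0;
    for id in ids {
        if scheduled == pending {
            break;
        }
        scheduled += store.reserve_slots(&id, pending - scheduled);
    }
    if scheduled == 0 {
        tx.send(SchedulerServerEvent::JobSubmitted(job_id.to_string()))
            .expect("event receiver dropped");
    }
    scheduled
}
```

Resending immediately, as done here, would likely spin under sustained load; a real scheduler would probably want a delay, or to re-trigger scheduling when an executor releases slots. That choice is outside what this issue describes.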

To Reproduce

Run a load test with the push-based task scheduling policy, as described in #1983.


Metadata

Labels: bug
