feat(server): Asynchronous server-side background task execution#4603

Merged
bruntib merged 2 commits into Ericsson:master from bruntib:async_store_1
Aug 12, 2025
Conversation

Contributor

@bruntib bruntib commented Jun 16, 2025

Important

This is patch 1 of the Asynchronous Store Protocol (#3672).
This patch was originally authored by @whisperity in #4317.

This patch implements the whole support ecosystem for server-side background tasks, in order to lessen the load and occupancy of API handler processes, and the potentially detrimental blocking behaviour caused by long-running operations in the web-server, such as massStoreRun().

API workers vs. background/task workers

In this patch, the processing model of the CodeChecker server is extended with the concept of background workers. These special processes consume tasks off the server's queue and execute them. The previous clone processes of a server are henceforth termed API workers, and they do the same as before: respond to Thrift RPC calls.

The server will start background_worker_processes number of background processes, which is, by default, the same number as worker_processes, the number of API handlers. (Which defaults to $(nproc).)
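The defaulting chain described above can be sketched as follows; resolve_worker_counts() is a hypothetical helper for illustration, not the server's actual configuration code:

```python
import os


def resolve_worker_counts(worker_processes=None,
                          background_worker_processes=None):
    """Sketch of the defaulting chain: worker_processes defaults to the
    CPU count ($(nproc)), and background_worker_processes defaults to
    worker_processes.  Illustrative only."""
    api = worker_processes or os.cpu_count()
    background = background_worker_processes or api
    return api, background
```

Explicitly configured values win; only unset options fall back along the chain.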

Self-healing against dead children

This patch also implements a fix that resulted from reviewing and reworking both the threading and the signalling model of a CodeChecker server process tree. Previously, exceptions escaping from an API worker process (or it getting OOM-killed, etc.) would cause the initially large number of workers to die off one by one, leaving the server with a monotonically decreasing number of workers to dispatch RPC requests to. In this patch, the server is scripted to handle SIGCHLD and respawn dead child processes (both API and Task) when given the opportunity.
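The respawn idea can be sketched as below; WORKERS, spawn_worker(), and reap_and_respawn() are illustrative names, not CodeChecker's actual implementation, and the sketch assumes a POSIX system:

```python
import os
import signal
import time

WORKERS = {}  # pid -> worker kind ("api" or "background"); illustrative.


def spawn_worker(kind: str) -> int:
    """Fork one worker subprocess.  The child body is a stand-in for
    the real API/background worker main loop."""
    pid = os.fork()
    if pid == 0:
        # Child: the real server would run the worker main loop here.
        time.sleep(3600)
        os._exit(0)
    WORKERS[pid] = kind
    return pid


def reap_and_respawn(signum=None, frame=None):
    """SIGCHLD handler sketch: reap every dead child and start a
    replacement of the same kind, so the worker pool does not shrink
    one crash at a time."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # No children left to wait for.
        if pid == 0:
            return  # Remaining children are still alive.
        kind = WORKERS.pop(pid, None)
        if kind is not None:
            spawn_worker(kind)


signal.signal(signal.SIGCHLD, reap_and_respawn)
```

The WNOHANG loop is important: a single SIGCHLD delivery may stand for several dead children, so the handler must keep reaping until no more zombies remain.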

Graceful server shutdown

The refactoring of the server's life cycle and signal handling logic also includes changes which make the server process(es) more graceful in their termination, which is needed to accurately clean up pending tasks (see later).

In case a brutal and unclean server shutdown is needed, following this patch, the SIGINT or SIGTERM signal must be sent twice to the server.
(Alternatively, just use SIGKILL! 😉)

Task management

A Task is represented by two things in strict co-existence: a lightweight, pickle-able implementation of the task's execution logic (subclassed from AbstractTask and overriding _implementation()); and a corresponding BackgroundTask entity in the database. These records reside in the CONFIGURATION database, so task information is server- (or service-)wide, shared across all products.

The database-synchronised record contains several pieces of human-readable metadata about the task, such as a one-line summary field and the larger, log-like comments column, with several timestamps recorded for the crucial moments during the task's execution.

The most important flag is a task's status, which can be: ALLOCATED, ENQUEUED, RUNNING, COMPLETED, FAILED, CANCELLED, and DROPPED.
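The status values and their grouping can be summarised with a small sketch; the enum and helper below are illustrative, as the real server keeps the status on the BackgroundTask database row:

```python
import enum


class TaskStatus(enum.Enum):
    """The status values listed above (illustrative sketch)."""
    ALLOCATED = enum.auto()
    ENQUEUED = enum.auto()
    RUNNING = enum.auto()
    COMPLETED = enum.auto()   # Normal termination.
    FAILED = enum.auto()      # Normal termination.
    CANCELLED = enum.auto()   # Abnormal termination.
    DROPPED = enum.auto()     # Abnormal termination.


TERMINATION_STATES = {TaskStatus.COMPLETED, TaskStatus.FAILED,
                      TaskStatus.CANCELLED, TaskStatus.DROPPED}


def is_terminated(status: TaskStatus) -> bool:
    """A task in a termination state will never execute (again)."""
    return status in TERMINATION_STATES
```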

Normal path

The life cycle of a task is as follows:

  1. A Task is created as the result of an API request (executed in the API worker process).
    a. (Currently, for the sake of both testing and demonstration, only the createDummyTask() API, and only in a test context, can create a task.)
  2. A unique identifier, termed a task token, is ALLOCATED, together with the BackgroundTask database entity.
  3. The API worker pushes the Task object into a shared, synchronised queue within the server's memory space.
    At this point, the task is considered ENQUEUED.
    a. AbstractTask subclasses MUST be pickle-able and reasonably small.
    b. The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
  4. The task token is returned to the user via the RPC API call, and the API worker is free to respond to other requests.
  5. In a loop with some frequency, the user may exercise the getTaskInfo() API (executed in the context of any API worker process, synchronised over the database) to query whether the task has completed.
  6. A background worker process (which is a separate context, but executed in the same process tree as the server's API requests — there is NO publishing of tasks between machines!) wakes up to the enqueuement, and pops the Task object from the queue.
    After some bookkeeping, the task will be RUNNING.
  7. MyTaskClass::_implementation() is called, which executes the task's primary business logic.
  8. If _implementation() returns without a failure, the task will be considered successfully COMPLETED.
    Any exception escaping from the method will set the task to FAILED, and the exception information is logged into the BackgroundTask.comments column of the database.
    Together, these two are the "normal termination states".
  9. Eventually, the user will receive (if waiting for this) the fact that the task had completed.
  10. The background worker process goes back to sleep until a new Task is available from the queue.
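Step 5 above, as seen from the client side, can be sketched like this; the client object, the field names on the returned info object, and the status strings are illustrative, not the exact Thrift API:

```python
import time


def wait_for_task(client, token: str,
                  poll_interval: float = 5.0,
                  timeout: float = 3600.0):
    """Poll getTaskInfo() until the task reaches a termination state.
    Illustrative client-side sketch."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = client.getTaskInfo(token)
        if info.status in ("COMPLETED", "FAILED", "CANCELLED", "DROPPED"):
            return info
        time.sleep(poll_interval)
    raise TimeoutError(f"Task {token} did not finish within {timeout}s")
```

Because task status lives in the shared database, any API worker (or any server in a load-balanced cluster) can answer the poll.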

Abnormal path 1: admin cancellation

At any point following ALLOCATED status, but most likely in the ENQUEUED and RUNNING statuses, a SUPERUSER may issue a cancelTask() order.
This will set BackgroundTask.cancel_flag, and the task is expected (although not required!) to poll its own should_cancel() status at internal checkpoints and terminate gracefully in response. This is done by _implementation() exiting by raising a TaskCancelHonoured exception.
(If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.)
Tasks cancelled gracefully will have the CANCELLED status.

For example, a background task that performs an action over a set of input files should generally be implemented like this (INPUTS, ROLLBACK, and DO_LOGIC are placeholders):

```python
def _implementation(self, tm: TaskManager) -> None:
    for file in INPUTS:
        # Check the cancellation flag at a safe checkpoint.
        if tm.should_cancel(self):
            ROLLBACK()
            raise TaskCancelHonoured(self)

        DO_LOGIC(file)
```

Abnormal path 2: server shutdown

Alternatively, at any point in this life cycle, the server might receive the command to terminate itself (kill signals SIGINT, SIGTERM; alternatively caused by CodeChecker server --stop). Following the termination of API workers, the background workers will also shut down one by one.
At this point, the default behaviour is to cause a special cancel event which tasks currently RUNNING may still gracefully honour, as if it were a SUPERUSER's single-task cancel request. All other tasks that have not started executing yet and are in the ALLOCATED or ENQUEUED status will never start.

All tasks not in a normal termination state will be set to the DROPPED status, with the comments field containing a log detailing in which state the task was dropped, and why. (Together, CANCELLED and DROPPED are the "abnormal termination states", indicating that the task terminated due to some external influence.)
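The shutdown-time clean-up can be sketched as follows; drop_pending_tasks() and the dictionary-based task rows are illustrative stand-ins for the real BackgroundTask database records:

```python
from datetime import datetime, timezone

TERMINATION_STATES = {"COMPLETED", "FAILED", "CANCELLED", "DROPPED"}


def drop_pending_tasks(tasks, reason: str):
    """Mark every task not already in a termination state as DROPPED,
    appending the reason to its comments log.  Illustrative sketch."""
    for task in tasks:
        if task["status"] not in TERMINATION_STATES:
            stamp = datetime.now(timezone.utc).isoformat()
            task["comments"] = (task.get("comments") or "") + \
                f"\n[{stamp}] DROPPED: {reason} (was: {task['status']})"
            task["status"] = "DROPPED"
    return tasks
```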

Task querying

The getTaskInfo() API, querying the status of one task, is available to the user who caused the task to spawn, the PRODUCT_ADMINs of the product associated with the task (if any), and SUPERUSERs.

The getTasks() API, which queries multiple tasks based on a filter set, is available only to PRODUCT_ADMINs (results restricted to the products they administer) and SUPERUSERs (unrestricted).
Just about any information available in the database about a task can be queried.

--machine-id

Unfortunately, servers don't always terminate gracefully (cue the aforementioned SIGKILL, but also the container, VM, or the host machine could simply die during execution, in ways the server is not able to handle). Because tasks are not shared across server processes, and there are crucial bits of information in the now dead process's memory which would have been needed to execute the task, a server later restarting in place of a previously dead one should be able to identify which tasks its "predecessor" left behind without clean-up.

This is achieved by storing the running computer's identifier, configurable via CodeChecker server --machine-id, as an additional piece of information for each task. By default, the machine ID is constructed from gethostname():portnumber, e.g., cc-server:80.
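The defaulting behaviour can be sketched like this; default_machine_id() is a hypothetical helper illustrating the rule, not the server's actual code:

```python
import socket
from typing import Optional


def default_machine_id(port: int, override: Optional[str] = None) -> str:
    """An explicit --machine-id wins; otherwise the identifier is
    built as 'gethostname():portnumber'.  Illustrative sketch."""
    return override or f"{socket.gethostname()}:{port}"
```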

In containerised environments, relying on gethostname() may not be entirely stable!
For example, Docker exposes the first 12 characters of the container's unique hash as the "hostname" inside the container. If the container is started with --restart always or --restart unless-stopped, this is fine; however, more advanced systems, such as Docker Swarm, will create a new container if the old one died (!), resulting in a new value of gethostname().

In such environments, service administrators must take additional care and configure their instances by setting --machine-id for subsequent executions of the "same" server accordingly. If a server with machine ID M starts up (usually after a container or "system" restart), it will set every task not in any "termination state" and associated with machine ID M to the DROPPED status (with an appropriately formatted comment accompanying), signifying that the previous instance "dropped" these tasks, but had no chance of recording this fact.

@bruntib bruntib added this to the release 6.27.0 milestone Jun 16, 2025
@bruntib bruntib requested review from dkrupp and vodorok as code owners June 16, 2025 08:41
@bruntib bruntib force-pushed the async_store_1 branch 2 times, most recently from 6b01c3f to b27771e on June 19, 2025 13:37
whisperity and others added 2 commits August 7, 2025 10:35
This patch implements the whole support ecosystem for server-side
background tasks, in order to help lessen the load (and blocking) of API
handlers in the web-server for long-running operations.

A **Task** is represented by two things in strict co-existence: a
lightweight, `pickle`-able implementation in the server's code (a
subclass of `AbstractTask`) and a corresponding `BackgroundTask`
database entity, which resides in the "configuration" database (shared
across all products).
A Task is created by API request handlers and then the user is
instructed to retain the `TaskToken`: the task's unique identifier.
Following this, the server dispatches execution of the object into a
background worker process, and keeps status synchronisation via the
database.
Even in a service cluster deployment, load balancing will not interfere
with users' ability to query a task's status.

While normal users can only query the status of a single task (which is
usually done automatically by client code, not by the user manually
executing something), product administrators, and especially server
administrators, have the ability to query an arbitrary set of tasks
using the available filters, with a dedicated API function
(`getTasks()`) for this purpose.

Tasks can be cancelled only by `SUPERUSER`s, at which point a special
binary flag is set in the status record.
However, to prevent complicating inter-process communication,
cancellation is supposed to be implemented by `AbstractTask` subclasses
in a co-operative way.
The execution of tasks in a process and a `Task`'s ability to
"communicate" with its execution environment is achieved through the new
`TaskManager` instance, which is created for every process of a server's
deployment.

Unfortunately, tasks can die gracelessly if the server is terminated
(either internally, or even externally).
For this reason, the `DROPPED` status will indicate that the server has
terminated prior to, or during a task's execution, and it was unable to
produce results.
The server was refactored significantly around the handling of subprocesses
in order to support various server shutdown scenarios.

Servers will start `background_worker_processes` number of task handling
subprocesses, which are distinct from the already existing "API
handling" subprocesses.
By default, if unconfigured, `background_worker_processes` is equal to
`worker_processes` (the number of API processes to spawn), which is
equal to `$(nproc)` (CPU count in the system).

This patch includes a `TestingDummyTask` demonstrative subclass of
`AbstractTask` which counts up to an input number of seconds, and each
second it gracefully checks whether it is being killed.
The corresponding testing API endpoint, `createDummyTask()` can specify
whether the task should simulate a failing status.
This endpoint can only be used from, and is used extensively in, the
project's unit tests.

This patch does not include "nice" or "ergonomic" facilities for admins
to manage the tasks, and so far, only the server-side of the
corresponding API calls are supported.
The test files contain hard-coded sleep() operations. These are lowered.
@bruntib bruntib merged commit 218735d into Ericsson:master Aug 12, 2025
7 of 9 checks passed
@bruntib bruntib deleted the async_store_1 branch August 12, 2025 20:35