feat(server): Asynchronous server-side background task execution #4603
Merged — bruntib merged 2 commits into Ericsson:master on Aug 12, 2025
gulyasgergely902 approved these changes on Jun 26, 2025
This patch implements the whole support ecosystem for server-side background tasks, in order to lessen the load on (and blocking of) API handlers in the web-server for long-running operations.

A **Task** is represented by two things in strict co-existence: a lightweight, `pickle`-able implementation in the server's code (a subclass of `AbstractTask`) and a corresponding `BackgroundTask` database entity, which resides in the "configuration" database (shared across all products). A Task is created by API request handlers, and the user is instructed to retain the `TaskToken`: the task's unique identifier. Subsequently, the server dispatches execution of the object to a background worker process and keeps the status synchronised via the database. Even in a service cluster deployment, load balancing will not interfere with users' ability to query a task's status.

While normal users can only query the status of a single task (which is usually done automatically by client code, rather than by the user manually executing something), product administrators, and especially server administrators, have the ability to query an arbitrary set of tasks using the available filters, with a dedicated API function (`getTasks()`) for this purpose.

Tasks can be cancelled only by `SUPERUSER`s, at which point a special binary flag is set in the status record. However, to avoid complicating inter-process communication, cancellation is expected to be implemented by `AbstractTask` subclasses in a co-operative way. The execution of tasks in a process, and a `Task`'s ability to "communicate" with its execution environment, is achieved through the new `TaskManager` instance, which is created for every process of a server's deployment.

Unfortunately, tasks can die gracelessly if the server is terminated (either internally, or even externally). For this reason, the `DROPPED` status indicates that the server terminated prior to, or during, a task's execution and was unable to produce results. The server was refactored significantly around the handling of subprocesses in order to support various server shutdown scenarios. Servers will start `background_worker_processes` number of task-handling subprocesses, which are distinct from the already existing "API handling" subprocesses. By default, if unconfigured, `background_worker_processes` is equal to `worker_processes` (the number of API processes to spawn), which defaults to `$(nproc)` (the CPU count of the system).

This patch includes a `TestingDummyTask` demonstrative subclass of `AbstractTask`, which counts up to an input number of seconds, and each second gracefully checks whether it is being killed. The corresponding testing API endpoint, `createDummyTask()`, can specify whether the task should simulate a failing status. This endpoint can only be used from, and is used extensively in, the project's unit tests. This patch does not include "nice" or "ergonomic" facilities for admins to manage tasks; so far, only the server side of the corresponding API calls is supported.
The test files contain hard-coded `sleep()` operations. These are lowered.
> [!IMPORTANT]
> This is patch 1 of the Asynchronous Store Protocol (#3672).
This patch was originally authored by @whisperity in #4317.
This patch implements the whole support ecosystem for server-side background tasks, in order to help lessen the load, occupancy, and potentially detrimental blocking behaviour of API handler processes in the web-server when they are occupied by long-running operations, such as `massStoreRun()`.

### API workers vs. background/task workers
In this patch, the processing model of the `CodeChecker server` is extended with the concept of background workers. These special processes deal with consuming tasks off of the server's queue and executing them. The previous clone processes of a server are now and henceforth termed API workers, and they do the same as before: respond to Thrift RPC calls.

The server will start `background_worker_processes` number of background processes, which is, by default, the same as `worker_processes`, the number of API handlers. (Which, in turn, defaults to `$(nproc)`.)

### Self-healing against dead children
This patch also implements a fix that resulted from reviewing and reworking both the threading and the signalling model of a `CodeChecker server` process tree. Previously, exceptions escaping from an API worker process (or the process getting OOM-killed, etc.) would cause the initially large number of workers to die off one by one, leaving the server with a monotonically decreasing number of workers to dispatch RPC requests to. In this patch, the server handles `SIGCHLD` and respawns dead child processes (both API and task workers) when given the opportunity.

### Graceful server shutdown
The refactoring of the server's life cycle and signal handling logic also includes changes which make the server process(es) more graceful in their termination, which is needed to accurately clean up pending tasks (see later).
In case a brutal and unclean server shutdown is needed, following this patch, the `SIGINT` or `SIGTERM` signal must be sent twice to the server. (Alternatively, just use `SIGKILL`! 😉)

### Task management
A Task is represented by two things in strict co-existence: a lightweight, `pickle`-able implementation of the task's execution logic (subclassed from `AbstractTask` and overriding `_implementation()`); and a corresponding `BackgroundTask` entity in the database. These records reside in the CONFIGURATION database, and the task information is server- or service-wide, shared across all products.

The database-synchronised record contains several pieces of human-readable metadata about the task, such as a one-line `summary` field and the larger, log-like `comments` column, with several timestamps recorded for the crucial moments during the task's execution.

The most important flag is a task's status, which can be: `ALLOCATED`, `ENQUEUED`, `RUNNING`, `COMPLETED`, `FAILED`, `CANCELLED`, and `DROPPED`.

#### Normal path
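As a sketch, the statuses and the normal-path transitions that the life cycle below walks through could be modelled as follows. This is a hypothetical illustration only; the real status handling lives in the server's code and its database layer:

```python
from enum import Enum


class TaskStatus(Enum):
    """The seven task statuses described in this patch."""
    ALLOCATED = "allocated"
    ENQUEUED = "enqueued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    DROPPED = "dropped"


# Normal-path transitions only. CANCELLED and DROPPED are reached through
# the "abnormal paths" described later, from any non-terminated status.
NORMAL_TRANSITIONS = {
    TaskStatus.ALLOCATED: {TaskStatus.ENQUEUED},
    TaskStatus.ENQUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED},
}

# COMPLETED and FAILED are the "normal termination states";
# CANCELLED and DROPPED are the "abnormal termination states".
TERMINATION_STATES = {TaskStatus.COMPLETED, TaskStatus.FAILED,
                      TaskStatus.CANCELLED, TaskStatus.DROPPED}
```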
The life cycle of a task is as follows:

1. A `Task` is created as the result of an API request (executed in the API worker process).
   a. (Currently, for the sake of both testing and demonstration, only the `createDummyTask()` API, and only in a test context, can create a task.)
2. The task is `ALLOCATED`, together with the `BackgroundTask` database entity.
3. The `Task` object is put into a shared, synchronised queue within the server's memory space. At this point, the task is considered `ENQUEUED`.
   a. `AbstractTask` subclasses MUST be `pickle`-able and reasonably small.
   b. The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
4. The user may use the `getTaskInfo()` API (executed in the context of any API worker process, synchronised over the database) to query whether the task was completed, if they wish to receive this information.
5. A background worker takes the `Task` object from the queue. After some bookkeeping, the task will be `RUNNING`.
6. `MyTaskClass::_implementation()` is called, which executes the task's primary business logic.
7. If `_implementation()` returns without a failure, the task is considered successfully `COMPLETED`. Any exception escaping from the method sets the task to `FAILED`, and exception information is logged into the `BackgroundTask.comments` column of the database. (Together, these two are the "normal termination states".)
8. The background worker goes back to waiting until a new `Task` is available from the queue.

### Abnormal path 1: admin cancellation
At any point following the `ALLOCATED` status, but most likely in the `ENQUEUED` and `RUNNING` statuses, a `SUPERUSER` may issue a `cancelTask()` order.

This sets `BackgroundTask.cancel_flag`, and the task is expected (although not required!) to poll its own `should_cancel()` status internally at checkpoints, and terminate gracefully in response to this request. This is done by `_implementation()` exiting by raising a `TaskCancelHonoured` exception. (If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.)

Tasks cancelled gracefully will have the `CANCELLED` status.

For example, a background task that performs an action over a set of input files should generally be implemented like this:
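A minimal, self-contained sketch of such a co-operative task follows. The exact shape of `_implementation()` (here assumed to receive the executing `TaskManager` so it can poll `should_cancel()`) and the class name `ProcessFilesTask` are assumptions; minimal stand-ins for the library classes are included so the sketch runs on its own:

```python
# Stand-ins for the server's task library, so this sketch is self-contained.
class TaskCancelHonoured(Exception):
    """Raised by a task to acknowledge a cancellation request gracefully."""


class AbstractTask:
    def _implementation(self, task_manager):
        raise NotImplementedError()


class ProcessFilesTask(AbstractTask):
    """Hypothetical task performing an action over a set of input files."""

    def __init__(self, files):
        self._files = list(files)

    def _implementation(self, task_manager):
        for file in self._files:
            # Checkpoint: poll the cancel flag once per unit of work.
            if task_manager.should_cancel(self):
                # Honour a SUPERUSER's cancelTask() request; the server
                # will record the task with the CANCELLED status.
                raise TaskCancelHonoured()
            self._process_one(file)

    def _process_one(self, file):
        pass  # The task's actual business logic would go here.
```

The key design point, as described above, is that cancellation is co-operative: nothing forcibly kills the task, it simply exits at the next checkpoint by raising the dedicated exception.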
### Abnormal path 2: server shutdown
Alternatively, at any point in this life cycle, the server might receive the command to terminate itself (via the kill signals `SIGINT` or `SIGTERM`; alternatively caused by `CodeChecker server --stop`). Following the termination of the API workers, the background workers will also shut down one by one.

At this point, the default behaviour is to raise a special cancel event which tasks currently `RUNNING` may still gracefully honour, as if it were a `SUPERUSER`'s single-task cancel request. All other tasks that have not started executing yet and are in the `ALLOCATED` or `ENQUEUED` status will never start.

All tasks not in a normal termination state will be set to the `DROPPED` status, with the `comments` field containing a log about the specifics of which state the task was dropped in, and why. (Together, `CANCELLED` and `DROPPED` are the "abnormal termination states", indicating that the task terminated due to some external influence.)

### Task querying
The `getTaskInfo()` API, which queries the status of a single task, is available to the user who caused the task to spawn, the `PRODUCT_ADMIN`s of the product associated with the task (if any), and `SUPERUSER`s.

The `getTasks()` API, which queries multiple tasks based on a filter set, is available only to `PRODUCT_ADMIN`s (with results restricted to the products they are admins of) and `SUPERUSER`s (unrestricted). Just about anything that is available as information in the database about a task can be queried.
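As an illustration, the access rules above could be modelled like this. This is a hypothetical sketch only (the `User` and `Task` stand-ins and all attribute names are invented here); the real checks are enforced by the server's permission system:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class User:
    """Hypothetical stand-in for the server's session/user object."""
    name: str
    is_superuser: bool = False
    admin_of_products: set = field(default_factory=set)


@dataclass
class Task:
    """Hypothetical stand-in for a BackgroundTask record."""
    initiator_name: str
    product_id: Optional[int] = None


def may_get_task_info(user: User, task: Task) -> bool:
    """Mirror of the getTaskInfo() access rules described above."""
    if user.is_superuser:
        return True
    if task.product_id is not None and task.product_id in user.admin_of_products:
        return True
    return user.name == task.initiator_name


def get_tasks_scope(user: User):
    """Mirror of getTasks(): None means 'unrestricted' (SUPERUSER)."""
    if user.is_superuser:
        return None
    if user.admin_of_products:
        return set(user.admin_of_products)
    raise PermissionError("getTasks() requires PRODUCT_ADMIN or SUPERUSER.")
```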
### `--machine-id`

Unfortunately, servers do not always terminate gracefully (cue the aforementioned `SIGKILL`, but the container, VM, or the host machine could also simply die during execution, in ways the server is not able to handle). Because tasks are not shared across server processes, and there are crucial bits of information in the now-dead process's memory which would have been needed to execute the task, a server later restarting in place of a previously dead one should be able to identify which tasks its "predecessor" left behind without clean-up.

This is achieved by storing the running computer's identifier, configurable via `CodeChecker server --machine-id`, as an additional piece of information for each task. By default, the machine ID is constructed from the `gethostname():port` number, e.g., `cc-server:80`.

In containerised environments, relying on `gethostname()` may not be entirely stable! For example, Docker exposes the first 12 digits of the container's unique hash as the "hostname" inside the container. If the container is started with `--restart always` or `--restart unless-stopped`, this is fine; however, more advanced systems, such as Docker Swarm, will create a new container in case the old one died (!), resulting in a new value of `gethostname()`.

In such environments, service administrators must take additional care and configure their instances by setting `--machine-id` for subsequent executions of the "same" server accordingly. If a server with machine ID `M` starts up (usually after a container or "system" restart), it will set every task not in a "termination state" and associated with machine ID `M` to the `DROPPED` status (with an appropriately formatted accompanying comment), signifying that the previous instance "dropped" these tasks but had no chance of recording this fact.
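The default machine-ID construction described above can be illustrated with a short sketch. The helper name is hypothetical; only the `gethostname():port` shape and the `cc-server:80` example come from the description:

```python
import socket


def default_machine_id(port, hostname=None):
    """Hypothetical helper building the default 'gethostname():port' ID.

    An explicit hostname stands in for the --machine-id override; by
    default, the (potentially unstable, see above) gethostname() is used.
    """
    host = hostname if hostname is not None else socket.gethostname()
    return f"{host}:{port}"
```

For example, `default_machine_id(80, hostname="cc-server")` yields `"cc-server:80"`, matching the example above.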