feat: massStoreRunAsynchronous() #4326
Conversation
This patch implements the whole support ecosystem for server-side background tasks, in order to lessen the load (and blocking) of API handlers in the web-server for long-running operations.

A **Task** is represented by two things in strict co-existence: a lightweight, `pickle`-able implementation in the server's code (a subclass of `AbstractTask`) and a corresponding `BackgroundTask` database entity, which resides in the "configuration" database (shared across all products). A Task is created by API request handlers, and the user is then instructed to retain the `TaskToken`: the task's unique identifier. Following this, the server dispatches execution of the object to a background worker process and keeps the status synchronised via the database. Even in a service-cluster deployment, load balancing will not interfere with users' ability to query a task's status.

While normal users can only query the status of a single task (which is usually done automatically by client code, rather than by the user manually executing something), product administrators, and especially server administrators, have the ability to query an arbitrary set of tasks using the available filters, with a dedicated API function (`getTasks()`) for this purpose.

Tasks can be cancelled only by `SUPERUSER`s, at which point a special binary flag is set in the status record. However, to avoid complicating inter-process communication, cancellation is expected to be implemented by `AbstractTask` subclasses in a co-operative way. The execution of tasks in a process, and a `Task`'s ability to "communicate" with its execution environment, is achieved through the new `TaskManager` instance, which is created for every process of a server's deployment.

Unfortunately, tasks can die gracelessly if the server is terminated (either internally, or even externally).
For this reason, the `DROPPED` status will indicate that the server terminated prior to, or during, a task's execution, and was unable to produce results. The server was refactored significantly around the handling of subprocesses in order to support various server shutdown scenarios.

Servers will start `background_worker_processes` number of task-handling subprocesses, which are distinct from the already existing "API handling" subprocesses. By default, if unconfigured, `background_worker_processes` is equal to `worker_processes` (the number of API processes to spawn), which in turn defaults to `$(nproc)` (the CPU count of the system).

This patch includes a demonstrative `TestingDummyTask` subclass of `AbstractTask`, which counts up to an input number of seconds, and each second gracefully checks whether it is being killed. The corresponding testing API endpoint, `createDummyTask()`, can specify whether the task should simulate a failing status. This endpoint can only be used from, and is used extensively by, the project's unit tests.

This patch does not include "nice" or "ergonomic" facilities for admins to manage the tasks; so far, only the server side of the corresponding API calls is supported.
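The co-operative cancellation contract described above can be sketched as follows. This is a minimal illustration, not the project's actual class: apart from the `AbstractTask` name, all methods and the `DummyCountingTask` class are assumptions made for the example.

```python
import time


class AbstractTask:
    """Illustrative stand-in for the server's task base class."""

    def __init__(self) -> None:
        self._cancel_requested = False

    def request_cancellation(self) -> None:
        # In the real server this flag would be driven by the TaskManager
        # after a SUPERUSER cancels the task; here it is set directly.
        self._cancel_requested = True

    @property
    def should_cancel(self) -> bool:
        return self._cancel_requested


class DummyCountingTask(AbstractTask):
    """Counts up to 'seconds', checking once per tick whether it was
    cancelled -- the same co-operative pattern as TestingDummyTask."""

    def __init__(self, seconds: int) -> None:
        super().__init__()
        self.seconds = seconds
        self.progress = 0

    def execute(self) -> str:
        for _ in range(self.seconds):
            if self.should_cancel:
                return "CANCELLED"   # graceful, co-operative exit
            self.progress += 1
            time.sleep(0)            # placeholder for the 1-second sleep
        return "COMPLETED"


task = DummyCountingTask(seconds=3)
print(task.execute())                # → COMPLETED

task2 = DummyCountingTask(seconds=3)
task2.request_cancellation()
print(task2.execute())               # → CANCELLED
```

The point of the pattern is that the flag is only ever *checked* by the task itself at safe points, so no signal handling or hard process kill is needed.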
This patch extends `CodeChecker cmd` with a new sub-command,
`serverside-tasks`, which lets users and administrators query the
status of running server-side tasks.
By default, the CLI queries the information of the task(s) specified by
their token(s) in the `--token` argument from the server using
`getTaskInfo(token)`, and shows this information in either verbose
"plain text" (available if precisely **one** task was specified), "table"
or JSON formats.
In addition to `--token`, it also supports 19 more parameters, each of
which correspond to a filter option in the `TaskFilter` API type.
If any filters in addition to `--token` are specified, it will call
`getTasks(filter)` instead.
This mode is only available to administrators.
The resulting, more detailed information structs are printed in "table"
or JSON formats.
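The dispatch rule described above (only `--token` → per-task `getTaskInfo()`; any extra filter → the admin-only bulk `getTasks()`) can be sketched like this. The helper function, its arguments, and the `machine_name` filter key are illustrative assumptions, not the actual CLI code:

```python
def choose_query(tokens: list, extra_filters: dict) -> tuple:
    """Decide which API call the 'serverside-tasks' sub-command would use
    for the given --token values and additional filter options."""
    if extra_filters:
        # Any filter beyond --token switches to the admin-only bulk query,
        # which returns the more detailed information structs.
        return ("getTasks", {"tokens": tokens, **extra_filters})
    # Default: per-token lookup, available to the task's own initiator.
    return ("getTaskInfo", {"tokens": tokens})


call, _ = choose_query(["abc123"], {})
print(call)                                      # → getTaskInfo
call, _ = choose_query(["abc123"], {"machine_name": "ci-runner-1"})
print(call)                                      # → getTasks
```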
Apart from querying the current status, two additional flags are
available, irrespective of which query method is used to obtain a list
of "matching tasks":
* `--kill` will call `cancelTask(token)` for each task.
* `--await` will block execution until the specified task(s) terminate
(in one way or another).
`--await` is implemented by calling the new **`await_task_termination`**
library function, which is implemented with the goal of being reusable
by other clients later.
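A reusable awaiting helper of this kind might look roughly like the following. This is a hedged sketch, not the actual `await_task_termination` implementation: the parameter names, the backoff strategy, and the status strings other than `DROPPED` are assumptions (the real function polls the server via the task API rather than a callback).

```python
import time
from typing import Callable

# Assumed terminal states; only DROPPED is confirmed by the PR text.
TERMINATED_STATES = {"COMPLETED", "FAILED", "CANCELLED", "DROPPED"}


def await_task_termination(
    poll_status: Callable[[], str],
    min_sleep: float = 1.0,
    max_sleep: float = 30.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll 'poll_status' until the task reaches a terminal state,
    sleeping between polls with a clamped exponential backoff."""
    delay = min_sleep
    while True:
        status = poll_status()
        if status in TERMINATED_STATES:
            return status
        sleep(delay)
        delay = min(delay * backoff, max_sleep)


# Usage: a fake "server" that reports RUNNING twice, then COMPLETED.
answers = iter(["RUNNING", "RUNNING", "COMPLETED"])
result = await_task_termination(lambda: next(answers), sleep=lambda _: None)
print(result)  # → COMPLETED
```

Injecting the `sleep` callable keeps the helper testable and lets callers tune the polling rhythm, which matters for the performance discussion later in this thread.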
Separate the previously blocking execution of `massStoreRun()`, which was done in the context of the "API handler process", into a "foreground" and a "background" part, exploiting the previously implemented background task library support.

The foreground part remains executed in the context of the API handler process, and deals with receiving and unpacking the to-be-stored data, saving configuration and checking constraints that are cheap to check. The foreground part can report issues synchronously to the client.

Everything else that was part of the previous `massStoreRun()` pipeline, as implemented by the `mass_store_run.MassStoreRun` class, becomes a background task, such as the parsing of the uploaded reports and the storing of data to the database. This background task, implemented using the new library, executes in a separate background worker process, and cannot communicate directly with the user. Errors are logged to the `comments` fields.

The `massStoreRun()` API handler will continue to work as previously, and block while waiting for the background task to terminate. In case of an error, it synchronously reports a `RequestFailed` exception, passing the `comments` field (into which the background process had written the exception details) to the client.

Due to the inability of most of the exceptions previously raised in `MassStoreRun` to "escape" as `RequestFailed`s, some parts of the API had been deprecated and removed. Namely, `ErrorCode.SOURCE_FILE` and `ErrorCode.REPORT_FORMAT` are no longer sent over the API. This does not break existing behaviour and does not cause an incompatibility with clients: in cases where the request exceptions were raised earlier, now a different type of exception is raised, but the error message still precisely explains the problem as it did previously.
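The compatibility-mode handler pattern described above — cheap synchronous checks in the API process, the heavy storing in a background worker, the handler blocking on the status record and re-raising background errors from `comments` — can be sketched with a thread standing in for the worker process. All names and statuses here are illustrative assumptions, not CodeChecker's actual code:

```python
import queue
import threading
import time


class FakeTaskManager:
    """Stand-in for the server's TaskManager: a daemon thread plays the
    role of the background worker process."""

    def __init__(self) -> None:
        self._status = {}
        self._comments = {}
        self._jobs = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, token: str, job) -> None:
        self._status[token] = "ENQUEUED"
        self._jobs.put((token, job))

    def _worker(self) -> None:
        while True:
            token, job = self._jobs.get()
            self._status[token] = "RUNNING"
            try:
                job()
                self._status[token] = "COMPLETED"
            except Exception as exc:
                self._comments[token] = str(exc)   # errors go to 'comments'
                self._status[token] = "FAILED"

    def status(self, token: str) -> str:
        return self._status[token]

    def comments(self, token: str) -> str:
        return self._comments.get(token, "")


def mass_store_run_compat(manager: FakeTaskManager, data: bytes) -> None:
    """Old-style blocking massStoreRun(): foreground checks, background
    store, then wait and re-raise background errors synchronously."""
    if not data:                         # cheap foreground constraint check
        raise ValueError("empty store payload")
    token = "task-1"                     # in reality: a generated TaskToken
    manager.submit(token, lambda: None)  # heavy storing would happen here
    while manager.status(token) not in ("COMPLETED", "FAILED"):
        time.sleep(0.01)
    if manager.status(token) == "FAILED":
        raise RuntimeError(manager.comments(token))


mgr = FakeTaskManager()
mass_store_run_compat(mgr, b"zip-bytes")
print(mgr.status("task-1"))  # → COMPLETED
```

The asynchronous variant simply returns `token` instead of entering the polling loop, which is what frees the API handler process for the next request.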
Even though commit d915473 introduced socket-level TCP keepalive support into the server's implementation, this was observed multiple times to not be enough to **deterministically** fix the issues with the `CodeChecker store` client hanging indefinitely when the server takes a long time processing the to-be-stored data. The underlying reasons are not entirely clear and the issue only pops up sporadically, but we did observe a few similar scenarios (such as multi-million report storage from analysing LLVM and then storing between datacentres) where it almost reliably reproduces.

The symptoms (even with a configured `keepalive`) generally include the server not becoming notified about the client's disconnect, while the client process is hung on a low-level system call `read(4, ...)`, trying to get the Thrift response of `massStoreRun()` from the HTTP socket. Even if the server finishes the storage processing "in time" and sends the Thrift reply, it never reaches the client, which means it never exits from the waiting, which means it keeps either the terminal or, worse, a CI script occupied, blocking execution.

This is the "more proper solution" foreshadowed in commit 15af7d8. Implemented the server-side logic to spawn a `MassStoreRun` task and return its token, giving the `massStoreRunAsynchronous()` API call full force. Implemented the client-side logic to use the new `task_client` module and the same logic as `CodeChecker cmd serverside-tasks --await --token TOKEN...` to poll the server for the task's completion and status.
I tested this patch and it seems to me that the storage performance is lower than with the 6.24.1 release.
On a single storage thread, storing the same set of xerces reports takes ~9s (on the client) with the 6.24.1 server, and around ~19s with this version. See the details below.
Did you observe similar behaviour? What could be the reason?
Otherwise the queue handling and the server seemed stable.
Reproduction:
- 6.24.1 client, 6.24.1 server:

```
time CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8002/Default -f -n testing
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files        | 700
Number of analyzer reports                       | 1322
Number of source files                           | 0
Number of source files with source code comments | 0
Number of blame information files                | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:27] - Compressing report zip file...
[INFO 2024-09-23 18:27] - Compressing report zip file done (8.3MiB / 238.5KiB).
[INFO 2024-09-23 18:27] - Storing results to the server...
[INFO 2024-09-23 18:28] - Storage finished successfully.

real    0m9.046s
user    0m9.662s
sys     0m1.812s
```
- 6.24.1 client, this patch's server:

```
time CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -f -n testing
[INFO 2024-09-23 18:29] - Storing analysis results for run 'testing'
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files        | 700
Number of analyzer reports                       | 1322
Number of source files                           | 0
Number of source files with source code comments | 0
Number of blame information files                | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:29] - Compressing report zip file...
[INFO 2024-09-23 18:29] - Compressing report zip file done (8.3MiB / 238.8KiB).
[INFO 2024-09-23 18:29] - Storing results to the server...
[INFO 2024-09-23 18:29] - Storage finished successfully.

real    0m19.856s
user    0m9.905s
sys     0m1.585s
```
- this patch's client, this patch's server:

```
time CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -n testing
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files        | 700
Number of analyzer reports                       | 1322
Number of source files                           | 0
Number of source files with source code comments | 0
Number of blame information files                | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:30] - Compressing report zip file...
[INFO 2024-09-23 18:30] - Compressing report zip file done (8.3 MiB / 238.8 KiB).
[INFO 2024-09-23 18:30] - Storing results to the server ...
[INFO 2024-09-23 18:30] - Reports submitted to the server for processing.
[INFO 2024-09-23 18:30] - Checking local passwords or tokens in /home/ednikru/.codechecker.passwords.json
[INFO 2024-09-23 18:30] - Checking for local valid sessions.
[INFO 2024-09-23 18:30] - Storing the reports finished successfully.

real    0m15.813s
user    0m9.852s
sys     0m1.668s
```
------------------------
I also tested this with the performance tester tool
`python3 /workspace/test/codechecker/scripts/test/run_server_performance_test.py /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -o /tmp/codechecker_performance_625.log -u 10 -r 1`
```
diff /tmp/codechecker_performance_6241.log /tmp/codechecker_performance_625.log
2,11c2,11
< 2,15.245582,6.167334,3.281972,0.927049,0.845769
< 5,15.377179,5.008921,3.431427,1.165954,0.953085
< 9,15.298618,6.457655,3.51786,1.46451,3.273755
< 4,18.013311,5.154086,4.70034,0.84808,2.343666
< 8,16.559763,5.145486,3.394403,1.334375,0.92755
< 3,19.000909,5.082276,3.441946,1.04042,0.845659
< 7,20.805081,4.975545,3.223493,0.853344,1.445693
< 10,24.562988,5.192637,3.201599,0.960122,0.814222
< 1,10.985903,5.026051,3.286415,1.072666,0.704679
< 6,11.313735,4.791362,3.351084,0.939348,0.935826
---
> 7,29.80368,5.451236,4.618323,2.434601,2.163928
> 8,40.068657,6.56116,4.520448,2.319387,1.735899
> 10,40.240583,5.458642,3.72705,1.445456,1.494371
> 4,41.033482,5.581793,3.885928,1.479313,1.115965
> 2,45.766335,5.268967,3.681706,2.609846,1.123228
> 5,55.890535,5.370175,3.754132,1.426652,1.461334
> 1,45.691527,5.271719,3.534699,1.287154,1.023857
> 3,60.307092,5.332355,3.763976,1.282365,0.998435
> 6,55.516671,5.226814,3.734352,1.628265,1.058211
```
The client is sleeping, as expected (as evident from […]).
Not to a measurable extent, but I was not doing extremely strict measurements either.
It's a scheduling artefact. Consider that the original storage is about 10 seconds long. Now, let's suppose the unzipping takes virtually 0 time. What will happen is that the client sends the mSRA() request, gets the response, then sends a poll. This poll will naturally say […].

What could also happen (but likely did not happen here; also, please try with […]): […].

The good news is that the minimum and maximum threshold for the polling sleep value can be tinkered with, both at the library level, and at the call site in […].

This explains both the observations of the 15-sec wall clock in the "both server and client patched" scenario, and the "basically 20-sec" wall clock in the "server patched, client not" scenario.

N.B., using a "legacy" client and calling mSR(), the sleeping is done in the server-side API handler, and all of this fancy "variable resistor" behaviour of the sleep-time calculation is skipped: there is a 5-sec sleep rhythm, but that's (as usual on non-realtime kernels) not a hard accurate thing. So the server is, once again, sleeping in 5-sec increments, basically mirroring what the client is close to doing:

```python
while True:
    time.sleep(5)
    t = self._task_manager.get_task_record(token)
```
Asymptotically, min = 1 sec could also be viable, but in large deployments that would result in a lot of flooding early on. The optimisation question is: do we expect small or large workloads? For large workloads (where the server-side processing takes minutes, maybe tens of minutes), these changes will not be so impactful. It's only this impactful at, let's say, academically trivial workloads.
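The "variable resistor" trade-off discussed above can be illustrated with a tiny clamped exponential backoff schedule. The concrete numbers and the helper function are made up for the illustration; they are not the client's actual defaults:

```python
def poll_delays(min_sleep, max_sleep, factor, polls):
    """Successive sleep lengths for a clamped exponential backoff:
    start at min_sleep, multiply by factor, never exceed max_sleep."""
    delays, d = [], min_sleep
    for _ in range(polls):
        delays.append(d)
        d = min(d * factor, max_sleep)
    return delays


# With min=1s the early polls are frequent (potential request flooding
# in large deployments); with min=5s even a trivially short task pays
# roughly a whole extra sleep interval of wall-clock time.
print(poll_delays(1, 30, 2, 6))  # → [1, 2, 4, 8, 16, 30]
print(poll_delays(5, 30, 2, 4))  # → [5, 10, 20, 30]
```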
I tested this implementation further with larger storages: ~156000 reports. The storage takes ~9 minutes on the 6.24.1 server and 10 minutes 46 seconds on the new server. So I approve the PR. See the detailed results below.
dkrupp left a comment:
I think the concept of executing these long-lasting storages as background jobs is very good, as it makes it clear to administrators which storage jobs are executing in parallel, gives a good overview of the storage times, and fixes the long-resetting TCP connections problem too.
I think it would be a valuable future addition to be able to query the latest queuing time, so that operators can detect whether the storage queue is very long and perhaps growing.
LGTM, nice work!
Important
⛔ Blocked by #4318.
This is patch 3 of the Asynchronous Store Protocol (#3672).
Closes #1306. Closes #2417. Closes #3671. Closes #3672.
### Motivation

Even though #3167 introduced socket-level TCP keepalive support into the server's implementation, this was observed multiple times to not be enough to deterministically fix the issues with the `CodeChecker store` client hanging indefinitely (#3672) when the server takes a long time processing the to-be-stored data. The underlying reasons are not entirely clear and the issue only pops up sporadically, but we did observe a few similar scenarios (such as multi-million report storage from analysing LLVM and then storing between data centres) where it almost reliably reproduces.

The symptoms (even with a configured `keepalive`) generally include the server not becoming notified about the client's disconnect, while the client process is hung on a low-level system call `read(4, ...)`, trying to get the Thrift response of `massStoreRun()` from the HTTP socket. Even if the server finishes the storage processing "in time" and sends the Thrift reply, it never reaches the client, which means it never exits from the waiting, which means it keeps either the terminal or, worse, a CI script occupied, blocking execution.

### Solution, part 1. Backgrounding `massStoreRun()`

To solve this, I separated the previously blocking execution of `massStoreRun()`, which was done in the context of the "API handler process", into a foreground and a background part, exploiting the background task library support implemented in #4317.

The foreground part remains executed in the context of the API handler process, and deals with receiving and unpacking the to-be-stored data, saving configuration and checking constraints that are cheap to check. The foreground part can report issues synchronously to the client.

Everything else that was part of the previous `massStoreRun()` pipeline, as implemented by the `mass_store_run.MassStoreRun` class, becomes a background task, such as the parsing of the uploaded reports and the storing of data to the database. This background task, implemented using the new library, executes in a separate background worker process, and cannot communicate directly with the user. Errors are logged to the `comments` fields.

#### Compatibility with older clients

The `massStoreRun()` API handler will continue to work as previously, and block while waiting for the background task to terminate, but even on the server side, it will use the new facilities to do so. In case of an error, it synchronously reports a `RequestFailed` exception, passing the `comments` field (into which the background process had written the exception details) to the client.

While this ensures compatibility with older clients, it is also completely "compatible" with, and does not solve, the previously seen bug!

#### Removal of special error tags from the transmission

Due to the inability of most of the exceptions previously raised in `MassStoreRun` to "escape" as `RequestFailed`s, some parts of the API had been deprecated and removed. Namely, `ErrorCode.SOURCE_FILE` and `ErrorCode.REPORT_FORMAT` are no longer sent over the API.

This does not break existing behaviour and does not cause incompatibility with clients: in cases where the request exceptions were raised earlier, now a different type of exception is raised, but the error message still precisely explains the problem as it did previously.

### Solution, part 2. Exposing `massStoreRunAsynchronous()`

This is the "more proper solution" foreshadowed in #4039.

Based on Part 1.'s (and earlier patches') achievements, I added the server-side implementation for `massStoreRunAsynchronous()`, which is almost trivial at this point: it performs the same spawning of the now-backgrounded task instance for the store processing and returns the task token via the API. At this point, the API handler process can fully return and start waiting for the next request.

As an added benefit, the memory associated with the potentially huge store ZIP (under the `b64zip` parameter of `massStoreRun()` and the `zipfile_blob` parameter of `massStoreRunAsynchronous()`) is also freed at this point, because the contents of the ZIP are already extracted to storage, freeing not just "available workers" as a resource, but "memory" as well!

The `CodeChecker store` client now uses the new API, receives the task token, and starts to poll the server (with `CodeChecker cmd serverside-tasks --await --token TOKEN...`'s infrastructure, as already established in #4318) for the status of the task. `CodeChecker store` still blocks the terminal (or the executing script) until the store reaches a termination status.

### `CodeChecker store --detach`

Another benefit of putting the entire report processing logic into the background (and removing it from the API control flow) on the server is that the client can now safely close the TCP socket and disengage from the API once the server reports that it is ready to process the data. Notably, the API communication that initiates the store process, `massStoreRunAsynchronous()`, terminates when the task token is sent back to the client. Everything afterwards happens under new communication requests established by the client.

This immediately allowed me to implement a small but powerful new feature in the client, `--detach`: if the user so wishes to NOT await the status of the server-side processing automatically, they can instruct the `CodeChecker store` client to exit and return control to the terminal/script once the server-side operation has begun.

This is extremely useful in situations where running the `store` operation is the last action to be taken and no further commands (such as `cmd results` or `cmd diff`) are executed which might rely on the "success" or "completion" or "fulfilment" of the act. Consider a CI system which only updates the reports from the `master` branch in a scheduled way. These jobs can be sped up (by virtue of returning the entire allocated CI entity to a resting state) by providing `--detach`.