This is a write-up of a technical specification for an idea that has, I am sure, been around for several years now, but was usually discussed only in private or orally.
`store`, more specifically the API call to `massStoreRun`, waits and blocks until the result of the store is returned to the client. As processing a store action takes a non-trivial amount of time on the server side (and this operation is also executed on only one thread!), returning from `massStoreRun` itself takes a non-trivial amount of time. The problem surfaces if the connection between the client and the server coughs, chokes, or otherwise misbehaves, because it is only the networking stack in the kernel that is keeping the door open for the reply to arrive. While a disappearing client is no problem from the server's side and data will not be lost, CI jobs can hang indefinitely, and scripts that expect the data to be available for a `cmd` query after `store` returns will break.
The proposal is to switch the blocking from relying on externalia like "the TCP stack" to a softer, but more local, blocking mechanism, while also turning the API itself asynchronous. This proposal is backwards compatible.
Database changes
We already have information in `RunLock` as to which runs are undergoing a store. However, this is not enough: we need to store some semi-temporary information about store "attempts" or "sessions". This could go into its own table, per product, as it needs to be kept for a time even after the run lock is released. This table would contain the run name, a unique session token/identifier, and some status flag. The identifier might be auto-incremented, or a hash of the time when the lock was initialised; it is not a "secret" resource.
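As a rough illustration of what such a per-product session table could look like, here is a minimal sketch using an in-memory SQLite database. All names (`store_sessions`, the column names, `new_session`) are hypothetical, not the actual schema, and the token is generated as a hash of the lock-initialisation time, as suggested above:

```python
import hashlib
import sqlite3
import time

# Hypothetical table for store "sessions"; kept even after the run
# lock is released, until garbage collection removes it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE store_sessions (
        token    TEXT PRIMARY KEY,  -- unique session identifier
        run_name TEXT NOT NULL,     -- the run undergoing the store
        status   TEXT NOT NULL      -- e.g. 'in-progress', 'done', 'failed'
    )
""")

def new_session(run_name: str) -> str:
    # The token is not a "secret" resource: a hash of the time when
    # the lock was initialised is enough to make it unique in practice.
    token = hashlib.sha256(f"{run_name}:{time.time()}".encode()).hexdigest()
    conn.execute("INSERT INTO store_sessions VALUES (?, ?, ?)",
                 (token, run_name, "in-progress"))
    return token

token = new_session("my-run")
status = conn.execute("SELECT status FROM store_sessions WHERE token = ?",
                      (token,)).fetchone()[0]
print(status)  # in-progress
```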
These identifiers should be garbage collected in the usual process.
CLI changes
There are no changes needed on the CLI. Optionally, the `store` command might be extended with a `--no-block` argument which makes it exit and return to the shell as soon as the server has started processing the data, in case the user does not care about when the operation finishes.
API changes
A new endpoint, hereafter referred to as `massStoreRunAsync`, shall be created. This function should return the aforementioned "store session token", or throw. The semantics should be that the function returns once the server can confirm that processing of the results can reasonably continue (cheap early checks, such as permissions and that the data is validly encoded, should be performed before unpacking it).
To query whether the store operation has succeeded or not, a new function should be added, which returns status information (from the database) about the store. The information needed here is malleable, but at least a boolean: "Is the operation still in progress?". (Consuming a successful result might want to remove the related information from the database, to ease garbage collection times at startup.)
Implementation changes
The `store` command should, once it has received the token from the server, close the connection and use the token to poll the server every once in a while for the status of the operation. Deciding a good interval here could be tough, but trivial choices like "every 10 sec" or "every 30 sec" should be fine for a prototype. As far as I can gather, we already perform a counting of reports during store (which is weird!), but if this information is available, the initial wait time and the re-query interval could be estimated from it.
In between queries, the `store` binary should sleep using the OS primitives for sleeping a process, without having to rely on the network stack. Every query is its own connection, like `cmd ...`.
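The client-side loop described above can be sketched as follows; `wait_for_store` and its parameters are hypothetical names, and the poller passed in would, in the real implementation, open a fresh connection per call:

```python
import time

def wait_for_store(poll_status, interval_sec=10.0, timeout_sec=None):
    """Block locally until the store finishes, sleeping between polls
    with the OS sleep primitive instead of holding a TCP connection open.

    poll_status() returns True while the operation is still in progress;
    each invocation stands for one short-lived connection to the server.
    """
    waited = 0.0
    while poll_status():
        if timeout_sec is not None and waited >= timeout_sec:
            raise TimeoutError("store did not finish in time")
        time.sleep(interval_sec)
        waited += interval_sec

# Usage with a stand-in poller that reports "done" after three polls:
remaining = {"polls": 3}
def fake_poll():
    remaining["polls"] -= 1
    return remaining["polls"] > 0

wait_for_store(fake_poll, interval_sec=0.0)
print(remaining["polls"])  # 0
```

With a report count available, the initial wait and `interval_sec` could be scaled to the expected processing time instead of a fixed constant.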
Obsoletes #4039.