feat(local): add fsync to LocalFileSystem for durability #643
Conversation
Call sync_all() on written files and fsync parent directories at all write-path boundaries (put, copy, rename, multipart complete) so that a successful return guarantees data is durable on disk, matching the implicit contract of cloud object stores.
I think this is technically what we should do. I wonder, however, whether we want to offer some kind of opt-out mechanism (in the vein of "safety + sanity by default", it shouldn't be opt-in). WDYT @tustvold?
I think calling fsync on the files makes a lot of sense, and ensures that we maintain atomicity (or at least do our best to). Calling fsync on directories also makes sense, but I think this PR doesn't go far enough: it needs to ensure that any recursively created directories are also fsynced.

I think it makes sense for this to be enabled by default (in a future breaking release), but it should be possible to turn this behaviour off, perhaps with a separate option that just fsyncs files, ensuring atomicity.

There is also the question to me of what happens if a process is writing lots of files to the same directory: fsyncing the directory on every new file is rather wasteful, and you really want some mechanism for the caller to fsync as part of some higher-level transaction. I don't know how we would expose this, though... Perhaps it doesn't matter - LocalFileSystem inherently trades performance for being able to interoperate with a filesystem; if someone wants optimal disk performance they're probably better off using io_uring and something custom anyway.
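For readers unfamiliar with the file-plus-directory pattern discussed above, here is a minimal sketch of a durable write: fsync the file contents, rename into place, then fsync the parent directory so the new directory entry also survives a crash. `durable_put` is a hypothetical helper for illustration, not object_store's actual code, and it assumes a Unix platform where a directory can be opened read-only and fsynced.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper (not part of object_store's public API): write `data`
/// to `dest` durably via a temp file + rename.
fn durable_put(dest: &Path, data: &[u8]) -> io::Result<()> {
    let dir = dest.parent().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no parent")
    })?;
    let tmp = dir.join(".tmp-put");

    // 1. Write the contents and flush them to stable storage.
    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // fsync(2): file data + metadata reach disk

    // 2. Atomically move the temp file into place.
    fs::rename(&tmp, dest)?;

    // 3. fsync the parent directory so the new directory entry is durable.
    //    On Unix a directory can be opened read-only and synced like a file;
    //    this does not work on Windows.
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

Note that without step 3, a crash after a "successful" write can leave the directory entry missing even though the file data was synced.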
I don't think we should automatically call fsync on filesystem writes. Users who want that behavior can already call `sync_all()` themselves.
I think it would be best put in a layer higher up in the call stack (whatever is calling / orchestrating this higher-level transaction using the `ObjectStore` API).
Defaulting to false seems a bit reckless to me; users are most certainly expecting the local "object store" to be durable, as all the other object store implementations in this crate are (aside from memory...). It was definitely a great surprise to me to learn that it was not.
Is this true? You need a
For the same reason as above, I don't think this is really feasible. |
They can, but it's a major hassle to do. As written above, just calling fsync on the file isn't enough; to implement this properly, you also have to walk the directories and figure out which ones you need to fsync (technically only the ones where new files or sub-directories have been created). That seems impractical to me, hence I think object-store should implement this, at least as an option.

To avoid confusion: calling fsync on a directory does NOT recursively fsync the whole subtree. It really just fsyncs "what's in this directory at this specific level" plus "what's the state of the directory itself, e.g. permissions and attributes".
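The non-recursive behavior described here means that after `create_dir_all` makes several nested directories, each level that gained a new entry must be synced individually. A minimal sketch of that walk, assuming Unix directory-fsync semantics (`fsync_created_dirs` is a hypothetical name, not an object_store API):

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical sketch: after creating `created` (somewhere under `base`)
/// with create_dir_all, a single fsync on `base` does NOT make the nested
/// entries durable. Sync each level from the deepest new directory up.
fn fsync_created_dirs(base: &Path, created: &Path) -> io::Result<()> {
    let mut dir = created;
    loop {
        // fsync covers only this directory's own entries, not the subtree.
        File::open(dir)?.sync_all()?; // Unix: directories can be opened and fsynced
        if dir == base {
            break;
        }
        dir = dir.parent().expect("created must live under base");
    }
    Ok(())
}
```

In a real implementation one would only sync the directories that were actually created or modified, as the comment above suggests.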
I agree that if we were writing this library from scratch, having writes to files automatically call fsync would make sense. Likewise, I am not debating that adding some way to call fsync is valuable. I recommend we do this in two PRs:
This is a good point @adamreeve -- I stand corrected. |
Rationale for this change
When LocalFileSystem::put (or copy/rename/multipart complete) returns Ok, callers reasonably expect the data to be durable on disk as this is the implicit contract of every cloud object store like S3 or GCS.
However, LocalFileSystem never called fsync/sync_all, meaning the OS was free to keep the data in its page cache indefinitely. A crash or power loss after a successful put could result in data loss or zero-length files.
This change adds sync_all() calls on written files and fsync on parent directories at every write-path boundary (put_opts, copy_opts, rename_opts, multipart complete), ensuring that when an operation returns success, both the file contents and the directory entry pointing to them are durable on stable storage.
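The rename boundary deserves the same treatment as put: rename(2) is atomic, but the updated directory entries only become durable once the affected parent directories are fsynced. A sketch of that step, under the same Unix assumptions as above (`durable_rename` is an illustrative helper, not the PR's actual code):

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Hypothetical helper: rename `from` to `to`, then fsync the affected
/// parent directories so the change survives a crash or power loss.
fn durable_rename(from: &Path, to: &Path) -> io::Result<()> {
    fs::rename(from, to)?;
    // Sync the destination's parent so the new entry is durable...
    if let Some(dst) = to.parent() {
        File::open(dst)?.sync_all()?;
    }
    // ...and the source's parent (if different) so the removal is too.
    match (from.parent(), to.parent()) {
        (Some(src), Some(dst)) if src != dst => {
            File::open(src)?.sync_all()?;
        }
        _ => {}
    }
    Ok(())
}
```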
Are there any user-facing changes?
No.