Skip to content

feat(local): add fsync to LocalFileSystem for durability#643

Open
Barre wants to merge 1 commit intoapache:mainfrom
Barre:add-fsync-local-filesystem
Open

feat(local): add fsync to LocalFileSystem for durability#643
Barre wants to merge 1 commit intoapache:mainfrom
Barre:add-fsync-local-filesystem

Conversation

@Barre
Copy link

@Barre Barre commented Feb 15, 2026

Call sync_all() on written files and fsync parent directories at all write-path boundaries (put, copy, rename, multipart complete) so that a successful return guarantees data is durable on disk, matching the implicit contract of cloud object stores.

Rationale for this change

When LocalFileSystem::put (or copy/rename/multipart complete) returns Ok, callers reasonably expect the data to be durable on disk as this is the implicit contract of every cloud object store like S3 or GCS.

However, LocalFileSystem never called fsync/sync_all, meaning the OS was free to keep the data in its page cache indefinitely. A crash or power loss after a successful put could result in data loss or zero-length files.

This change adds sync_all() calls on written files and fsync on parent directories at every write-path boundary (put_opts, copy_opts, rename_opts, multipart complete), ensuring that when an operation returns success, both the file contents and the directory entry pointing to them are durable on stable storage.

Are there any user-facing changes?

No.

Call sync_all() on written files and fsync parent directories at all
write-path boundaries (put, copy, rename, multipart complete) so that
a successful return guarantees data is durable on disk, matching the
implicit contract of cloud object stores.
@crepererum
Copy link
Contributor

I think this is technically what we should do. I wonder however if we want to offer some kind of opt-out mechanism (I think in vein of "safety+sanity by default", it shouldn't be opt-in). WDYT @tustvold?

@tustvold
Copy link
Contributor

tustvold commented Feb 16, 2026

I think calling fsync on the files makes a lot of sense, and ensures that we maintain atomicity (or at least try do our best to).

Calling fsync on directories also makes sense, but I think this PR doesn't go far enough, it needs to ensure that any recursively created directories are also fsynced...

I think it makes sense for this to be enabled by default (in a future breaking release), but it should be possible to turn this behaviour off, perhaps with a separate option that just fsync's files, ensuring atomicity.

There is also the question to me of what happens if a process is writing lots of files to the same directory, fsyncing the directory on every new file is rather wasteful, you really want some mechanism for the caller to fsync as part of some higher-level transaction. I don't know how we would expose this though...

Perhaps it doesn't matter - LocalFileSystem inherently trades performance for being able to interoperate with a filesystem, if someone wants the optimal disk performance they're probably better off using io_uring and something custom anyway.

@alamb
Copy link
Contributor

alamb commented Mar 19, 2026

I don't think we should automatically call fsync on filesystem writes. Users who want that behavior can already call fsync when the write returns. An option (defaulting to false) would make sense for convenience

There is also the question to me of what happens if a process is writing lots of files to the same directory, fsyncing the directory on every new file is rather wasteful, you really want some mechanism for the caller to fsync as part of some higher-level transaction.

I think it be best put in a layer higher up in the call stack (whatever is calling / orchestrating this higher level transaction using the object_store crate)

@Barre
Copy link
Author

Barre commented Mar 20, 2026

Defaulting to false seems a bit reckless to me, users are most certainly expecting the local "object store" to be durable, as are all other object store implementations in this crate (aside from memory...).

It was definitely a great surprise to me to learn that it was not.

@adamreeve
Copy link
Contributor

Users who want that behavior can already call fsync when the write returns. An option (defaulting to false) would make sense for convenience

Is this true? You need a File to call fsync on, and you can't call fsync after the file is closed. The ObjectStore trait doesn't expose a way to get the file after a write as far as I can tell.

I think it be best put in a layer higher up in the call stack (whatever is calling / orchestrating this higher level transaction using the object_store crate)

For the same reason as above, I don't think this is really feasible.

@crepererum
Copy link
Contributor

crepererum commented Mar 20, 2026

I don't think we should automatically call fsync on filesystem writes. Users who want that behavior can already call fsync when the write returns.

They can, but it's a major hassle to do. As written above, just calling fsync on the file isn't enough, so to implement this properly, you also have to walk the directories and figure out which ones you need to fsync (technically only the ones where new files or sub-directories have been created). That to me seems unpractical. Hence I think object-store should implement this, at least as an option.

To avoid confusion: calling fsync on a directory does NOT recursively fsync the whole subtree. It really just fsync "what's in this directory at this specific level" + "what's the state of the directory, e.g. permissions and attributes".

@alamb
Copy link
Contributor

alamb commented Mar 20, 2026

Defaulting to false seems a bit reckless to me, users are most certainly expecting the local "object store" to be durable, as are all other object store implementations in this crate (aside from memory...).

It was definitely a great surprise to me to learn that it was not.

I agree that if we were writing this library from scratch, having writes to files automatically call fsync would be a very reasonable behavior.

Likewise, I am not debating that adding some way to call fsync is valuable, I am simply voting against changing the default behavior to avoid causing hard to track down downstream implications (though it enough people vote the other way I will be willing to defer).

I recommend we do this in two PRs:

  1. Add a new flag, default to false that adds fsync on write (there seems to be broad agreement that this is a useful feature)
  2. Propose a separate PR to change the default of the new flag to true where we can debate the implications of that change.

Is this true? You need a File to call fsync on, and you can't call fsync after the file is closed. The ObjectStore trait doesn't expose a way to get the file after a write as far as I can tell.

This is a good point @adamreeve -- I stand corrected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants