Skip to content

Add a generic abstraction for filesystem based stores#3364

Merged
fahedouch merged 3 commits intocontainerd:mainfrom
apostasie:hardening-filestore
Sep 5, 2024
Merged

Add a generic abstraction for filesystem based stores#3364
fahedouch merged 3 commits intocontainerd:mainfrom
apostasie:hardening-filestore

Conversation

@apostasie
Copy link
Copy Markdown
Contributor

@apostasie apostasie commented Aug 26, 2024

Although our codebase makes use of flock when we manipulate the filesystem (normally...) , we have had a relatively significant number of issues in the past with our various "file stores". They typically fall in the following categories:

  • unprotected concurrent access to the same resource
  • filesystem details leaking upward into higher level abstractions where they are manipulated unsafely
  • lock that does not really lock
  • lock that should lock over a larger section of code
  • sequential (individually locked) operations being done assuming that data cannot change in-between them
  • partial writes to the filesystem

Issues arising from these are generally heinous to debug, as they usually manifest in a flaky way (as they may only happen in high-concurrency situations / under pressure), or tend to bubble-up in different places making seemingly unrelated operations fail in cryptic ways.

Truth is, it is not 100% trivial to achieve a concurrency-safe and atomic filesystem storage system, even a very simple one like what we do - operating system crash or restarts, or generally interrupted writes, for example, are often overlooked as something that has to be thought about, to boot.

While we have fixed a number of these issues, there is clearly more lingering, and we are bound to produce even more in the future short of having a stronger basis.
I do not think it is fair to assume that everyone will be very mindful and intentional with their locking, or that the filesystem is a good enough, "safe" abstraction...

Furthermore, and even when done right, these things are inherently repetitive (mutex / filesystem lock / path sanitization / atomic write) and we now have enough of these "filestores" to justify having a shared base:

  • namestore
  • volumestore
  • lifecycledata
  • hostsstore

This PR first commit does introduce an interface and a reference (filesystem) implementation of a reusable Store component that every "store" can leverage and that provides:

  • concurrency safety (thread safe and safe across concurrent invocations of the binary)
  • atomic operations (write will never leave incomplete data in a file)
  • path components validation, per-platform
  • a small set of simple methods that should cover most use-cases
  • completely sealed underlying implementation that does not leak
  • proper, detailed errors
  • some level of locking enforcement (eg: will error if there is no lock)
  • tests

Structs that leverage it are thus freed from having to deal with mutex / filelocks, or underlying implementations (eg: filesystem). They can rely on a simple API that is responsible for just that, and can focus solely on the abstraction they have to provide for their specific data.

Hopefully, if things go our way, we should never see again a bug report about hosed json files / broken systems on restart, or flakyness involving stores...

The second commit is a mere cleanup of the (useless) mock for the volume store

The third commit is a refactor of our various stores to leverage that:

  • all four stores have been reviewed and their syntax harmonized
  • better erroring
  • code duplication removal
  • all explicit locking mechanisms have been removed (to simplify the user life) (with the exception of the VolumeStore) and all stores now expose methods tailored for their use cases that lock on their own
  • all use cases have been reviewed to make sure nothing is racy (exceptions with Composer, obviously)
  • ensuring locking is done at the namespace level, or lower

Incidentally, the following known issues are getting fixed or partially fixed:

Finally, I do appreciate this is a far reaching PR, as it is touching code in the critical path of basically everything.
To a large extent, though, this is mere refactoring and literally no novelty.

@apostasie apostasie marked this pull request as draft August 26, 2024 05:54
@apostasie apostasie force-pushed the hardening-filestore branch 14 times, most recently from d500496 to e7bf627 Compare August 27, 2024 04:00
Comment thread cmd/nerdctl/namespace.go
if err != nil {
log.L.Warn(err)
} else {
volEnts, err := volStore.List(false)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List() is expensive.
Replaced with the much lighter Count.

return false, volGetErr
}
// FIXME: this is racy. See note in up_volume.go
options.VolumeExists = volStore.Exists
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is now a Exists method, much cheaper than Get

Comment thread pkg/cmd/container/remove.go
Comment thread pkg/cmd/volume/prune.go
if err != nil {
return err
}
err = volStore.Lock()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing all of that with a volStore.Prune(func) that does locking.

@@ -1,41 +0,0 @@
/*
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved inside the store.
These things are restrictions linked to the filesystem implementation, and not conceptually to restrictions we necessarily want to apply to identifiers.
Put otherwise: they are orthogonal concerns.

@@ -0,0 +1,37 @@
/*
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replace the whole uber go mock thing for mountutil.

@apostasie apostasie force-pushed the hardening-filestore branch 9 times, most recently from 03d3241 to 1a7f3d0 Compare August 28, 2024 00:27
copyFileContent("/etc/resolv.conf", resolvConfPath)

etcHostsPath, err := hostsstore.AllocHostsFile(dataStore, m.globalOptions.Namespace, containerID)
hs, err := hostsstore.New(dataStore, m.globalOptions.Namespace)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write only once, instead of allocating first, then writing.

return fmt.Errorf("identifier %q must match pattern %q: %w", s, AllowedIdentfierChars, errdefs.ErrInvalidArgument)
}

if err := validatePlatformSpecific(s); err != nil {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Platform validation of identifiers is a filesystem concern.
These are moved into the store.Store, as they are orthogonal concerns.

Comment thread cmd/nerdctl/volume_remove_linux_test.go Outdated
Comment thread pkg/cmd/volume/rm.go
@apostasie
Copy link
Copy Markdown
Contributor Author

Rebase broke some tests.
Investigating.

@apostasie apostasie force-pushed the hardening-filestore branch 2 times, most recently from 073c2cd to 0d4485f Compare September 1, 2024 19:17
@apostasie
Copy link
Copy Markdown
Contributor Author

  • rebased
  • removed commented out extra stuff
  • changed volStore.Remove semantics as discussed above to allow for locking of a function provided as a generator

CI is green.

@apostasie apostasie force-pushed the hardening-filestore branch 2 times, most recently from 09a7aa9 to 60c68b9 Compare September 2, 2024 17:23
@apostasie
Copy link
Copy Markdown
Contributor Author

Rebased. Pending green CI.

@AkihiroSuda AkihiroSuda added this to the v2.0.0 milestone Sep 3, 2024
@AkihiroSuda
Copy link
Copy Markdown
Member

GroupRename() seems added (and commented out) in the first commit and then removed in the third commit.
Probably this should be just removed from the first commit

@apostasie
Copy link
Copy Markdown
Contributor Author

GroupRename() seems added (and commented out) in the first commit and then removed in the third commit. Probably this should be just removed from the first commit

My bad. Thanks for catching it.
Last push corrects it.

(CI failing on Hub being 500)

Comment thread pkg/cmd/container/remove.go
Comment thread pkg/cmd/volume/prune.go
Comment thread pkg/store/store.go Outdated
@fahedouch
Copy link
Copy Markdown
Member

fahedouch commented Sep 3, 2024

big UP for this PR 👍 , I have two points to raise here:

  • please avoid huge PR, it is really hard to review
  • we might need a document for this abstraction (e.g. big picture abstraction design)

@apostasie
Copy link
Copy Markdown
Contributor Author

big UP for this PR 👍 , I have two points to raise here:

  • please avoid huge PR, it is really hard to review
  • we might need a document for this abstraction (e.g. big picture abstraction design)

Thanks a lot @fahedouch
Really appreciate the sharp eye review
On my phone now but will address your comments momentarily

ack on splitting in more commits
Will definitely do in upcoming work

Signed-off-by: apostasie <spam_blackhole@farcloser.world>
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
Copy link
Copy Markdown
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Copy Markdown
Member

@fahedouch fahedouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@fahedouch fahedouch merged commit 678393e into containerd:main Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Containers across namespaces boundary show up in /etc/hosts

3 participants