Fix TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Linux#125524
Fix TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Linux#125524steveisok wants to merge 1 commit intodotnet:mainfrom
Conversation
When two processes concurrently create a named mutex backed by a shared
memory file, the following race can occur:
1. Process A calls Open(O_RDWR) — returns ENOENT (file doesn't exist)
2. Process B creates the file
3. Process A calls Open(O_CREAT|O_EXCL) — returns EEXIST (file now exists)
4. Process A throws IOException: 'The file already exists'
This manifests as intermittent IOException crashes in NuGet's
MigrationRunner when parallel dotnet processes run first-time setup,
because NuGet.Common.Migrations.MigrationRunner uses a named mutex
('NuGet-Migrations') to synchronize.
The fix adds a single retry: when the exclusive create fails with
EEXIST, loop back to re-attempt the plain open, which will now succeed
since the file exists.
Fixes dotnet#91987
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @dotnet/area-system-io |
There was a problem hiding this comment.
Pull request overview
Fixes a TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Unix by retrying when the exclusive create fails with EEXIST, preventing intermittent IOException when multiple processes concurrently create shared-memory-backed mutex files.
Changes:
- Wrap the open/create sequence in a retry loop.
- On
O_CREAT|O_EXCLreturningEEXIST, retry the initialO_RDWRopen once.
You can also share your feedback on Copilot code review. Take the survey.
| // Retry loop to handle the TOCTOU race between the initial open attempt (which may return ENOENT) | ||
| // and the exclusive create attempt (which may return EEXIST if another process created the file | ||
| // in between). On EEXIST, we loop back and re-attempt the open. | ||
| for (int retries = 0; ; retries++) |
There was a problem hiding this comment.
This is going need a bit of reworking, but I think it is the correct approach. It looks like we'd need to elevate the Interop.ErrorInfo out of the loop and after the loop body move the throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath); that is currently being preempted by the new continue logic.
There was a problem hiding this comment.
Cool, I figured the same thing. Felt it was good to push this up as a placeholder
| if (error.Error == Interop.Error.EEXIST && retries < 1) | ||
| { | ||
| continue; | ||
| } |
| for (int retries = 0; ; retries++) | ||
| { | ||
| SafeFileHandle fd = Interop.Sys.Open(sharedMemoryFilePath, Interop.Sys.OpenFlags.O_RDWR | Interop.Sys.OpenFlags.O_CLOEXEC, 0); | ||
| Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo(); |
| // Another process created the file between our open and create attempts. | ||
| // Retry from the top to open the now-existing file. | ||
| if (error.Error == Interop.Error.EEXIST && retries < 1) | ||
| { | ||
| continue; | ||
| } |
| if (!fd.IsInvalid) | ||
| { | ||
| if (id.IsUserScope) | ||
| // Retry loop to handle the TOCTOU race between the initial open attempt (which may return ENOENT) |
There was a problem hiding this comment.
A different idea to consider. Please take it with a grain of salt as I don't have the context:
- when
createIfNotExistis true, attempt in a loop untilfd.IsValid:- create it with O_CREAT | O_EXCL. A success means we just created it and are the owner.
- if creation failed with
EEXIST, try to open it without creation - if opening without creation failed with ENOENT, it means it got removed in the meantime. Continue the loop
- attempt to open an existing file
Pseudocode:
UnixFileMode permissionsMask = id.IsUserScope
? PermissionsMask_OwnerUser_ReadWrite
: PermissionsMask_AllUsers_ReadWrite;
const Interop.Sys.OpenFlags mandatoryFlags = Interop.Sys.OpenFlags.O_RDWR | Interop.Sys.OpenFlags.O_CLOEXEC
SafeFileHandle fd = new();
while (createIfNotExist && fd.IsInvalid)
{
// Use O_EXCL which provides a guarantee that the file is created by this call.
fd = Interop.Sys.Open(
sharedMemoryFilePath,
mandatoryFlags | Interop.Sys.OpenFlags.O_CREAT | Interop.Sys.OpenFlags.O_EXCL,
(int)permissionsMask);
if (fd.IsInvalid)
{
Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo();
fd.Dispose();
if (error.Error == Interop.Error.EEXIST)
{
fd = Interop.Sys.Open(sharedMemoryFilePath, mandatoryFlags, 0);
if (fd.IsInvalid)
{
error = Interop.Sys.GetLastErrorInfo();
fd.Dispose();
// The file could have been deleted after the first open attempt, in which case we should retry creating the file.
if (error.Error == Interop.Error.ENOENT)
{
continue;
}
}
}
throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath);
}
else
{
createdFile = true;
}
}
if (fd.IsInvalid)
{
Debug.Assert(!createIfNotExist);
fd = Interop.Sys.Open(sharedMemoryFilePath, mandatoryFlags, 0);
if (fd.IsInvalid)
{
Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo();
fd.Dispose();
throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath);
}
}
// If we got here, fd is a valid file handle.There was a problem hiding this comment.
I think this approach would work.
The main difference would be that we're optimizing (in terms of number of syscalls) for the create case here instead of the open existing case. Not that that really matters given all the other infrastructure here IMO.
There was a problem hiding this comment.
I think some minor modifications to the current one is easier to grep than this one. @adamsitnik what prompted your suggestions is there some optimization inherent in this alternative approach?
There was a problem hiding this comment.
I think that Adam's implementation, while a bigger diff from the existing code, is much easier to read and understand when looking at at a glance.
Adam's implementation also better follows Jared's advice (ie don't check for existence as existence may change. Just do the action and handle the errors when the file does not exist.
) https://blog.paranoidcoding.org/2009/12/10/the-file-system-is-unpredictable.html
There was a problem hiding this comment.
I think some minor modifications to the current one is easier to grep than this one. @adamsitnik what prompted your suggestions is there some optimization inherent in this alternative approach?
The problem we are trying to solve is very similar to some File.OpenHandle issues (and this is an API I own) and I personally prefer to use O_EXCL when dealing with TOCTOU because it's atomic.
But again, as I wrote I don't have the context (is the most common case to create the file? or open an existing one?)
There was a problem hiding this comment.
and this is an API I own
Then we can absolutely go in that direction.
But again, as I wrote I don't have the context (is the most common case to create the file? or open an existing one?)
I don't have enough information to answer that either. I agree it does look like a pessimization with respect to existing paths, but as @jkoritzinsky points out there is a ton of machinery here so I'm not sure how much that matters.
Do we have any of these APIs in the perf lab? Is a there a microbenchmark we could run locally that would help see the cost?
There was a problem hiding this comment.
Do we have any of these APIs in the perf lab? Is a there a microbenchmark we could run locally that would help see the cost?
If we don't, we should add it.
|
Really glad to see that moving the shared mutex logic to managed has helped us solve at least one bug! |
Summary
Fix a TOCTOU (time-of-check-to-time-of-use) race condition in
SharedMemoryHelpers.CreateOrOpenFilethat causes intermittentIOExceptionwhen two processes concurrently create a named mutex backed by shared memory on Linux.The Bug
CreateOrOpenFileuses a two-step approach:Open(O_RDWR)to open an existing fileOpen(O_CREAT|O_EXCL)to create exclusivelyThe race:
Open(O_RDWR)→ ENOENT (file does not exist)Open(O_CREAT|O_EXCL)→ EEXIST (file now exists)This manifests as:
The Fix
When the exclusive create fails with EEXIST, retry from the top (loop back to the
Open(O_RDWR)which will now succeed since the file exists). Limited to one retry to prevent infinite loops.Impact
This is a known flaky failure in CI (labeled
Known Build Error) that has been open since September 2023. It affects any scenario where paralleldotnetprocesses run first-time setup, including:ValidateInstallerson Linux arm64)/tmpFixes #91987
Related: #80619, #76736