Skip to content

Fix TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Linux#125524

Open
steveisok wants to merge 1 commit intodotnet:mainfrom
steveisok:fix/shared-memory-eexist-race
Open

Fix TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Linux#125524
steveisok wants to merge 1 commit intodotnet:mainfrom
steveisok:fix/shared-memory-eexist-race

Conversation

@steveisok
Copy link
Member

Summary

Fix a TOCTOU (time-of-check-to-time-of-use) race condition in SharedMemoryHelpers.CreateOrOpenFile that causes intermittent IOException when two processes concurrently create a named mutex backed by shared memory on Linux.

The Bug

CreateOrOpenFile uses a two-step approach:

  1. Try Open(O_RDWR) to open an existing file
  2. If ENOENT, try Open(O_CREAT|O_EXCL) to create exclusively

The race:

  1. Process A: Open(O_RDWR) → ENOENT (file does not exist)
  2. Process B: creates the file successfully
  3. Process A: Open(O_CREAT|O_EXCL) → EEXIST (file now exists)
  4. Process A: throws IOException instead of handling EEXIST

This manifests as:

System.IO.IOException: The file '/tmp/.dotnet/shm/session1/NuGet-Migrations' already exists.
   at System.IO.SharedMemoryHelpers.CreateOrOpenFile(...)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at NuGet.Common.Migrations.MigrationRunner.Run(...)

The Fix

When the exclusive create fails with EEXIST, retry from the top (loop back to the Open(O_RDWR) which will now succeed since the file exists). Limited to one retry to prevent infinite loops.

Impact

This is a known flaky failure in CI (labeled Known Build Error) that has been open since September 2023. It affects any scenario where parallel dotnet processes run first-time setup, including:

  • VMR scenario tests (ValidateInstallers on Linux arm64)
  • Docker container first-run
  • CI environments with shared /tmp

Fixes #91987
Related: #80619, #76736

When two processes concurrently create a named mutex backed by a shared
memory file, the following race can occur:

1. Process A calls Open(O_RDWR) — returns ENOENT (file doesn't exist)
2. Process B creates the file
3. Process A calls Open(O_CREAT|O_EXCL) — returns EEXIST (file now exists)
4. Process A throws IOException: 'The file already exists'

This manifests as intermittent IOException crashes in NuGet's
MigrationRunner when parallel dotnet processes run first-time setup,
because NuGet.Common.Migrations.MigrationRunner uses a named mutex
('NuGet-Migrations') to synchronize.

The fix adds a single retry: when the exclusive create fails with
EEXIST, loop back to re-attempt the plain open, which will now succeed
since the file exists.

Fixes dotnet#91987

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 13, 2026 15:58
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a TOCTOU race in SharedMemoryHelpers.CreateOrOpenFile on Unix by retrying when the exclusive create fails with EEXIST, preventing intermittent IOException when multiple processes concurrently create shared-memory-backed mutex files.

Changes:

  • Wrap the open/create sequence in a retry loop.
  • On O_CREAT|O_EXCL returning EEXIST, retry the initial O_RDWR open once.

You can also share your feedback on Copilot code review. Take the survey.

// Retry loop to handle the TOCTOU race between the initial open attempt (which may return ENOENT)
// and the exclusive create attempt (which may return EEXIST if another process created the file
// in between). On EEXIST, we loop back and re-attempt the open.
for (int retries = 0; ; retries++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going need a bit of reworking, but I think it is the correct approach. It looks like we'd need to elevate the Interop.ErrorInfo out of the loop and after the loop body move the throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath); that is currently being preempted by the new continue logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I figured the same thing. Felt it was good to push this up as a placeholder

Comment on lines +483 to +486
if (error.Error == Interop.Error.EEXIST && retries < 1)
{
continue;
}
Comment on lines +424 to +427
for (int retries = 0; ; retries++)
{
SafeFileHandle fd = Interop.Sys.Open(sharedMemoryFilePath, Interop.Sys.OpenFlags.O_RDWR | Interop.Sys.OpenFlags.O_CLOEXEC, 0);
Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo();
Comment on lines +481 to +486
// Another process created the file between our open and create attempts.
// Retry from the top to open the now-existing file.
if (error.Error == Interop.Error.EEXIST && retries < 1)
{
continue;
}
if (!fd.IsInvalid)
{
if (id.IsUserScope)
// Retry loop to handle the TOCTOU race between the initial open attempt (which may return ENOENT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different idea to consider. Please take it with a grain of salt as I don't have the context:

  • when createIfNotExist is true, attempt in a loop until fd.IsValid:
    • create it with O_CREAT | O_EXCL. A success means we just created it and are the owner.
    • if creation failed with EEXIST, try to open it without creation
    • if opening without creation failed with ENOENT, it means it got removed in the meantime. Continue the loop
  • attempt to open an existing file

Pseudocode:

UnixFileMode permissionsMask = id.IsUserScope
    ? PermissionsMask_OwnerUser_ReadWrite
    : PermissionsMask_AllUsers_ReadWrite;

const Interop.Sys.OpenFlags mandatoryFlags = Interop.Sys.OpenFlags.O_RDWR | Interop.Sys.OpenFlags.O_CLOEXEC
SafeFileHandle fd = new();
while (createIfNotExist && fd.IsInvalid)
{
    // Use O_EXCL which provides a guarantee that the file is created by this call.
    fd = Interop.Sys.Open(
        sharedMemoryFilePath,
        mandatoryFlags | Interop.Sys.OpenFlags.O_CREAT | Interop.Sys.OpenFlags.O_EXCL,
        (int)permissionsMask);

    if (fd.IsInvalid)
    {
        Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo();
        fd.Dispose();

        if (error.Error == Interop.Error.EEXIST)
        {
            fd = Interop.Sys.Open(sharedMemoryFilePath, mandatoryFlags, 0);
            if (fd.IsInvalid)
            {
                error = Interop.Sys.GetLastErrorInfo();
                fd.Dispose();

                // The file could have been deleted after the first open attempt, in which case we should retry creating the file.
                if (error.Error == Interop.Error.ENOENT)
                {
                    continue;
                }
            }
        }

        throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath);
    }
    else
    {
        createdFile = true;
    }
}

if (fd.IsInvalid)
{
    Debug.Assert(!createIfNotExist);
    fd = Interop.Sys.Open(sharedMemoryFilePath, mandatoryFlags, 0);

    if (fd.IsInvalid)
    {
        Interop.ErrorInfo error = Interop.Sys.GetLastErrorInfo();
        fd.Dispose();
        throw Interop.GetExceptionForIoErrno(error, sharedMemoryFilePath);
    }
}

// If we got here, fd is a valid file handle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach would work.

The main difference would be that we're optimizing (in terms of number of syscalls) for the create case here instead of the open existing case. Not that that really matters given all the other infrastructure here IMO.

Copy link
Member

@AaronRobinsonMSFT AaronRobinsonMSFT Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some minor modifications to the current one is easier to grep than this one. @adamsitnik what prompted your suggestions is there some optimization inherent in this alternative approach?

Copy link
Member

@jkoritzinsky jkoritzinsky Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that Adam's implementation, while a bigger diff from the existing code, is much easier to read and understand when looking at at a glance.

Adam's implementation also better follows Jared's advice (ie don't check for existence as existence may change. Just do the action and handle the errors when the file does not exist.
) https://blog.paranoidcoding.org/2009/12/10/the-file-system-is-unpredictable.html

Copy link
Member

@adamsitnik adamsitnik Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some minor modifications to the current one is easier to grep than this one. @adamsitnik what prompted your suggestions is there some optimization inherent in this alternative approach?

The problem we are trying to solve is very similar to some File.OpenHandle issues (and this is an API I own) and I personally prefer to use O_EXCL when dealing with TOCTOU because it's atomic.

But again, as I wrote I don't have the context (is the most common case to create the file? or open an existing one?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this is an API I own

Then we can absolutely go in that direction.

But again, as I wrote I don't have the context (is the most common case to create the file? or open an existing one?)

I don't have enough information to answer that either. I agree it does look like a pessimization with respect to existing paths, but as @jkoritzinsky points out there is a ton of machinery here so I'm not sure how much that matters.

Do we have any of these APIs in the perf lab? Is a there a microbenchmark we could run locally that would help see the cost?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any of these APIs in the perf lab? Is a there a microbenchmark we could run locally that would help see the cost?

If we don't, we should add it.

@jkoritzinsky
Copy link
Member

Really glad to see that moving the shared mutex logic to managed has helped us solve at least one bug!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The system cannot open the device or file specified 'NuGet-Migrations'

5 participants