Conversation

Contributor

Copilot AI commented Nov 19, 2025

During auto-detection without extension hints, random bytes in compressed files (e.g., tar.lz) can be misinterpreted as TAR LongName/LongLink headers with multi-gigabyte sizes, causing memory exhaustion.

Changes

  • Added size validation in TarHeader.ReadLongName()

    • Introduced MAX_LONG_NAME_SIZE constant (32KB), which covers real-world path limits
    • Validates size before allocation: if (size < 0 || size > MAX_LONG_NAME_SIZE)
    • Throws InvalidFormatException (caught by IsTarFile, returns false, auto-detection continues)
  • Added regression test

    • Tar_Malformed_LongName_Excessive_Size creates a malformed header claiming an 8GB size
    • Verifies graceful failure instead of memory exhaustion
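Putting the bullet points above together, the guard could look roughly like the sketch below. The constant name, the size check, and the exception type are taken from this PR's description; the surrounding method body is illustrative, not the exact library source.

```csharp
// Sketch of the validation described above. MAX_LONG_NAME_SIZE and
// InvalidFormatException are named in this PR; the method body around
// them is illustrative rather than the verbatim library code.
private const int MAX_LONG_NAME_SIZE = 32 * 1024; // 32KB covers real-world path limits

private string ReadLongName(BinaryReader reader, byte[] buffer)
{
    var size = ReadSize(buffer);

    // Reject negative or absurdly large sizes before allocating anything.
    // During auto-detection this exception is caught by IsTarFile, which
    // returns false so that detection of other formats can continue.
    if (size < 0 || size > MAX_LONG_NAME_SIZE)
    {
        throw new InvalidFormatException(
            $"Tar LongName/LongLink size {size} exceeds limit {MAX_LONG_NAME_SIZE}");
    }

    var nameLength = (int)size;
    var nameBytes = reader.ReadBytes(nameLength);
    // ... skip padding to the next 512-byte block boundary, then decode ...
    return ArchiveEncoding.Decode(nameBytes, 0, nameBytes.Length).TrimNulls();
}
```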

Example

// Before: Could attempt to allocate 8GB+ when detecting tar.lz
using var stream = File.OpenRead("archive.tar.lz");
using var reader = ReaderFactory.Open(stream);  // OutOfMemoryException

// After: Validation prevents excessive allocation, detection succeeds
using var stream = File.OpenRead("archive.tar.lz");
using var reader = ReaderFactory.Open(stream);  // Works correctly

All existing tests pass. No breaking changes.

Original prompt

This section details the original issue you should resolve.

<issue_title>Bug: Memory exhaustion when auto-detecting a specific tar.lz archive</issue_title>
<issue_description>### Summary

When reading a specific .tar.lz file without providing an extension hint, the library attempts to auto-detect the format. This process incorrectly identifies the file as a Tar archive with a LongLink header, leading to an attempt to allocate a massive amount of memory (e.g., 20GB). This causes the application to either crash or fail to open the archive. Standard compression utilities can open this same file without any issues.

The root cause appears to be a lack of validation in TarHeader.Read() and its helper methods.

Steps to Reproduce

  1. Use the library to open a specially crafted .tar.lz file.
  2. Do not specify ReaderOptions.ExtensionHint, forcing the library to auto-detect the archive type.
  3. The library will fail to open the file after a massive memory spike.

Root Cause Analysis

The problem occurs because the auto-detection mechanism first tries to parse the file as a standard Tar archive. My file is a .tar.lz, but a byte at a specific offset is misinterpreted.

  1. In TarHeader.Read(), the code enters a loop to process headers.

    internal bool Read(BinaryReader reader)
    {
        string? longName = null;
        string? longLinkName = null;
        var hasLongValue = true;
        byte[] buffer;
        EntryType entryType;
    
        do
        {
            buffer = ReadBlock(reader);
    
            if (buffer.Length == 0)
            {
                return false;
            }
    
            entryType = ReadEntryType(buffer);
    
            // In my file, the byte at offset 157 is misinterpreted as EntryType.LongLink
            if (entryType == EntryType.LongName)
            {
                longName = ReadLongName(reader, buffer); // <- THIS LINE
                continue;
            }
            else if (entryType == EntryType.LongLink)
            {
                longLinkName = ReadLongName(reader, buffer); // <- THIS LINE
                continue;
            }
    
            hasLongValue = false;
        } while (hasLongValue);
    //...
    }
  2. For my specific file, the byte at offset 157 (read as entryType) happens to match EntryType.LongLink. This triggers a call to TarHeader.ReadLongName().

  3. Inside ReadLongName(), the ReadSize(buffer) method calculates an extremely large value for nameLength based on the misinterpreted header data. The subsequent call to reader.ReadBytes(nameLength) attempts to allocate a massive array without any sanity checks.

    private string ReadLongName(BinaryReader reader, byte[] buffer)
    {
        var size = ReadSize(buffer); // Calculates a huge size
        var nameLength = (int)size;
        var nameBytes = reader.ReadBytes(nameLength); // <- ATTEMPTS HUGE ALLOCATION
        var remainingBytesToRead = BLOCK_SIZE - (nameLength % BLOCK_SIZE);
    
        // ...
        return ArchiveEncoding.Decode(nameBytes, 0, nameBytes.Length).TrimNulls();
    }
  4. The BinaryReader.ReadBytes() method directly allocates memory based on the provided count.

    public virtual byte[] ReadBytes(int count)
    {
        ArgumentOutOfRangeException.ThrowIfNegative(count);
        ThrowIfDisposed();
    
        if (count == 0)
        {
            return Array.Empty<byte>();
        }
    
        byte[] result = new byte[count]; // <- HUGE MEMORY ALLOCATION HAPPENS HERE
        int numRead = _stream.ReadAtLeast(result, result.Length, throwOnEndOfStream: false);
    
        // ...
        return result;
    }
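To see why random bytes can decode to a multi-gigabyte size: the ustar size field is 12 bytes of octal ASCII digits starting at offset 124 of the header, so any run of compressed data that happens to contain octal digits there parses as a valid, possibly enormous, length. A minimal, self-contained sketch of such a parser (illustrative; the library's actual ReadSize handles more cases):

```csharp
// Illustrative parser for the tar size field: 12 ASCII octal digits at
// header offset 124, NUL- or space-terminated. Non-digit handling and
// the base-256 extension are omitted; this is not the library's ReadSize.
static long ParseOctalSize(byte[] header)
{
    long size = 0;
    for (var i = 124; i < 136; i++)
    {
        var b = header[i];
        if (b == 0 || b == (byte)' ')
        {
            break; // end of the octal field
        }
        size = size * 8 + (b - (byte)'0');
    }
    return size;
}

// Eleven '7' digits -- easily produced by random compressed bytes --
// decode to 0o77777777777 = 8,589,934,591 bytes (~8 GB).
```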

Stream Corruption

After the Tar parsing attempt fails (likely due to an EndOfStreamException or I/O error from Stream.ReadAtLeast()), the underlying Stream or SharpCompressStream appears to be left in a corrupted state.

When the auto-detection logic proceeds to the correct tar.lz format, it fails to read the header correctly. For example, it does not see the "LZIP" magic bytes at the beginning of the stream, even though debugging shows the bytes are present in the buffer. This strongly suggests that the stream's internal position or state has been irrecoverably altered by the failed read attempt.
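One conventional way to preserve the invariant the paragraph above describes (each format probe should leave the stream where it found it) is to snapshot and restore the position around every attempt. This PR does not change the stream handling; the sketch below, which only works for seekable streams, is purely an illustration of that invariant, not the library's implementation.

```csharp
// Sketch only: guard a format probe so that a failed or throwing probe
// cannot leave the stream mispositioned for the next format's probe.
// Assumes a seekable stream; not part of this PR's changes.
static bool TryProbe(Stream stream, Func<Stream, bool> probe)
{
    var start = stream.Position;
    try
    {
        return probe(stream);
    }
    catch (Exception)
    {
        return false; // a failed probe means "not this format", not an error
    }
    finally
    {
        stream.Position = start; // restore for the next format's probe
    }
}
```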

Workaround

The issue can be avoided by explicitly setting ReaderOptions.ExtensionHint to guide the parser. This skips the problematic Tar auto-detection step.

// Example workaround
var options = new ReaderOptions { ExtensionHint = "tar.lz" };
using (var archive = ArchiveFactory.Open(filePath, options))
{
    // ...
}

However, most users would expect the auto-detection to be robust and would not think to set this option unless they have investigated the source code.</issue_description>

Comments on the Issue (you are @copilot in this section)

@adamhathcock Please make a P...


Co-authored-by: adamhathcock <527620+adamhathcock@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix memory exhaustion when auto-detecting tar.lz archive" to "Fix memory exhaustion in TAR header auto-detection" Nov 19, 2025
Copilot AI requested a review from adamhathcock November 19, 2025 11:26
Owner

@adamhathcock left a comment

Smells right

@adamhathcock adamhathcock marked this pull request as ready for review November 19, 2025 15:17
Copilot AI review requested due to automatic review settings November 19, 2025 15:17
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical security vulnerability where malformed TAR headers with excessively large LongName/LongLink sizes could cause memory exhaustion during archive auto-detection, particularly when reading compressed TAR files without extension hints.

Key Changes:

  • Added size validation (32KB limit) in TarHeader.ReadLongName() to prevent memory exhaustion attacks
  • Implemented regression test that verifies graceful failure with malformed 8GB-sized headers
  • Generalized .gitignore patterns for test scratch directories

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

  • src/SharpCompress/Common/Tar/Headers/TarHeader.cs: Added MAX_LONG_NAME_SIZE constant and validation logic to prevent excessive memory allocation when reading LongName/LongLink headers
  • tests/SharpCompress.Test/Tar/TarReaderTests.cs: Added a regression test that creates a malformed TAR header with an 8GB size and verifies it throws IncompleteArchiveException instead of OutOfMemoryException
  • .gitignore: Generalized Scratch directory patterns to cover subdirectories (tests/TestArchives/**/Scratch and tests/TestArchives/**/Scratch2)


@adamhathcock adamhathcock merged commit 2b4da7e into master Nov 19, 2025
12 of 13 checks passed
@adamhathcock adamhathcock deleted the copilot/fix-memory-exhaustion-bug branch November 19, 2025 15:33


Development

Successfully merging this pull request may close these issues.

Bug: Memory exhaustion when auto-detecting a specific tar.lz archive
