Skip to content

One shot pem reader and writer#32260

Merged
bartonjs merged 41 commits intodotnet:masterfrom
vcsjones:one-shot-pem
Mar 22, 2020
Merged

One shot pem reader and writer#32260
bartonjs merged 41 commits intodotnet:masterfrom
vcsjones:one-shot-pem

Conversation

@vcsjones
Copy link
Member

Closes #29588

I thought I would give this one a shot.

/cc @bartonjs

@Dotnet-GitSync-Bot
Copy link
Collaborator

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@vcsjones
Copy link
Member Author

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Well, I tried.

Range postebLineRange = lineRange;
int postebLinePosition = preebLinePosition;

// in lax decoding a posteb may appear on the same line as the preeb.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't actually immediately clear to me from the RFC, if we even want to support single-line PEMs.

If not, this code can get a little simpler but I suspect someone will wish it "just worked".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I think we're supporting lax base64 (we don't require that there be exactly 64 base64 characters per line with no whitespace other than newlines) it seems like we want the full laxtextualmsg treatment, so no newlines are required anywhere

   laxtextualmsg    = *W preeb
                      laxbase64text
                      posteb *W


  preeb      = "-----BEGIN " label "-----" ; unlike [RFC1421] (A)BNF,
                                           ; eol is not required (but
  posteb     = "-----END " label "-----"   ; see [RFC1421], Section 4.4)

   W                = WSP / CR / LF / %x0B / %x0C           ; whitespace

   laxbase64text    = *(W / base64char) [base64pad *W [base64pad *W]]

Presumably, removing all newline processing (and ReadLine) and just using IndexOf to find preeb/posteb would actually make things simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably, removing all newline processing (and ReadLine) and just using IndexOf to find preeb/posteb would actually make things simpler?

It still behaved as lax, but agree that it isn't necessary to handle line-by-line. I will address. It seems I may have misunderstood the whitespace requirement around the preeb / posteb for lax.

@vcsjones
Copy link
Member Author

Looking at failures.

@vcsjones
Copy link
Member Author

OSX failures are #702.

continue;
}

int postebLength = Posteb.Length + label.Length + Ending.Length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this overflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. label's length is going to be based on a found label in the PEM, and we've already checked for the "-----BEGIN " and "-----", so we know that the ROS (whose max length is an Int32) contains something of this length.

for (int index = 1; index < data.Length; index++)
{
char c = data[index];
if (!IsLabelChar(c) && c != ' ' && c != '-')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant c != '-' check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. IsLabelChar says "Is this a valid character, except for "-"? The first character in allowed to be anything except a space or hyphen. The other characters may be any of those characters, including the hyphen or space. So it's a double negation. I'll try and make this easier to follow.

int precedingWhiteSpace = 0;
int trailingWhiteSpace = 0;
(int offset, int length) = content.GetOffsetAndLength(data.Length);
for (int index = offset; index < offset + length; index++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how important perf is here, but I suspect this validation is quite slow (and doing the actual encoding using the Base64 APIs built on top of avx2/Ssse3 might actually be faster).

Alternatively, we could use some of the Base64 implementation as inspiration to make this method faster.

How large can length be?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I suspect this validation is quite slow

Yeah, it occurred to me but when I looked around for inspiration such as at

private static unsafe int FromBase64_ComputeResultLength(char* inputPtr, int inputLength)

I see it still doing character-by-character examination to account for various whitespace (albeit with unsafe tricks), just that the method has a looser definition of whitespace than the RFC.

Is there a better place to be looking here for this?

How large can length be?

It could be very large, but the typical use-case for PEM is PKI and CMS. CMS can be several thousand characters.

Copy link
Contributor

@lpereira lpereira Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation could be moved to System.Buffers.Text.Base64. It is already vectorized, and contains a (private) decoding map that can be used to efficiently validate the input and still provide a scalar fallback. The functions that decode using AVX2 and SSSE3 already check for invalid input before decoding, so it should be trivial.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to make that validation "whitespace" aware. The existing S.B.T.Base64 APIs do not support ignoring whitespace characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, something like this (which includes the work of actually decoding) is ~2x faster than doing the linear validation (for a ~1k buffer without any leading/trailing whitespace). I am not necessarily suggesting that this is the way to go (especially given the unnecessary work it does and is no where near "optimal"), but at the very least it avoids having to provide your own custom logic for the base64 validation. Ideally, this type of API is built-in to either the Convert or Base64 classes.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015914
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
  Job-XEOKAL : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Toolchain=CoreRun  MaxIterationCount=10  
MinIterationCount=5  WarmupCount=3  
Method Mean Error StdDev Median Min Max Ratio Gen 0 Gen 1 Gen 2 Allocated
ValidateOriginal 6.308 us 0.1829 us 0.1088 us 6.316 us 6.186 us 6.548 us 1.00 - - - -
ValidateNew 3.290 us 0.0724 us 0.0431 us 3.302 us 3.214 us 3.352 us 0.52 - - - -
private static readonly char[] s_whitespace = new char[] { ' ', '\t', '\r', '\n' };
private static readonly char[] s_base64 = new char[]
{
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' ,'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r' ,'s', 't',
    'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B' ,'C', 'D',
    'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L' ,'M', 'N',
    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V' ,'W', 'X',
    'Y', 'Z', '+', '/'
};

private bool IsValidBase64(out int decodedSize)
{
    ReadOnlySpan<char> input = _base64String.AsSpan();

    int startIndex = input.IndexOfAny(s_base64);

    if (startIndex == -1)
    {
        startIndex = 0;
    }

    input = input.Slice(startIndex);

    int finalIndex = input.LastIndexOfAny(s_whitespace);

    if (finalIndex == -1)
    {
        finalIndex = input.Length;
    }

    input = input.Slice(0, finalIndex);

    byte[] buffer = ArrayPool<byte>.Shared.Rent(Base64.GetMaxDecodedFromUtf8Length(input.Length));

    bool result = Convert.TryFromBase64Chars(input, buffer, out decodedSize);
    if (!result)
    {
        decodedSize = 0;
    }

    ArrayPool<byte>.Shared.Return(buffer, clearArray: true);

    return result;
}

namespace System.Security.Cryptography
{
/// <summary>
/// RFC-7468 PEM (Privacy-Enhanced Mail) parsing and encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// RFC-7468 PEM (Privacy-Enhanced Mail) parsing and encoding.
/// Provides methods for reading and writing the IETF RFC 7468 subset of PEM (Privacy-Enhanced Mail)
/// textual encodings. This class cannot be inherited.

public static class PemEncoding
{
private const string Preeb = "-----BEGIN ";
private const string Posteb = "-----END ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be PreEB and PostEB, since EB is "encapsulation boundary". Perhaps even PreEBPrefix/PostEBPrefix

/// Finds the first PEM-encoded data.
/// </summary>
/// <param name="pemData">
/// A span containing the PEM encoded data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// A span containing the PEM encoded data.
/// The text containing the PEM-encoded data.

/// A span containing the PEM encoded data.
/// </param>
/// <exception cref="ArgumentException">
/// <paramref name="pemData"/> does not contain a well-formed PEM encoded value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// <paramref name="pemData"/> does not contain a well-formed PEM encoded value.
/// <paramref name="pemData"/> does not contain a well-formed PEM-encoded value.

/// <paramref name="pemData"/> does not contain a well-formed PEM encoded value.
/// </exception>
/// <returns>
/// A <see cref="PemFields"/> structure that contains the location, label, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// A <see cref="PemFields"/> structure that contains the location, label, and
/// A value that specifies the location, label, and

[Theory]
[InlineData("\tOOPS")]
[InlineData(" OOPS")]
[InlineData("-OOPS")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[InlineData("-OOPS")]
[InlineData("-OOPS")]
[InlineData("OO PS")]

}

[Fact]
public static void TryFind_False_PrecedingLinesAndSignificantCharsBeforePreeb()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess even under full lax, the character before the starting - of the pre-encapsulation boundary must not exist or be whitespace.

[Fact]
public static void TryFind_False_ContentOnPostEbLine()
{
string content = "-----BEGIN TEST-----\nZm9v\n-----END TEST-----boop";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accept this interpretation, even under full lax.

[Fact]
public static void TryFind_False_IncompletePostEncapBoundary()
{
string content = "-----BEGIN TEST-----\nZm9v\n-----END TEST";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I saw a test, one way or the other, on mismatched labels between preeb and posteb.

public static void TryFind_False_NoPostEncapBoundary()
{
string content = "-----BEGIN TEST-----\nZm9v\n";
Assert.False(PemEncoding.TryFind(content, out _));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we want subclassing, or double-duty; but it's probably good to have the tests always throw the same thing at both Find and TryFind, checking for the boolean/exception, and if TryFind succeeds that they give the same data.

With subclassing, it'd be something like

protected abstract PemFields Find(ReadOnlySpan<char> text);
protected abstract void FailToFind<TException>(ReadOnlySpan<char> text) where TException : Exception;

// (I have it returning the value, in case any multi-read tests wanted to compare two sets against each other)
private PemFields TestSuccess(string content, Range expectedLocation, Range expectedLabel, Range expectedBase64Data, int offset = 0)
{
    PemFields fields = Find(content.AsSpan(offset));
    // helper, as described before
}

public class PemEncodingFindTests_TryFind : PemEncodingFindTests
{
    protected override void FailToFind<TException>(ReadOnlySpan<char> text) where TException : Exception
    {
        Assert.False(PemEncoding.TryFind(text, out PemFields fields));
        Assert.True(fields.Location.Equals(default(Range)));
        ...
    }
}

@lpereira
Copy link
Contributor

Has this been fuzzed?

@vcsjones
Copy link
Member Author

@bartonjs

I think this is ready for another round of review. Some comments:

  1. I am not sure how best to address the performance suggestions of handling the base64. The summary of the discussion so far:

    • It's currently slower than the actual decoding implementation because it's a straight forward implementation that looks at it character-by-character.
    • System.Buffers.Text.Base64 is not suitable because it does not handle white space.
    • Convert.*Base64* APIs only do actual decoding. If we're worried about perf, we could decode in to a rented buffer.

    My level of comfort with performance-all-the-things isn't that high. My proposal would be to leave it as-is, and create a new API proposal somewhere that can use some of the existing decoding machinery to get the length with validation as well. I see utility of this as an API that is more generally available elsewhere. Alternatively, we could just use the existing decoding APIs and decode in to a scratch buffer. I expect most PEM encoded data to be a certificate or RSA-2048 PKCS1/8 key. That probably means decoding in to a stack buffer isn't worth implementing, so, pooled buffer. Decoding potentially private keys in to a pooled buffer makes me somewhat hesitant, but it looks like there are existing patterns to make that reasonable.

  2. I got a little lost trying to follow the suggestions for doc-comments on PemFields, how <value> and <summary> relate. I think somewhere in there someone said "don't have both", so I stuck with <summary>.

@vcsjones
Copy link
Member Author

@ericsampson I don't think the intention was to ship this out of band. I don't know what's involved in that; or if there is a better place for these APIs to live that also ships out of band. @bartonjs?

@ericsampson
Copy link

@vcsjones, thanks for considering if there is a way/place for it. There’s a lot of Core 2 apps that aren’t going away anytime soon :)

@bartonjs
Copy link
Member

Since it's not modifying existing types it's not impossible from a technical sense; but things get awkward if we release them as a NuGet package and also ship it as part of the framework for newer releases.

I can't say that it'll happen; but it's something that I'll bring up along with some other utility things in a similar bucket.

}

public static void AssertThrows<E, T>(Span<T> span, AssertThrowsAction<T> action) where E : Exception
public static Exception AssertThrows<E, T>(Span<T> span, AssertThrowsAction<T> action) where E : Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't these return E as a stronger return type?

return exception;
}

public static void Throws<E, T>(string expectedParamName, ReadOnlySpan<T> span, AssertThrowsActionReadOnly<T> action)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other Throws(expectedParamName methods return the exception; might as well go ahead and do it here for consistency (of type E).

And it probably makes sense to add the writable-span version for symmetry.

// The PostEB must either end at the end of the string, or
// have at least one white space character after it.
if (pemEndIndex < pemData.Length - 1 &&
!char.IsWhiteSpace(pemData[pemEndIndex]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrabYourPitchforks Do you have opinions on this "IsWhiteSpace" for determining the acceptability for end-of-payload?

The idea is -----BEGIN FOO-----base64==----END FOO----- is legal, -----BEGIN FOO-----base64==----END FOO-----M isn't (post-encapsulation-boundary might be part of some other text), but -----BEGIN FOO-----base64==----END FOO----- M is. If the space is U+200A (hair space) is that enough to say we've hit the "end of the word" and thus the end of the post-EB? Or would you prefer we stick to the RFC's *W production and limit it to { space, \t, \r, \n, \v, Form-feed } or even our limited *W ({ space, \t, \r, \n })?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking we should never have a default behavior more lax than the RFC (though we can be more strict by default). This is the same philosophy we also followed with the JSON stack. So I'd suggest limiting this to RFC-blessed whitespace chars.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the end of the payload. Nevermind then. No real opinions one way or another. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vcsjones Can you throw in a test that uses \u200A before and after to show that it satisfies the pre-preEB/post-postEB boundary constraints? Alternatively, change the two IsWhiteSpaces to only allow space-tab-CR-LF and add a test that says we don't support the full set of IsWhiteSpace?

Since both @GrabYourPitchforks and I are iffy, it's your call on permissive or restrictive at the boundary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartonjs
Since the base-64 decoding does not permit /v and /f, it made sense to me to limit the base-64 check to the more narrow definition of white space. It also makes sense to me to have a single definition of "white space" when decoding, so I changed the pre and post boundaries to match the same definition of white space as the base-64 validation.

if (!TryCountBase64(pemData[contentRange],
out int base64start,
out int base64end,
out int decodedSize))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like they'd all fit on one line for the normal github aperture. If you want/need the newlines, chop before the first argument and only indent 4 spaces.

Comment on lines +151 to +152
Range base64range = (contentStartIndex + base64start)..
(contentStartIndex + base64end);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Range base64range = (contentStartIndex + base64start)..
(contentStartIndex + base64end);
Range base64range = (contentStartIndex + base64start)..(contentStartIndex + base64end);

or

Suggested change
Range base64range = (contentStartIndex + base64start)..
(contentStartIndex + base64end);
Range base64range =
(contentStartIndex + base64start)..(contentStartIndex + base64end);

int size = PostEBPrefix.Length + label.Length + Ending.Length;
Debug.Assert(destination.Length >= size);
PostEBPrefix.AsSpan().CopyTo(destination);
label.CopyTo(destination[PostEBPrefix.Length..]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the left-defined-right-open calls look much nicer as explicitly saying destination.Slice(PostEBPrefix.Length).

Maybe I'll feel different in a few years :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'll feel different in a few years :)

I kind of agree but felt weird mixing range-style slicing and literally calling Slice.

I think that the left-defined-right-open calls look much nicer as explicitly saying

Is this a stream of consciousness thought, or a "please change [X..] to Slice(x)??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a stream of consciousness thought, or a "please change [X..] to Slice(x)??

An "I'd prefer" 😄.

We don't have a style rule or strong convention; but some of us have a general leeriness of the syntax because str[a..b] is Substring, and arr[a..b] is effectively Slice().ToArray() -- the decision for range indexers is they return a collection of the same type, not a Span/Memory-equivalent projection. For Span/ROSpan that means it's "zero"-cost slicing, but as syntax it's a bit of a burden on reviewers.

My personal filter is that I can accept the reviewer burden on "wait, are we a span type already?" for [start..end] (to avoid changing it to start/count); but for [start..] (Slice(start)) and [..end] (Slice(0, end)) the call to Slice reduces my cognitive load without adding increased load at changing parameter semantics. (I'm also happy with changing [start..end] to Slice(start, end - start), but it's more contextual)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. I removed the open-ended Range syntax in favor of explicit Slice. It's much to my preference to keep the [x..y] syntax for the reason you described.

previousSpaceOrHyphen = true;
}

return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space-or-hyphen has to be followed by another labelchar. So isn't this return !previousSpaceOrHyphen;?

I think right now it'll say "HELLO-" is a valid label. (I don't see a test confirming one way or another)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct. I misread the RFC.

Last character must be a labelchar.
Comment on lines +167 to +168
[InlineData("H")]
public void TryFind_True_LabelsWithHyphenSpace(string label)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"H" has no hyphens or spaces, so ideally either it moves to a different method, or the name just gets updated here.

It also looks like the name says TryFind but the body uses Find.

@bartonjs bartonjs removed the NO-REVIEW Experimental/testing PR, do NOT review it label Mar 19, 2020
bartonjs
bartonjs previously approved these changes Mar 19, 2020
@bartonjs bartonjs dismissed their stale review March 19, 2020 22:45

Waiting on a test codifying pre-preEB/post-postEB boundary.

@bartonjs bartonjs merged commit 11c8c80 into dotnet:master Mar 22, 2020
@vcsjones vcsjones deleted the one-shot-pem branch March 22, 2020 19:26
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

One-shot PEM reader

9 participants