One shot pem reader and writer #32260
Note: This serves as a reminder that when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change. |
Well, I tried. |
| Range postebLineRange = lineRange; | ||
| int postebLinePosition = preebLinePosition; | ||
|
|
||
| // in lax decoding a posteb may appear on the same line as the preeb. |
This wasn't actually immediately clear to me from the RFC, if we even want to support single-line PEMs.
If not, this code can get a little simpler but I suspect someone will wish it "just worked".
Since I think we're supporting lax base64 (we don't require that there be exactly 64 base64 characters per line with no whitespace other than newlines) it seems like we want the full laxtextualmsg treatment, so no newlines are required anywhere
laxtextualmsg = *W preeb
laxbase64text
posteb *W
preeb = "-----BEGIN " label "-----" ; unlike [RFC1421] (A)BNF,
; eol is not required (but
posteb = "-----END " label "-----" ; see [RFC1421], Section 4.4)
W = WSP / CR / LF / %x0B / %x0C ; whitespace
laxbase64text = *(W / base64char) [base64pad *W [base64pad *W]]
Presumably, removing all newline processing (and ReadLine) and just using IndexOf to find preeb/posteb would actually make things simpler?
Presumably, removing all newline processing (and ReadLine) and just using IndexOf to find preeb/posteb would actually make things simpler?
It still behaved as lax, but agree that it isn't necessary to handle line-by-line. I will address. It seems I may have misunderstood the whitespace requirement around the preeb / posteb for lax.
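To illustrate the IndexOf-based approach suggested above, here is a minimal sketch. It is not the code in this PR: `LaxPemSketch` and `TryLocate` are hypothetical names, and the sketch deliberately skips label validation, base64 validation, and the checks on the characters surrounding the boundaries.

```csharp
using System;

static class LaxPemSketch
{
    private const string PreEBPrefix = "-----BEGIN ";
    private const string PostEBPrefix = "-----END ";
    private const string Ending = "-----";

    // Hypothetical helper: finds the label and the text between the boundaries
    // without any line-by-line processing. No newlines are required anywhere.
    public static bool TryLocate(ReadOnlySpan<char> text, out Range label, out Range content)
    {
        label = default;
        content = default;

        int preebStart = text.IndexOf(PreEBPrefix.AsSpan());
        if (preebStart < 0)
        {
            return false;
        }

        // The label runs from the end of "-----BEGIN " to the next "-----".
        int labelStart = preebStart + PreEBPrefix.Length;
        int labelLength = text.Slice(labelStart).IndexOf(Ending.AsSpan());
        if (labelLength < 0)
        {
            return false;
        }

        int contentStart = labelStart + labelLength + Ending.Length;

        // Build "-----END <label>-----" and look for it after the preeb.
        string posteb = string.Concat(
            PostEBPrefix, text.Slice(labelStart, labelLength).ToString(), Ending);
        int contentLength = text.Slice(contentStart).IndexOf(posteb.AsSpan());
        if (contentLength < 0)
        {
            return false;
        }

        label = labelStart..(labelStart + labelLength);
        content = contentStart..(contentStart + contentLength);
        return true;
    }
}
```

With this shape a single-line PEM such as `-----BEGIN TEST-----Zm9v-----END TEST-----` is located the same way as a multi-line one, because newlines are never treated specially.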
|
Looking at failures. |
|
OSX failures are #702. |
| continue; | ||
| } | ||
|
|
||
| int postebLength = Posteb.Length + label.Length + Ending.Length; |
I don't think so. label's length is going to be based on a found label in the PEM, and we've already checked for the "-----BEGIN " and "-----", so we know that the ROS (whose max length is an Int32) contains something of this length.
| for (int index = 1; index < data.Length; index++) | ||
| { | ||
| char c = data[index]; | ||
| if (!IsLabelChar(c) && c != ' ' && c != '-') |
Redundant c != '-' check?
I don't think so. IsLabelChar asks "Is this a valid character, other than '-'?" The first character is allowed to be anything except a space or hyphen. The other characters may be any of those characters, including the hyphen or space. So it's a double negation. I'll try to make this easier to follow.
| int precedingWhiteSpace = 0; | ||
| int trailingWhiteSpace = 0; | ||
| (int offset, int length) = content.GetOffsetAndLength(data.Length); | ||
| for (int index = offset; index < offset + length; index++) |
I don't know how important perf is here, but I suspect this validation is quite slow (and doing the actual encoding using the Base64 APIs built on top of avx2/Ssse3 might actually be faster).
Alternatively, we could use some of the Base64 implementation as inspiration to make this method faster.
How large can length be?
but I suspect this validation is quite slow
Yeah, it occurred to me but when I looked around for inspiration such as at
I see it still doing character-by-character examination to account for various whitespace (albeit with unsafe tricks), just that the method has a looser definition of whitespace than the RFC.
Is there a better place to be looking here for this?
How large can length be?
It could be very large, but the typical use-case for PEM is PKI and CMS. CMS can be several thousand characters.
Validation could be moved to System.Buffers.Text.Base64. It is already vectorized, and contains a (private) decoding map that can be used to efficiently validate the input and still provide a scalar fallback. The functions that decode using AVX2 and SSSE3 already check for invalid input before decoding, so it should be trivial.
We would need to make that validation "whitespace" aware. The existing S.B.T.Base64 APIs do not support ignoring whitespace characters.
FWIW, something like this (which includes the work of actually decoding) is ~2x faster than doing the linear validation (for a ~1k buffer without any leading/trailing whitespace). I am not necessarily suggesting that this is the way to go (especially given the unnecessary work it does, and it is nowhere near "optimal"), but at the very least it avoids having to provide your own custom logic for the base64 validation. Ideally, this type of API is built in to either the Convert or Base64 classes.
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015914
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
Job-XEOKAL : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Toolchain=CoreRun MaxIterationCount=10
MinIterationCount=5 WarmupCount=3
| Method | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ValidateOriginal | 6.308 us | 0.1829 us | 0.1088 us | 6.316 us | 6.186 us | 6.548 us | 1.00 | - | - | - | - |
| ValidateNew | 3.290 us | 0.0724 us | 0.0431 us | 3.302 us | 3.214 us | 3.352 us | 0.52 | - | - | - | - |
// Requires: using System; using System.Buffers; using System.Buffers.Text;
private static readonly char[] s_whitespace = new char[] { ' ', '\t', '\r', '\n' };
private static readonly char[] s_base64 = new char[]
{
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D',
    'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
    'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
    'Y', 'Z', '+', '/'
};

private bool IsValidBase64(out int decodedSize)
{
    ReadOnlySpan<char> input = _base64String.AsSpan();

    // Skip leading whitespace by starting at the first base64 character.
    int startIndex = input.IndexOfAny(s_base64);
    if (startIndex == -1)
    {
        startIndex = 0;
    }
    input = input.Slice(startIndex);

    // Drop trailing whitespace only. Cutting at LastIndexOfAny(s_whitespace)
    // would truncate any input that has whitespace in the middle.
    int finalIndex = input.Length;
    while (finalIndex > 0 && Array.IndexOf(s_whitespace, input[finalIndex - 1]) >= 0)
    {
        finalIndex--;
    }
    input = input.Slice(0, finalIndex);

    // Decode into a pooled buffer; TryFromBase64Chars performs the validation.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(Base64.GetMaxDecodedFromUtf8Length(input.Length));
    bool result = Convert.TryFromBase64Chars(input, buffer, out decodedSize);
    if (!result)
    {
        decodedSize = 0;
    }
    ArrayPool<byte>.Shared.Return(buffer, clearArray: true);
    return result;
}
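For comparison, a whitespace-aware scalar validation along the lines discussed in this thread might look like the sketch below. This is an illustration only, not the PR's implementation: `Base64CountSketch` and its helper names are hypothetical, and whitespace is assumed to be limited to space, tab, CR, and LF.

```csharp
using System;

static class Base64CountSketch
{
    // Assumption: only these four characters count as whitespace.
    private static bool IsWhiteSpace(char c) =>
        c == ' ' || c == '\t' || c == '\r' || c == '\n';

    private static bool IsBase64Char(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == '+' || c == '/';

    // Validates base64 text while ignoring whitespace, and reports how many
    // bytes the text would decode to. Nothing is decoded or allocated.
    public static bool TryCountBase64(ReadOnlySpan<char> text, out int decodedSize)
    {
        decodedSize = 0;
        int dataChars = 0;
        int padding = 0;

        foreach (char c in text)
        {
            if (IsWhiteSpace(c))
            {
                continue;
            }

            if (c == '=')
            {
                // At most two padding characters are allowed.
                if (++padding > 2)
                {
                    return false;
                }
            }
            else if (padding > 0 || !IsBase64Char(c))
            {
                // Data after padding, or a character outside the alphabet.
                return false;
            }
            else
            {
                dataChars++;
            }
        }

        int total = dataChars + padding;
        if (total % 4 != 0)
        {
            return false;
        }

        decodedSize = total / 4 * 3 - padding;
        return true;
    }
}
```

The trade-off is the one raised above: this loop is easy to reason about against the RFC grammar, but it examines one character at a time and gets no benefit from the vectorized decoders.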
| namespace System.Security.Cryptography | ||
| { | ||
| /// <summary> | ||
| /// RFC-7468 PEM (Privacy-Enhanced Mail) parsing and encoding. |
| /// RFC-7468 PEM (Privacy-Enhanced Mail) parsing and encoding. | |
| /// Provides methods for reading and writing the IETF RFC 7468 subset of PEM (Privacy-Enhanced Mail) | |
| /// textual encodings. This class cannot be inherited. |
| public static class PemEncoding | ||
| { | ||
| private const string Preeb = "-----BEGIN "; | ||
| private const string Posteb = "-----END "; |
I think these should be PreEB and PostEB, since EB is "encapsulation boundary". Perhaps even PreEBPrefix/PostEBPrefix
| /// Finds the first PEM-encoded data. | ||
| /// </summary> | ||
| /// <param name="pemData"> | ||
| /// A span containing the PEM encoded data. |
| /// A span containing the PEM encoded data. | |
| /// The text containing the PEM-encoded data. |
| /// A span containing the PEM encoded data. | ||
| /// </param> | ||
| /// <exception cref="ArgumentException"> | ||
| /// <paramref name="pemData"/> does not contain a well-formed PEM encoded value. |
| /// <paramref name="pemData"/> does not contain a well-formed PEM encoded value. | |
| /// <paramref name="pemData"/> does not contain a well-formed PEM-encoded value. |
| /// <paramref name="pemData"/> does not contain a well-formed PEM encoded value. | ||
| /// </exception> | ||
| /// <returns> | ||
| /// A <see cref="PemFields"/> structure that contains the location, label, and |
| /// A <see cref="PemFields"/> structure that contains the location, label, and | |
| /// A value that specifies the location, label, and |
| [Theory] | ||
| [InlineData("\tOOPS")] | ||
| [InlineData(" OOPS")] | ||
| [InlineData("-OOPS")] |
| [InlineData("-OOPS")] | |
| [InlineData("-OOPS")] | |
| [InlineData("OO PS")] |
| } | ||
|
|
||
| [Fact] | ||
| public static void TryFind_False_PrecedingLinesAndSignificantCharsBeforePreeb() |
I guess even under full lax, the character before the starting - of the pre-encapsulation boundary must not exist or be whitespace.
| [Fact] | ||
| public static void TryFind_False_ContentOnPostEbLine() | ||
| { | ||
| string content = "-----BEGIN TEST-----\nZm9v\n-----END TEST-----boop"; |
I accept this interpretation, even under full lax.
| [Fact] | ||
| public static void TryFind_False_IncompletePostEncapBoundary() | ||
| { | ||
| string content = "-----BEGIN TEST-----\nZm9v\n-----END TEST"; |
I don't think I saw a test, one way or the other, on mismatched labels between preeb and posteb.
| public static void TryFind_False_NoPostEncapBoundary() | ||
| { | ||
| string content = "-----BEGIN TEST-----\nZm9v\n"; | ||
| Assert.False(PemEncoding.TryFind(content, out _)); |
I don't know if we want subclassing, or double-duty; but it's probably good to have the tests always throw the same thing at both Find and TryFind, checking for the boolean/exception, and if TryFind succeeds that they give the same data.
With subclassing, it'd be something like
protected abstract PemFields Find(ReadOnlySpan<char> text);
protected abstract void FailToFind<TException>(ReadOnlySpan<char> text) where TException : Exception;

// (I have it returning the value, in case any multi-read tests wanted to compare two sets against each other)
private PemFields TestSuccess(string content, Range expectedLocation, Range expectedLabel, Range expectedBase64Data, int offset = 0)
{
    PemFields fields = Find(content.AsSpan(offset));
    // helper, as described before
    return fields;
}

public class PemEncodingFindTests_TryFind : PemEncodingFindTests
{
    // The constraint on TException is inherited from the abstract declaration,
    // so the override does not restate it.
    protected override void FailToFind<TException>(ReadOnlySpan<char> text)
    {
        Assert.False(PemEncoding.TryFind(text, out PemFields fields));
        Assert.True(fields.Location.Equals(default(Range)));
        ...
    }
}
Has this been fuzzed? |
|
I think this is ready for another round of review. Some comments:
|
|
@ericsampson I don't think the intention was to ship this out of band. I don't know what's involved in that; or if there is a better place for these APIs to live that also ships out of band. @bartonjs? |
|
@vcsjones, thanks for considering if there is a way/place for it. There’s a lot of Core 2 apps that aren’t going away anytime soon :) |
|
Since it's not modifying existing types it's not impossible from a technical sense; but things get awkward if we release them as a NuGet package and also ship it as part of the framework for newer releases. I can't say that it'll happen; but it's something that I'll bring up along with some other utility things in a similar bucket. |
| } | ||
|
|
||
| public static void AssertThrows<E, T>(Span<T> span, AssertThrowsAction<T> action) where E : Exception | ||
| public static Exception AssertThrows<E, T>(Span<T> span, AssertThrowsAction<T> action) where E : Exception |
Can't these return E as a stronger return type?
| return exception; | ||
| } | ||
|
|
||
| public static void Throws<E, T>(string expectedParamName, ReadOnlySpan<T> span, AssertThrowsActionReadOnly<T> action) |
The other Throws(expectedParamName methods return the exception; might as well go ahead and do it here for consistency (of type E).
And it probably makes sense to add the writable-span version for symmetry.
| // The PostEB must either end at the end of the string, or | ||
| // have at least one white space character after it. | ||
| if (pemEndIndex < pemData.Length - 1 && | ||
| !char.IsWhiteSpace(pemData[pemEndIndex])) |
@GrabYourPitchforks Do you have opinions on this "IsWhiteSpace" for determining the acceptability for end-of-payload?
The idea is -----BEGIN FOO-----base64==-----END FOO----- is legal, -----BEGIN FOO-----base64==-----END FOO-----M isn't (post-encapsulation-boundary might be part of some other text), but -----BEGIN FOO-----base64==-----END FOO----- M is. If the space is U+200A (hair space), is that enough to say we've hit the "end of the word" and thus the end of the post-EB? Or would you prefer we stick to the RFC's *W production and limit it to { space, \t, \r, \n, \v, Form-feed } or even our limited *W ({ space, \t, \r, \n })?
Generally speaking we should never have a default behavior more lax than the RFC (though we can be more strict by default). This is the same philosophy we also followed with the JSON stack. So I'd suggest limiting this to RFC-blessed whitespace chars.
Ah, the end of the payload. Never mind then. No real opinions one way or another. :)
@vcsjones Can you throw in a test that uses \u200A before and after to show that it satisfies the pre-preEB/post-postEB boundary constraints? Alternatively, change the two IsWhiteSpaces to only allow space-tab-CR-LF and add a test that says we don't support the full set of IsWhiteSpace?
Since both @GrabYourPitchforks and I are iffy, it's your call on permissive or restrictive at the boundary.
@bartonjs
Since the base-64 decoding does not permit \v and \f, it made sense to me to limit the base-64 check to the more narrow definition of white space. It also makes sense to me to have a single definition of "white space" when decoding, so I changed the pre and post boundaries to match the same definition of white space as the base-64 validation.
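A minimal sketch of what a single, narrow white-space definition at the boundaries could look like. The names `BoundarySketch`, `IsWhiteSpaceCharacter`, and `HasValidSurroundings` are hypothetical, and the index conventions are assumptions made for this illustration rather than the PR's actual code.

```csharp
using System;

static class BoundarySketch
{
    // One definition of white space, shared by the boundary checks and the
    // base-64 validation: space, tab, LF, CR. No \v, \f, or Unicode spaces.
    private static bool IsWhiteSpaceCharacter(char ch) =>
        ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r';

    // preebStart: index of the first '-' of the pre-encapsulation boundary.
    // postebEnd: index just past the last '-' of the post-encapsulation boundary.
    public static bool HasValidSurroundings(ReadOnlySpan<char> pemData, int preebStart, int postebEnd)
    {
        // The PreEB must start the text or be preceded by white space.
        bool before = preebStart == 0 || IsWhiteSpaceCharacter(pemData[preebStart - 1]);

        // The PostEB must end the text or be followed by white space.
        bool after = postebEnd == pemData.Length || IsWhiteSpaceCharacter(pemData[postebEnd]);

        return before && after;
    }
}
```

Under this definition a U+200A hair space next to either boundary is rejected, which is the restrictive option from the discussion above.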
| if (!TryCountBase64(pemData[contentRange], | ||
| out int base64start, | ||
| out int base64end, | ||
| out int decodedSize)) |
These look like they'd all fit on one line for the normal github aperture. If you want/need the newlines, chop before the first argument and only indent 4 spaces.
| Range base64range = (contentStartIndex + base64start).. | ||
| (contentStartIndex + base64end); |
| Range base64range = (contentStartIndex + base64start).. | |
| (contentStartIndex + base64end); | |
| Range base64range = (contentStartIndex + base64start)..(contentStartIndex + base64end); |
or
| Range base64range = (contentStartIndex + base64start).. | |
| (contentStartIndex + base64end); | |
| Range base64range = | |
| (contentStartIndex + base64start)..(contentStartIndex + base64end); |
| int size = PostEBPrefix.Length + label.Length + Ending.Length; | ||
| Debug.Assert(destination.Length >= size); | ||
| PostEBPrefix.AsSpan().CopyTo(destination); | ||
| label.CopyTo(destination[PostEBPrefix.Length..]); |
I think that the left-defined-right-open calls look much nicer as explicitly saying destination.Slice(PostEBPrefix.Length).
Maybe I'll feel different in a few years :)
Maybe I'll feel different in a few years :)
I kind of agree but felt weird mixing range-style slicing and literally calling Slice.
I think that the left-defined-right-open calls look much nicer as explicitly saying
Is this a stream of consciousness thought, or a "please change [X..] to Slice(x)"?
Is this a stream of consciousness thought, or a "please change [X..] to Slice(x)"?
An "I'd prefer" 😄.
We don't have a style rule or strong convention; but some of us have a general leeriness of the syntax because str[a..b] is Substring, and arr[a..b] is effectively Slice().ToArray() -- the decision for range indexers is they return a collection of the same type, not a Span/Memory-equivalent projection. For Span/ROSpan that means it's "zero"-cost slicing, but as syntax it's a bit of a burden on reviewers.
My personal filter is that I can accept the reviewer burden on "wait, are we a span type already?" for [start..end] (to avoid changing it to start/count); but for [start..] (Slice(start)) and [..end] (Slice(0, end)) the call to Slice reduces my cognitive load without adding increased load at changing parameter semantics. (I'm also happy with changing [start..end] to Slice(start, end - start), but it's more contextual)
Seems reasonable. I removed the open-ended Range syntax in favor of explicit Slice. It's much to my preference to keep the [x..y] syntax for the reason you described.
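For readers unfamiliar with the reviewer burden described above, a small illustration of how the same range syntax copies on an array but only projects on a span. `RangeSemanticsSketch` and `Demo` are hypothetical names used only for this example.

```csharp
using System;

static class RangeSemanticsSketch
{
    public static (int AfterArrayRangeWrite, int AfterSpanRangeWrite) Demo()
    {
        int[] arr = { 1, 2, 3, 4 };

        // On an array, a range indexer allocates a new array and copies.
        int[] copy = arr[1..3];
        copy[0] = 99;
        int afterArrayRangeWrite = arr[1]; // the original array is untouched

        // On a span, the same syntax is a slice over the same memory.
        Span<int> span = arr.AsSpan();
        Span<int> view = span[1..3];
        view[0] = 42;
        int afterSpanRangeWrite = arr[1]; // the write is visible through arr

        return (afterArrayRangeWrite, afterSpanRangeWrite);
    }
}
```

Writing `Slice(start)` instead of `[start..]` makes it clear at the call site that no copy is involved, which is the readability argument made in this thread.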
| previousSpaceOrHyphen = true; | ||
| } | ||
|
|
||
| return true; |
A space-or-hyphen has to be followed by another labelchar. So isn't this return !previousSpaceOrHyphen;?
I think right now it'll say "HELLO-" is a valid label. (I don't see a test confirming one way or another)
You're correct. I misread the RFC.
Last character must be a labelchar.
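As a sketch of the rule being agreed on here, based on my reading of the RFC 7468 `label` production (a label is empty, or a sequence of labelchars in which each space or hyphen sits between two labelchars). `LabelSketch` and `IsValidLabel` are hypothetical names, and this is not the PR's code.

```csharp
using System;

static class LabelSketch
{
    // labelchar: printable ASCII (0x21 to 0x7E) except hyphen-minus.
    private static bool IsLabelChar(char c) => c >= 0x21 && c <= 0x7E && c != '-';

    public static bool IsValidLabel(ReadOnlySpan<char> label)
    {
        // An empty label is permitted.
        if (label.IsEmpty)
        {
            return true;
        }

        // Starts false so that a leading space or hyphen is rejected.
        bool previousIsLabelChar = false;

        foreach (char c in label)
        {
            if (IsLabelChar(c))
            {
                previousIsLabelChar = true;
            }
            else if ((c == ' ' || c == '-') && previousIsLabelChar)
            {
                // A single separator is fine after a labelchar, but it must
                // be followed by another labelchar.
                previousIsLabelChar = false;
            }
            else
            {
                return false;
            }
        }

        // The last character must be a labelchar, so "HELLO-" is rejected.
        return previousIsLabelChar;
    }
}
```

Tracking whether the previous character was a labelchar, rather than whether it was a separator, lets the same flag reject leading, doubled, and trailing separators.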
| [InlineData("H")] | ||
| public void TryFind_True_LabelsWithHyphenSpace(string label) |
"H" has no hyphens or spaces, so ideally either it moves to a different method, or the name just gets updated here.
It also looks like the name says TryFind but the body uses Find.
Waiting on a test codifying pre-preEB/post-postEB boundary.
Closes #29588
I thought I would give this one a shot.
/cc @bartonjs