Fixes StringBuilder unbounded size growth in Clear() when we use a mix of Append and Insert#16926
Conversation
| info.AddValue(ThreadIDField, 0); | ||
| } | ||
|
|
||
| [System.Diagnostics.Conditional("DEBUG")] |
There was a problem hiding this comment.
I will move this method to a new file since StringBuilder is already a partial class and this class is already long
| } | ||
|
|
||
| [System.Diagnostics.Conditional("DEBUG")] | ||
| private void DebugDump() |
There was a problem hiding this comment.
I would make this return IEnumerable. That way you don't need to limit the number of chunks you return. As you expand in the watch window in the debugger, you see however many you want.
In the body you would use the yield keyword
There was a problem hiding this comment.
As you point out you must then use #if DEBUG. Or better still you can conditionally include it in the project using Condition="'$(Configuration)' == 'Debug'" .. but nobody else seems to do that so maybe there's a reason not to do that.
There was a problem hiding this comment.
I would avoid the word Dump in the method name. But this being a debug-only print, maybe it is ok to have it like this?
There was a problem hiding this comment.
I dont mind what it's called. ShowChunks ?
|
Your "fixes" note at the top won't work as it's a CoreFX number. You should change it to |
| var debugText = new StringBuilder(); | ||
| for (int i = 0; chunk != null && i < maxChunksToPrint; i++) | ||
| { | ||
| debugText.Insert(0, "|" + new string(chunk.m_ChunkChars)); |
There was a problem hiding this comment.
Ironically by doing Insert you are forcing StringBuilder to behave suboptimally...
You should be able to just Append repeatedly to add each char in the chunk.
Or if you want to get fancy (doesn't matter given it's debug code, but as a learning exercise) you can avoid creating a StringBuilder entirely and write directly into a string. See https://msdn.microsoft.com/en-us/magazine/mt814808.aspx and search for text "Now, not only have you avoided the allocation, you’re writing directly into the string’s memory on the heap"
| } | ||
| debugText.Append("|"); | ||
|
|
||
| Debug.WriteLine(debugText.ToString().Replace("\0", ".")); |
There was a problem hiding this comment.
As discussed instead of writing to Debug stream, I suggest yield return stringRepresentingAChunk;
There was a problem hiding this comment.
Once you're returning an enumreable, you won't need the | to separate blocks. Why not add a second debug method to give a single string with | separators, simply:
public string DebugDumpString()
{
return string.Join("|", DebugDump());
}|
|
||
| public StringBuilder Clear() | ||
| { | ||
| Capacity = Math.Max(DefaultCapacity, Math.Min(Capacity, (int)(Length * 1.2))); |
There was a problem hiding this comment.
needs a comment for your successor developers, eg something like
// Retain enough storage to store the same size content, plus 20% buffer, to help avoid reallocating during repeated useWe do'nt need to mention "and also because currently using Insert can lead to pathologically large allocations" ...
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\Latin1Encoding.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\NormalizationForm.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.Debug.cs" /> |
There was a problem hiding this comment.
Should this file be included with a Condition for Debug Configuration only?
| { | ||
| chars[i] = curChunksChars[i]; | ||
| } | ||
| }).Replace('\0', '.'); |
There was a problem hiding this comment.
We're doing replace to better visibly see the indices where char is empty in the chunk
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\Latin1Encoding.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\NormalizationForm.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.cs" /> | ||
| <Compile Condition="'$(Configuration)' == 'Debug'" Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.Debug.cs" /> |
There was a problem hiding this comment.
Nit: we usually put the condition after the file name.
| Debug.WriteLine('|' + string.Join('|', ShowChunksInOrder(maxChunksToShow)) + '|'); | ||
| } | ||
|
|
||
| private IEnumerable<string> ShowChunksInOrder(int maxChunksToShow) |
There was a problem hiding this comment.
I didnt end up using yield because I am populating the IEnumerable from last to first before returning
|
Is this change a pure improvement across the board, or are there situations where this change is going to regress the behavior? If yes - what are those and how much are they regressing in the worst cast? |
To answer that we would need to have a better understanding of when people call Clear() and what they do after that. @vancem do you have a sense of that? |
|
Also, this change is changing observable behaviors, including exceptions to be thrown in cases where they were not thrown before. Consider: |
| public StringBuilder Clear() | ||
| { | ||
| // Retain enough storage to store the same size content, plus 20% buffer, to help avoid reallocating during repeated use | ||
| Capacity = Math.Max(DefaultCapacity, Math.Min(Capacity, (int)(Length * 1.2))); |
There was a problem hiding this comment.
I would consider Clear() to be on the 'hot' path, so I care about it being 'lean' My concern her is that setting the capacity here forces it to potentially copy the data that is currently in the stringbuilder to its new home THEN clear it (which is clearly a waste)
This also does not fix the issue if someone uses sb.Length=0 (instead of Clear()).
Thus I would prefer if the change is in Length
if (chunk != this)
{
// Avoid possible infinite capacity growth. See https://github.com/dotnet/coreclr/pull/16926
int capacityToPreserve = Math.Min(Capacity, Math.Max(Length * 6 / 5, m_ChunkChars.Length));
// We crossed a chunk boundary when reducing the Length. We must replace this middle-chunk with a new larger chunk,
// to ensure the original capacity is preserved.
int newLen = capacityToPreserve - chunk.m_ChunkOffset;
While you are in there I would also suggest putting an #if DEBUG around the call to
-
int originalCapacity = Capacity;
Since this is on the critical path for Clear() and originalCapacity is only (now) used for Debug code and I am not sure that the JIT optimizer could remove the code associated with this statement completely.
There was a problem hiding this comment.
I tweeked the code above to change DefaultCapacity to m_ChunkChars.Length. I did this because it is best to preserve (as much as we can) the invariant that Capacity does not shrink (to preserve the Stringbuilder growing until a high water mark and then being efficient thereafter).
In the code I suggested, I am only LOWER the Capacity (that is change current behavior) in pretty narrow circumstances. In particular it is never lowered below the last chunk size (which is typically the largest chunk). I can come up with cases where we would be less efficient than before, but they involve insert() and they were probably not very efficient before either (we have not fixed that). |
vancem
left a comment
There was a problem hiding this comment.
Looks good. In summary, to keep the capacity from growing in an unbounded, way, we cap it to be 120% of the TOTAL length OR the capacity of the last chunk, whichever is larger.
The most interesting cases are when stringbuilders are reused. In this case unless 'insert' is used, you will end up with one chunk, and thus the capacity does not change (since capacity == m_chunkChars.Length). Thus there is no perf difference.
In a multi-buffer case (the ones that caused the bug), it caps it to the last chunk or 1.2X the length. This also feels benign.
vancem
left a comment
There was a problem hiding this comment.
I see that you removed the asserts, but think that is OK because the first assert is not interesting (we are not changing anything that would change capacity), and the second assert is simply no longer true (we are intentionally breaking it).
| int numChunksToShow = 0; | ||
| StringBuilder lastChunk = null; | ||
| // Gets numChunksToShow. If numChunksToShow is larger than maxChunksToShow, then returns last chunk | ||
| GetNumChunksToShow(this, ref numChunksToShow, maxChunksToShow, ref lastChunk); |
There was a problem hiding this comment.
ref int is unusual - if you do not recurse you can simply make it a return value.
consider renaming lastChunk to headChunk since last is a bit ambiguous (it's first logically, last in the datastructure)
rather than ref parameters, consider returning a tuple (int numChunksToShow, StringBuilder headChunk)
| private void GetNumChunksToShow(StringBuilder sb, ref int numToShow, int maxChunksToShow, ref StringBuilder lastChunk) | ||
| { | ||
| if (sb.m_ChunkPrevious != null) | ||
| GetNumChunksToShow(sb.m_ChunkPrevious, ref numToShow, maxChunksToShow, ref lastChunk); |
There was a problem hiding this comment.
Generally I would not use recursion for something like this that can be done simply and easily without recursion, because it adds the risk of stack overflow (in this case there there are a huge number of chunks). Also it is nice to avoid ref parameters because they have a little more cognitive load. Eg.,
private (int numChunksToShow, StringBuilder headChunk) GetNumChunksToShow(StringBuilder sb, int maxChunksToShow)
{
StringBuilder current = sb.m_ChunkPrevious;
int count = 0;
while (current != null && count < maxChunksToShow)
{
count++;
current = current.m_ChunkPrevious;
}
return (count, current)
}|
|
||
| namespace System.Text | ||
| { | ||
| public sealed partial class StringBuilder : ISerializable |
There was a problem hiding this comment.
Nit, interface implementation (same as base class derivation) need not be declared on all partial classes. In this case as ISerializable has nothing to do with this file, I would remove it. It's just more to maintain.
| StringBuilder lastChunk = null; | ||
| // Gets numChunksToShow. If numChunksToShow is larger than maxChunksToShow, then returns last chunk | ||
| GetNumChunksToShow(this, ref numChunksToShow, maxChunksToShow, ref lastChunk); | ||
| Span<string> chunkChars = new string[numChunksToShow]; |
There was a problem hiding this comment.
What does Span gain you here? It's just an array and used only as an array.
| Span<string> chunkChars = new string[numChunksToShow]; | ||
| var sb = lastChunk ?? this; | ||
|
|
||
| while (numChunksToShow > 0) |
| sb = sb.m_ChunkPrevious; | ||
| numChunksToShow--; | ||
| } | ||
| return chunkChars.ToArray(); |
There was a problem hiding this comment.
This is calling ToArray on a Span that is wrapping an array.
| // Avoid possible infinite capacity growth. See https://github.com/dotnet/coreclr/pull/16926 | ||
| int capacityToPreserve = Math.Min(Capacity, Math.Max(Length * 6 / 5, m_ChunkChars.Length)); | ||
|
|
||
| // We crossed a chunk boundary when reducing the Length. We must replace this middle-chunk with a new larger chunk, |
There was a problem hiding this comment.
This comment should probably be above the line you added above.
|
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test please |
028f56b to
2246bf3
Compare
| { | ||
| int numChunks = 0; | ||
| StringBuilder current = this; | ||
| while (current != null) |
There was a problem hiding this comment.
You could have done this in one loop, by starting off storing head = this and once numChunks > maxChunksToShow you know you have skipped enough, and each loop after that you update head = head.m_ChunkPrevious. When you have looped to the end, head has moved along numChunks - maxChunksToShow times, and is pointing to the chunk you want.
Once you have done that I would inline this method into the one above. Actually I would just have a single method for everything.
551e283 to
622a9ea
Compare
| current = current.m_ChunkPrevious; | ||
| } | ||
| current = head; | ||
| string[] chunks = new string[count]; |
There was a problem hiding this comment.
@danmosemsft initially because you requested we have a method that returns IEnumerable, showing one chunk at a time, then I decided to create this string array.
But now I think you suggested to keep just one method for the entire ShowChunks() method. Therefore I won't need this string array and can then also use span to set the output.
| } | ||
| current = head; | ||
| string[] chunks = new string[count]; | ||
| for (int i = count; i > 0; i--) |
There was a problem hiding this comment.
Nit, not worth resetting for, it would be more normal to do this because within the loop the index is then natural.
for (int i = count - 1; i >= 0; i--)
chunks[i] ...
|
Shouldn't there be a very fast way to set I'm loading a big file and need to reset the |
Fixes dotnet/corefx#27625
Added tests in PR dotnet/corefx#28038
cc: @danmosemsft