Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Fixes StringBuilder unbounded size growth in Clear() when we use a mix of Append and Insert#16926

Merged
danmoseley merged 10 commits into
dotnet:masterfrom
maryamariyan:fix-stringbuilder-clear
Mar 23, 2018
Merged

Fixes StringBuilder unbounded size growth in Clear() when we use a mix of Append and Insert#16926
danmoseley merged 10 commits into
dotnet:masterfrom
maryamariyan:fix-stringbuilder-clear

Conversation

@maryamariyan
Copy link
Copy Markdown

@maryamariyan maryamariyan commented Mar 14, 2018

Fixes dotnet/corefx#27625

Added tests in PR dotnet/corefx#28038

cc: @danmosemsft

info.AddValue(ThreadIDField, 0);
}

[System.Diagnostics.Conditional("DEBUG")]
Copy link
Copy Markdown
Author

@maryamariyan maryamariyan Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move this method to a new file since StringBuilder is already a partial class and this class is already long

}

[System.Diagnostics.Conditional("DEBUG")]
private void DebugDump()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this return IEnumerable. That way you don't need to limit the number of chunks you return. As you expand in the watch window in the debugger, you see however many you want.

In the body you would use the yield keyword

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you point out you must then use #if DEBUG. Or better still you can conditionally include it in the project using Condition="'$(Configuration)' == 'Debug'" .. but nobody else seems to do that so maybe there's a reason not to do that.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid the word Dump in the method name. But this being a debug-only print, maybe it is ok to have it like this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont mind what it's called. ShowChunks ?

@danmoseley
Copy link
Copy Markdown
Member

Your "fixes" note at the top won't work as it's a CoreFX number. You should change it to
Fixes dotnet/corefx#27625

var debugText = new StringBuilder();
for (int i = 0; chunk != null && i < maxChunksToPrint; i++)
{
debugText.Insert(0, "|" + new string(chunk.m_ChunkChars));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ironically by doing Insert you are forcing StringBuilder to behave suboptimally...

You should be able to just Append repeatedly to add each char in the chunk.

Or if you want to get fancy (doesn't matter given it's debug code, but as a learning exercise) you can avoid creating a StringBuilder entirely and write directly into a string. See https://msdn.microsoft.com/en-us/magazine/mt814808.aspx and search for text "Now, not only have you avoided the allocation, you’re writing directly into the string’s memory on the heap"

}
debugText.Append("|");

Debug.WriteLine(debugText.ToString().Replace("\0", "."));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed instead of writing to Debug stream, I suggest yield return stringRepresentingAChunk;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you're returning an enumreable, you won't need the | to separate blocks. Why not add a second debug method to give a single string with | separators, simply:

public string DebugDumpString()
{
     return string.Join("|", DebugDump());
}


public StringBuilder Clear()
{
Capacity = Math.Max(DefaultCapacity, Math.Min(Capacity, (int)(Length * 1.2)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a comment for your successor developers, eg something like

// Retain enough storage to store the same size content, plus 20% buffer, to help avoid reallocating during repeated use

We do'nt need to mention "and also because currently using Insert can lead to pathologically large allocations" ...

<Compile Include="$(MSBuildThisFileDirectory)System\Text\Latin1Encoding.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Text\NormalizationForm.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.Debug.cs" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file be included with a Condition for Debug Configuration only?

{
chars[i] = curChunksChars[i];
}
}).Replace('\0', '.');
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing replace to better visibly see the indices where char is empty in the chunk

<Compile Include="$(MSBuildThisFileDirectory)System\Text\Latin1Encoding.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Text\NormalizationForm.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.cs" />
<Compile Condition="'$(Configuration)' == 'Debug'" Include="$(MSBuildThisFileDirectory)System\Text\StringBuilder.Debug.cs" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we usually put the condition after the file name.

Debug.WriteLine('|' + string.Join('|', ShowChunksInOrder(maxChunksToShow)) + '|');
}

private IEnumerable<string> ShowChunksInOrder(int maxChunksToShow)
Copy link
Copy Markdown
Author

@maryamariyan maryamariyan Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt end up using yield because I am populating the IEnumerable from last to first before returning

@jkotas
Copy link
Copy Markdown
Member

jkotas commented Mar 15, 2018

Is this change a pure improvement across the board, or are there situations where this change is going to regress the behavior? If yes - what are those and how much are they regressing in the worst cast?

@danmoseley
Copy link
Copy Markdown
Member

Is this change a pure improvement across the board, or are there situations where this change is going to regress the behavior? If yes - what are those and how much are they regressing in the worst cast?

To answer that we would need to have a better understanding of when people call Clear() and what they do after that. @vancem do you have a sense of that?

@jkotas
Copy link
Copy Markdown
Member

jkotas commented Mar 15, 2018

Also, this change is changing observable behaviors, including exceptions to be thrown in cases where they were not thrown before. Consider:

var sb = new StringBuilder(14, 14);
sb.Append("Hello world!!!");
sb.Clear();

public StringBuilder Clear()
{
// Retain enough storage to store the same size content, plus 20% buffer, to help avoid reallocating during repeated use
Capacity = Math.Max(DefaultCapacity, Math.Min(Capacity, (int)(Length * 1.2)));
Copy link
Copy Markdown

@vancem vancem Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider Clear() to be on the 'hot' path, so I care about it being 'lean' My concern her is that setting the capacity here forces it to potentially copy the data that is currently in the stringbuilder to its new home THEN clear it (which is clearly a waste)

This also does not fix the issue if someone uses sb.Length=0 (instead of Clear()).

Thus I would prefer if the change is in Length

                    if (chunk != this)
                    {
                        // Avoid possible infinite capacity growth.  See https://github.com/dotnet/coreclr/pull/16926
                        int capacityToPreserve = Math.Min(Capacity, Math.Max(Length * 6 / 5, m_ChunkChars.Length));

                        // We crossed a chunk boundary when reducing the Length. We must replace this middle-chunk with a new larger chunk,
                        // to ensure the original capacity is preserved.
                        int newLen = capacityToPreserve - chunk.m_ChunkOffset;

While you are in there I would also suggest putting an #if DEBUG around the call to

  •             int originalCapacity = Capacity;
    

Since this is on the critical path for Clear() and originalCapacity is only (now) used for Debug code and I am not sure that the JIT optimizer could remove the code associated with this statement completely.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tweeked the code above to change DefaultCapacity to m_ChunkChars.Length. I did this because it is best to preserve (as much as we can) the invariant that Capacity does not shrink (to preserve the Stringbuilder growing until a high water mark and then being efficient thereafter).

@vancem
Copy link
Copy Markdown

vancem commented Mar 16, 2018

Is this change a pure improvement across the board, or are there situations where this change is going to regress the behavior? If yes - what are those and how much are they regressing in the worst cast?
To answer that we would need to have a better understanding of when people call Clear() and what they do after that. @vancem do you have a sense of that?

In the code I suggested, I am only LOWER the Capacity (that is change current behavior) in pretty narrow circumstances. In particular it is never lowered below the last chunk size (which is typically the largest chunk).

I can come up with cases where we would be less efficient than before, but they involve insert() and they were probably not very efficient before either (we have not fixed that).

Copy link
Copy Markdown

@vancem vancem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. In summary, to keep the capacity from growing in an unbounded, way, we cap it to be 120% of the TOTAL length OR the capacity of the last chunk, whichever is larger.

The most interesting cases are when stringbuilders are reused. In this case unless 'insert' is used, you will end up with one chunk, and thus the capacity does not change (since capacity == m_chunkChars.Length). Thus there is no perf difference.

In a multi-buffer case (the ones that caused the bug), it caps it to the last chunk or 1.2X the length. This also feels benign.

Copy link
Copy Markdown

@vancem vancem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you removed the asserts, but think that is OK because the first assert is not interesting (we are not changing anything that would change capacity), and the second assert is simply no longer true (we are intentionally breaking it).

int numChunksToShow = 0;
StringBuilder lastChunk = null;
// Gets numChunksToShow. If numChunksToShow is larger than maxChunksToShow, then returns last chunk
GetNumChunksToShow(this, ref numChunksToShow, maxChunksToShow, ref lastChunk);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref int is unusual - if you do not recurse you can simply make it a return value.

consider renaming lastChunk to headChunk since last is a bit ambiguous (it's first logically, last in the datastructure)

rather than ref parameters, consider returning a tuple (int numChunksToShow, StringBuilder headChunk)

private void GetNumChunksToShow(StringBuilder sb, ref int numToShow, int maxChunksToShow, ref StringBuilder lastChunk)
{
if (sb.m_ChunkPrevious != null)
GetNumChunksToShow(sb.m_ChunkPrevious, ref numToShow, maxChunksToShow, ref lastChunk);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I would not use recursion for something like this that can be done simply and easily without recursion, because it adds the risk of stack overflow (in this case there there are a huge number of chunks). Also it is nice to avoid ref parameters because they have a little more cognitive load. Eg.,

        private (int numChunksToShow, StringBuilder headChunk) GetNumChunksToShow(StringBuilder sb, int maxChunksToShow)
        {
            StringBuilder current = sb.m_ChunkPrevious;
            int count = 0;
            while (current != null && count < maxChunksToShow)
            {
                   count++;
                    current = current.m_ChunkPrevious;
            }

            return (count, current)
           }


namespace System.Text
{
public sealed partial class StringBuilder : ISerializable
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, interface implementation (same as base class derivation) need not be declared on all partial classes. In this case as ISerializable has nothing to do with this file, I would remove it. It's just more to maintain.

StringBuilder lastChunk = null;
// Gets numChunksToShow. If numChunksToShow is larger than maxChunksToShow, then returns last chunk
GetNumChunksToShow(this, ref numChunksToShow, maxChunksToShow, ref lastChunk);
Span<string> chunkChars = new string[numChunksToShow];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does Span gain you here? It's just an array and used only as an array.

Span<string> chunkChars = new string[numChunksToShow];
var sb = lastChunk ?? this;

while (numChunksToShow > 0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a for loop?

sb = sb.m_ChunkPrevious;
numChunksToShow--;
}
return chunkChars.ToArray();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is calling ToArray on a Span that is wrapping an array.

// Avoid possible infinite capacity growth. See https://github.com/dotnet/coreclr/pull/16926
int capacityToPreserve = Math.Min(Capacity, Math.Max(Length * 6 / 5, m_ChunkChars.Length));

// We crossed a chunk boundary when reducing the Length. We must replace this middle-chunk with a new larger chunk,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should probably be above the line you added above.

@maryamariyan
Copy link
Copy Markdown
Author

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test please

@maryamariyan maryamariyan force-pushed the fix-stringbuilder-clear branch from 028f56b to 2246bf3 Compare March 22, 2018 18:19
{
int numChunks = 0;
StringBuilder current = this;
while (current != null)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have done this in one loop, by starting off storing head = this and once numChunks > maxChunksToShow you know you have skipped enough, and each loop after that you update head = head.m_ChunkPrevious. When you have looped to the end, head has moved along numChunks - maxChunksToShow times, and is pointing to the chunk you want.

Once you have done that I would inline this method into the one above. Actually I would just have a single method for everything.

@maryamariyan maryamariyan force-pushed the fix-stringbuilder-clear branch from 551e283 to 622a9ea Compare March 22, 2018 19:48
current = current.m_ChunkPrevious;
}
current = head;
string[] chunks = new string[count];
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danmosemsft initially because you requested we have a method that returns IEnumerable, showing one chunk at a time, then I decided to create this string array.

But now I think you suggested to keep just one method for the entire ShowChunks() method. Therefore I won't need this string array and can then also use span to set the output.

}
current = head;
string[] chunks = new string[count];
for (int i = count; i > 0; i--)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, not worth resetting for, it would be more normal to do this because within the loop the index is then natural.

for (int i = count - 1; i >= 0; i--)
  chunks[i] ...

@SoftCircuits
Copy link
Copy Markdown

Shouldn't there be a very fast way to set StringBuilder.Length to zero without reallocating anything or changing the capacity?

I'm loading a big file and need to reset the StringBuilder on each line, but in most cases the existing capacity will work fine. Why do I need to add this unwanted overhead every time I call Clear() or set Length to zero?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants