This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Add String.Split overloads that take a single char and string separator #895

Merged: weshaggard merged 1 commit into dotnet:master from justinvp:string_splitoverloads on Aug 24, 2016

@justinvp justinvp commented May 1, 2015

Fixes dotnet/corefx/issues/1513

Notes:

  • Tests are here: "Add more String.Split tests" (corefx#1600). However, the tests in corefx won't exercise these new overloads until the new methods are exposed in System.Runtime.dll (the tests are written to "light up" once the methods are available). In the meantime, I ran the Split tests manually on Windows in a one-off app compiled against a CoreCLR mscorlib.dll with the added methods.
  • I made the changes as "surgical" as possible to minimize churn. There are some opportunities for stylistic code formatting improvements, but I held off on those for this PR.

/cc @ellismg
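
For readers skimming the thread, here is a minimal sketch of how the proposed overloads would look at a call site next to the existing array-based one. It assumes a runtime where the new methods are already exposed (at this point in the PR they are not yet visible through System.Runtime), and the class and method names are illustrative only.

```csharp
using System;

static class SplitOverloadsSketch
{
    // Existing API: a single separator has to be wrapped in a char[].
    public static string[] SplitWithArray(string input)
    {
        return input.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Proposed API: Split(char separator, StringSplitOptions options).
    public static string[] SplitWithChar(string input)
    {
        return input.Split(',', StringSplitOptions.RemoveEmptyEntries);
    }

    // Proposed API: Split(string separator, StringSplitOptions options).
    public static string[] SplitWithString(string input)
    {
        return input.Split("::", StringSplitOptions.None);
    }
}
```

Both char-based helpers should return the same pieces; the difference is only whether the caller has to allocate a one-element array.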

Comment thread src/mscorlib/src/System/String.cs Outdated

This should be SecurityCritical: the safety of this method depends on up-stack code being correct (i.e. if you pass in a bad seperatorsLength, you violate memory safety).

ellismg commented May 1, 2015

Thanks, @justinvp. This looks like a great first start. I'm going to want to take another in-depth look (as well as review the tests) before signing off. I'm especially interested in understanding if doing something with unsafe code would end up being better than the StringSeperatorArray thing. Did you happen to play around with that?

We also need to figure out the process for taking this type of change (e.g. do we have a future branch?) so it will probably be a few days before I'm comfortable merging this. I'm already working internally to figure out what the right strategy is.

@justinvp justinvp force-pushed the string_splitoverloads branch from d88b61d to 862da1e Compare May 1, 2015 01:27
Comment thread src/mscorlib/src/System/String.cs Outdated

Can this field be eliminated if you place constraints on the other two fields? (e.g. _separator != null implies 'single')

Author

The problem is _separator can be null, that's a valid value.

When you construct this you could convert a null separator to string.Empty, and I think that would give you the behavior you want?

Author

Ah, @ellismg, I see what you mean. Since separators that are null or empty are treated the same by Split, string.Empty can be set so that _separator != null implies 'single'. Thanks for the feedback guys!
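
The idea in this exchange can be sketched roughly as follows. This is a hypothetical stand-in, not the PR's actual type: the struct name and members are invented for illustration, and only the null-to-string.Empty normalization comes from the discussion above.

```csharp
using System;

// Hypothetical separator holder: no separate "is single" flag is stored.
struct SeparatorHolderSketch
{
    private readonly string _separator;    // non-null means "single separator" mode
    private readonly string[] _separators; // consulted only when _separator is null

    public SeparatorHolderSketch(string separator)
    {
        // Split treats null and empty separators the same way, so a null
        // separator can be normalized to string.Empty without changing behavior.
        _separator = separator ?? string.Empty;
        _separators = null;
    }

    public SeparatorHolderSketch(string[] separators)
    {
        _separator = null;
        _separators = separators;
    }

    // Derived from the other two fields instead of being stored.
    public bool IsSingle
    {
        get { return _separator != null; }
    }

    public int Length
    {
        get { return IsSingle ? 1 : (_separators == null ? 0 : _separators.Length); }
    }
}
```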

justinvp commented May 1, 2015

Thanks for the initial feedback, @ellismg and @pharring!

I'm especially interested in understanding if doing something with unsafe code would end up being better than the StringSeperatorArray thing. Did you happen to play around with that?

I went with the SplitSeparatorArray approach for the string-based separators as a slimmed down version of ParamsArray (used by String.Format) to minimize the churn and share as much of the implementation as possible between string and string[] per @davkean's recommendation.

I haven't had a chance to run before/after comparisons to see what the perf difference is, but I'll do that and let you know. And I'll look into the unsafe approach.

@justinvp justinvp force-pushed the string_splitoverloads branch from 862da1e to 8adea6c Compare May 1, 2015 05:34
@justinvp justinvp closed this May 5, 2015
@justinvp justinvp force-pushed the string_splitoverloads branch from 8adea6c to 1827eb2 Compare May 5, 2015 02:40

davkean commented May 5, 2015

Hey, what happened to this?

justinvp commented May 5, 2015

Sorry, I was trying to replace the last commit with another commit from another branch and GitHub ended up closing the PR automatically. I'll try to reopen, if not, I'll open a new PR.

justinvp commented May 5, 2015

@ellismg,

The unsafe approach for string-based separators looks to be more trouble than it's worth. As far as I know, there's no way to stackalloc a string*. Doing this with unsafe would involve manually keeping track of a GCHandle for each string via GCHandle.Alloc(separator, GCHandleType.Pinned) (saving the GCHandles in a stackalloc'd array?), performing the split operation, and then calling GCHandle.Free for each string at the end. (Let me know if you know of a better way to go about the unsafe approach).
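
To make the bookkeeping concrete, here is a rough sketch of the pin-and-free pattern described above. It is not code from this PR: the helper name and the callback shape are assumptions, and raw addresses are passed as IntPtr so the sketch compiles without the unsafe switch.

```csharp
using System;
using System.Runtime.InteropServices;

static class PinnedSeparatorsSketch
{
    // Hypothetical helper: pins every separator, hands the addresses of their
    // first characters to `body`, then frees each handle even if `body` throws.
    public static void WithPinned(string[] separators, Action<IntPtr[]> body)
    {
        var handles = new GCHandle[separators.Length];
        var addresses = new IntPtr[separators.Length];
        try
        {
            for (int i = 0; i < separators.Length; i++)
            {
                handles[i] = GCHandle.Alloc(separators[i], GCHandleType.Pinned);
                addresses[i] = handles[i].AddrOfPinnedObject();
            }

            body(addresses);
        }
        finally
        {
            foreach (GCHandle handle in handles)
            {
                // Handles that were never allocated report IsAllocated == false.
                if (handle.IsAllocated)
                {
                    handle.Free();
                }
            }
        }
    }
}
```

Even in this reduced form, two extra arrays and a try/finally are needed just to hold the pins, which illustrates why the approach was judged more trouble than it's worth.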

As an alternative, I got rid of the SplitSeparatorArray and added a MakeSeparatorList method just for a single separator. PR updated. This is simpler than the unsafe approach described above and more performant than the previous approach. If we decide to add overloads for 2-3 separators, then SplitSeparatorArray (or something like it) could be brought back, and it could have its own MakeSeparatorList method, so as not to disrupt the performance/locality of the MakeSeparatorList for string[].

As an aside, it'd be nice to investigate whether it's possible to get rid of or minimize/mitigate the int[] allocations for sepList and lengthList, while not compromising perf. Perhaps using stackalloc if the length isn't too large, falling back to a heap allocation for larger strings, or a different approach altogether. I'd be interested in exploring this further.
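
One way the stackalloc-with-fallback idea could look is sketched below. This is only an illustration under stated assumptions: it uses Span<int> and newer C# syntax that were not available to mscorlib when this was written, the 128-element threshold is arbitrary, and the method is a simplified stand-in rather than the real separator-list code.

```csharp
using System;

static class SeparatorIndexSketch
{
    private const int StackLimit = 128; // arbitrary threshold for this sketch

    // Records the index of each separator in a scratch buffer and returns how
    // many were found. Short inputs use the stack; longer ones use the heap.
    public static int CountSeparators(string input, char separator)
    {
        Span<int> sepList = input.Length <= StackLimit
            ? stackalloc int[StackLimit]
            : new int[input.Length];

        int found = 0;
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == separator)
            {
                sepList[found++] = i;
            }
        }

        return found;
    }
}
```

Whether this actually beats the int[] allocation would need to be measured; the sketch only shows the shape of the fallback.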

Comment thread src/mscorlib/src/System/String.cs Outdated

Delete. lengthList isn't a parameter.

Author

Fixed. Thanks.

@justinvp justinvp force-pushed the string_splitoverloads branch from 1130946 to 32554a8 Compare May 5, 2015 03:48
@justinvp justinvp changed the title Add String.Split overloads that take a single char and string separator [WIP] Add String.Split overloads that take a single char and string separator May 5, 2015

ellismg commented May 5, 2015

The unsafe approach for string-based separators looks to be more trouble than it's worth.

Sounds good, thanks for giving it a shot.

As an aside, it'd be nice to investigate whether it's possible to get rid of or minimize/mitigate the int[] allocations for sepList and lengthList, while not compromising perf. Perhaps using stackalloc if the length isn't too large, falling back to a heap allocation for larger strings, or a different approach altogether. I'd be interested in exploring this further.

This is something I want to explore as well. It's probably reasonable to bank the wins from this new overload and do the investigation as a separate PR. I think the best place to start is writing some benchmarks where we could measure both time and GC. Right now, the state of the art for doing this seems to be:

  1. Write some xunit tests for benchmarks
  2. Run each test individually, using PerfView to collect data for the run.
  3. Make changes
  4. Collect new data with PerfView and diff them.
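
As a lightweight stand-in for the workflow above (it is not the xunit and PerfView setup itself), a before/after comparison of time and allocations could be approximated with something like the following. The helper name is hypothetical, and GC.GetAllocatedBytesForCurrentThread is assumed to be available on the target runtime.

```csharp
using System;
using System.Diagnostics;

static class SplitBenchmarkSketch
{
    // Runs `action` repeatedly and reports elapsed time plus bytes allocated
    // on the current thread during the measured loop.
    public static (TimeSpan Elapsed, long AllocatedBytes) Measure(Action action, int iterations)
    {
        action(); // warm-up run so JIT cost is not counted

        long before = GC.GetAllocatedBytesForCurrentThread();
        Stopwatch stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        stopwatch.Stop();
        long after = GC.GetAllocatedBytesForCurrentThread();

        return (stopwatch.Elapsed, after - before);
    }
}
```

Calling Measure once per candidate implementation and comparing the two results gives a coarse picture; a profiler is still needed to see where the allocations come from.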

We are working on getting a system set up so you just write benchmarks in XUnit and everything else is taken care of (and run in the CI system so we ensure we don't regress) but we probably won't be there for a few months.

If you want to take the lead here, I'm happy to let you do so, and we can bounce ideas off one another (perhaps in another issue or PR?)

Could be

internal unsafe String[] SplitInternal(char[] separator, int count, StringSplitOptions options)

?

Author

Could be. I don't have a preference either way.

Fewer brackets, less indentation, better readability.

Author

The flip side is this:

If unsafe is an implementation detail then why mark the methods unsafe?

A moot point, but I digress.

Comment thread src/mscorlib/src/System/String.cs Outdated

Is this better as Contract.Assert(!string.IsNullOrEmpty(separator), $"!string.IsNullOrEmpty({nameof(separator)})"); or is string interpolation not available?

Also, you seem to bounce between String and string, which is "correct"?

Author

String interpolation can't be used as mscorlib still needs to be compiled with the older C# compiler.

As for String vs. string, the code style in this file is inconsistent all over the place, but String would probably be the most consistent with other uses in this file (this is in contrast with the corefx repo where string is preferred).

ghost commented Aug 30, 2015

We are working on getting a system set up so you just write benchmarks in XUnit and everything else is taken care of (and run in the CI system so we ensure we don't regress) but we probably won't be there for a few months.

The community is very interested in addressing the performance aspect of .NET Core.
More PRs of this sort are showing up (#1460, #137) and getting queued (#1241 (comment)).
Please expedite bringing up the benchmark system, so these pending PRs get handled.

@NickCraver
Member

+1 on merging this in. We maintain a SplitSplits constant array set on every project to side-step the needless array allocation here. The allocations can really pile up in high-traffic environments (e.g. Stack Overflow was doing billions a day before that optimization) and honestly the workaround feels silly - what's the status on getting these into the BCL?
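
The workaround described here might look roughly like the sketch below. The class and field names are hypothetical (the comment only mentions a set of constant arrays); the point is that the separator array is allocated once instead of on every call through the params char[] overload.

```csharp
using System;

static class SplitCharsSketch
{
    // Allocated once and reused, instead of a fresh char[] per Split call.
    public static readonly char[] Comma = { ',' };

    public static string[] SplitOnComma(string input)
    {
        return input.Split(Comma, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With a Split(char, ...) overload available, the cached array would no longer be needed at call sites like this one.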

@justinvp
Author

The main thing I've been waiting on is whether this needs to be resubmitted against (a yet to be created) "future" branch. Though, perhaps this PR can be merged into coreclr/master, with the actual reference assembly contract changes submitted separately against corefx/future?

@ellismg, What do you think?

(Also, I won't be able to get to this until next week when I'm back in front of a computer, but I want to ensure the additions to model.xml here are in the "right" order, to be consistent with the rest of the file.)

@justinvp justinvp force-pushed the string_splitoverloads branch from 32554a8 to b454cee Compare November 13, 2015 05:50
@justinvp justinvp changed the title [WIP] Add String.Split overloads that take a single char and string separator Add String.Split overloads that take a single char and string separator Nov 13, 2015
@justinvp justinvp mentioned this pull request Nov 13, 2015

ghost commented Jan 12, 2016

@justinvp, can a similar overload with a single char be added for string.Join too? The current approach is to call ToString() on the char, which is a bit inefficient, e.g. string.Join(Path.PathSeparator.ToString(), ..).

jamesqo commented Feb 20, 2016

Just curious, is this still going to be merged? It's been a few months since the last commit...

@weshaggard
Member

@jasonwilliams200OK We should keep this PR focused on the APIs covered with https://github.com/dotnet/corefx/issues/1513. Another one should be filed for additional public APIs.

@justinvp I think this change can go into coreclr/master, but as you point out, we cannot expose these APIs in the System.Runtime contract until the changes can also be brought into the full .NET Framework. We would need to either create a future version of the System.Runtime contract ref or put it into the corefx/future branch (I need to figure out how we add such APIs on core types in general, as we are still working out the whole netstandard stuff).

@ellismg @AlexGhiondea is one of you going to follow up about getting these added to the .NET Framework?

ghost commented Feb 22, 2016

@weshaggard, thanks. The String.Split issue was filed here: https://github.com/dotnet/corefx/issues/5552 followed by the PR #2945. :)

@justinvp justinvp force-pushed the string_splitoverloads branch from b454cee to 7925348 Compare March 23, 2016 07:49
@NickCraver
Member

Is there any update here? I see the last build breaking. I'm assuming this falls under the next generation, so are we just waiting until post-RC2 or RTM to consider PRs like this?

@AlexGhiondea

I think this needs to go through API Review. @terrajobst can you help schedule this?

justinvp commented Apr 6, 2016

This has already been API reviewed and approved.

@terrajobst said:

We reviewed this issue today. It looks good as proposed.

https://github.com/dotnet/corefx/issues/1513#issuecomment-97279502

ellismg commented Apr 6, 2016

The plan is to do this, but we are going to wait until post-1.0.

@justinvp
Author

@ellismg, @jkotas, is this an appropriate time to get this reviewed/merged? https://github.com/dotnet/corefx/issues/2578 (another new System.Runtime API) was just merged. This is https://github.com/dotnet/corefx/issues/1513, which was approved by API review. After this is reviewed and looks good, I can port to CoreRT.

ghost commented Jul 18, 2016

(and please consider this very similar #2945 as well 😊)

jkotas commented Jul 18, 2016

Let's see the one I have merged go through end-to-end first, to see how it is going to work.

jamesqo commented Aug 19, 2016

Any progress on this? I am thinking of making changes to Split, however I'm hesitant to do so until this gets in to avoid any merge conflicts.

justinvp commented Aug 19, 2016

Some ideas:

  1. We could merge this as-is (modulo feedback). The new APIs would be public in System.Private.CoreLib.dll, but won't be available until they're exposed in the System.Runtime.dll contract at some point in the future.
  2. I remove the model.xml changes from this PR and we merge (modulo feedback). The new APIs will exist in the code, but will be removed during the BCL rewrite/thinner process, so they won't be present in System.Private.CoreLib.dll at all. Then a PR in the future can expose them in model.xml, followed by another PR to expose them in the System.Runtime.dll contract, when new APIs are allowed.
  3. I could open a new PR with just the internal changes to existing methods, and we merge that (modulo feedback). Then update this PR to add the new APIs on top of that PR.
  4. You just go ahead with your perf changes that conflict with this, and at some point I deal with the merge conflicts.

jamesqo commented Aug 21, 2016

@justinvp I think you should just go ahead with this PR (1). I haven't actually made any changes to Split yet, and I wouldn't want to keep this from being merged any longer. I can send in my pull request after that's done.

Great job on adding these, btw :)

@weshaggard
Member

@justinvp thanks for all your work and patience on this. I chatted with @jkotas and we think going with your option (1) is the right approach to make some progress. So for APIs that are approved and reviewed we will allow them to be added to the implementation but we need to file a tracking issue in corefx to expose them in the reference assemblies and to provide tests for them at a later time.

@justinvp
Author

Thanks, @weshaggard & @jkotas!

provide tests for them at a later time

In this particular case, I've already written the tests so that they automatically "light-up" when the APIs are exposed in the reference assembly, via extension methods, so the only test-related action will be to delete those extension methods once the APIs are exposed.
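
The "light-up" technique relies on C# overload resolution preferring an applicable instance method over an extension method. A minimal sketch of the idea follows; the names are illustrative, and the actual tests in corefx#1600 may be organized differently.

```csharp
using System;

static class StringSplitLightUpExtensions
{
    // Stand-in that is picked up only while String has no
    // Split(char, StringSplitOptions) instance method of its own.
    public static string[] Split(this string value, char separator, StringSplitOptions options)
    {
        // Forwards to the existing Split(char[], StringSplitOptions) overload.
        return value.Split(new[] { separator }, options);
    }
}

static class LightUpUsageSketch
{
    public static string[] SplitOnComma(string input)
    {
        // Binds to the instance method once the contract exposes it;
        // until then the compiler falls back to the extension method above.
        return input.Split(',', StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Once the reference assembly exposes the real overload, the test code compiles unchanged against it, and the extension class can simply be deleted.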

@weshaggard
Member

LGTM. Thanks @justinvp.

@weshaggard weshaggard merged commit a352d94 into dotnet:master Aug 24, 2016

jamesqo commented Aug 24, 2016

Awesome! 😄
