Sync PerformanceCounterLib#26475
Conversation
|
@adiaaida |
|
ping |
|
@adiaaida @brianrob ? |
brianrob
left a comment
There was a problem hiding this comment.
Sorry for the delay on this. Some comments and questions.
| private Hashtable _categoryTable; | ||
| private Hashtable _nameTable; | ||
| private Hashtable _helpTable; | ||
| private readonly object _customCategoryTableLock = new Object(); |
There was a problem hiding this comment.
I would recommend only using InternalSyncObject. Introducing a new lock here can cause deadlocks if we don't get it right. I've seen that sort of thing in this area before and we've removed locks for this reason.
There was a problem hiding this comment.
Understood, but InternalSyncObject is static and protect race on PerformanceCounterLib static methods(create/destroy machine/lcid instances), "_customCategoryTableLock", _categoryTableLock, _nameTableLock, _helpTableLock protect race on machineName/lcid PerformanceCounterLib internal instance(GetCounters etc...), is correct mix "static locks" with "instance locks"?
There was a problem hiding this comment.
There is nothing fundamentally wrong with this. For me the question is whether or not the number of instances and actions upon them in a multi-threaded fashion will result in significant performance degradation if we use a single lock. My suspicion is that this won't be an issue because this is all about creation of counters and not use of counters. I would be much more concerned if we did something like this when reading counters.
| _categoryTable = null; | ||
| _customCategoryTable = null; | ||
| //race with FindCustomCategory | ||
| lock (_customCategoryTableLock) |
There was a problem hiding this comment.
Nit: Please add enclosing braces.
| Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null); | ||
| } | ||
| if (_customCategoryTable == null) | ||
| _customCategoryTable = new Hashtable(StringComparer.OrdinalIgnoreCase); |
There was a problem hiding this comment.
Nit: Please add enclosing braces.
| DateTime now = DateTime.UtcNow; | ||
| while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10)) | ||
| { | ||
| CloseAllLibraries(); |
There was a problem hiding this comment.
It looks like CloseAllTables got removed. Is that intentional? I'd expect that you'd want completely clean state.
There was a problem hiding this comment.
CloseAllLibraries() call library.Close() that call CloseTables(), CloseAllTables() call CloseTable() for every library.
| */ | ||
| bool isPublished = false; | ||
| DateTime now = DateTime.UtcNow; | ||
| while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10)) |
There was a problem hiding this comment.
I feel reasonably good about adding locks here, but I don't think that we should add a 10 second retry here. If you want to add a couple of retries, that might be OK. I also don't think that I would throw an exception either. The only real thing that can be done in the case of the thrown exception is to wait longer, and so you'd expect the user of performance counters to do that when attempting to get the counter - users already have to do this, right?
There was a problem hiding this comment.
As i have written above the idea here is "wait for publication" not to protect access to shared resources, let me understand why lock is useful?If i understood well the issue is not race, but this is a "random" registry publication delay(read on guide) so after some time(10 seconds or less?or maybe after n retry is better?) if performance counter does not exists something went wrong or someone removed same counter. What i mean is that is not sure "wait longer" is the only solution, do you agree?
There was a problem hiding this comment.
I see your point - It's just not clear to me what the right number of retries or the correct amount of time is here. That's why I would guess that the desktop implementation does not have this type of logic and instead depends upon the caller to decide what to do (not necessarily better, but more flexible).
@weshaggard, do you know if there is any precedent in managed APIs that wrap Windows OS APIs where delay is an issue (such as this registry publication delay)? Do we generally take the approach that callers need to do retries, or do we have a pattern that we've used before to attempt to solve this problem?
There was a problem hiding this comment.
@stephentoub could answer this. I thought there are some examples (maybe in IO?) but I can't find one specifically so that may not be the case.
There was a problem hiding this comment.
Thanks @danmosemsft. @stephentoub do you have any thoughts on this?
There was a problem hiding this comment.
Just to be clear, this is not retries, this is 'wait for publication'. Previously the API was asynchronous (you set things in motion, but state update may not have happened before return), and you have made the synchronous (which gets you into the business of deciding how long to wait, and what to do when that fails.
We ARE changing the contract (it can throw an exception now where it definitely did NOT before), so this is a compatibility break (how much do we care?)
This is also a 'compatibility' library (it is not part of the cross platform core), and my expectation is that we are not trying to innovate here (just make existing things work).
This suggests that we should not change things here. What are the disadvantages of simply leaving the old behavior?
There was a problem hiding this comment.
@vancem i agree with you this is 'wait for publication' and the idea is to mitigate the issue. The scope of this PR is for first resolve some race issues and now seems ok. The second is try to remove [assembly:CollectionBehavior(DisableTestParallelization = true)] to run tests concurrently, but maybe this is not fundamental. @danmosemsft added this attribute after some tests fail due to "wait for publication" side effect and internal cleanup/load cached tables. If we don't want to change too much on this namespace we can keep the attribute.
There was a problem hiding this comment.
My recommendation is to pull out the wait loop and do that separately. If the change was only motivated by tests, it is easy enough to simply do that waiting in the test rather than here. Which we choose really depends on whether our customers would be happier with or without it (but we are biased to doing nothing if we don't know). I am suspicious that we don't know, so I would keep the semantics the same (less delta between us an .NET Desktop).
So my recommendation is to pull it (we can put it back if we care in the future...
There was a problem hiding this comment.
ok @vancem i'll remove 'wait for publication strategy' and fix only race, thank's! @danmosemsft i think we need to keep attribute(or expose IsPublished() to category and update all tests if it's so important).
| [SkipOnTargetFramework(TargetFrameworkMonikers.Uap)] // In appcontainer, cannot write to perf counters | ||
| public static class PerformanceCounterCategoryTests | ||
| { | ||
| #if MyTrait |
There was a problem hiding this comment.
Looks like MyTrait was left in inadvertently.
There was a problem hiding this comment.
As i have written above this is test/debug code, this test will be removed, or do you think we need an outerloop test of some seconds?
There was a problem hiding this comment.
I see. I do like the idea of your multi-threaded test making making sure that we haven't introduced any deadlocks with this (or future) changes. So, if you were to modify your test just to make sure that it actually completes within a reasonable time period, I think that would be a good addition to the test suite.
There was a problem hiding this comment.
ok if the idea of some sort of "wait" will be accepted i'll refactor an [OuterLoop()] test(for instance create/destroy n performance counter concurrently).
|
If you are successful, we can include a change to reverse this? Then repeatedly run CI to see whether we can still hit a problem. |
|
@brianrob should be ok now, you can review, thank's! |
|
@dotnet-bot test NETFX x86 Release Build |
1 similar comment
|
@dotnet-bot test NETFX x86 Release Build |
|
@brianrob, is this PR good to merge? |
| s_libraryTable = null; | ||
| } | ||
| } | ||
| //race with GetPerformanceCounterLib |
There was a problem hiding this comment.
Does this locking actually make a material difference in race conditions with GetPerformanceCounterLib? GetPerformanceCounterLib doesn't take the lock on InternalSyncObj if s_libraryTable is already initialized. And even if it did, it would only take the lock long enough to initialize the table, but it would be handing out a reference to a PerformanceCounterLib that we're closing here, so code could be using a PerformanceCounterLib concurrently with it being Close'd. Is that ok? Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?
There was a problem hiding this comment.
Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?
Yes @stephentoub testing with parallel Register/Unregister(CloseAllLibraries) and GetCounterXXX i see a lot of new PerformanceCounterLib(machineName, lcidString); on GetPerformanceCounterLib for s_libraryTable = null; this lock is only to protect mutation s_libraryTable and avoid a lot of benign instances of PerformanceCounterLib in parallel scenario
but it would be handing out a reference to a PerformanceCounterLib that we're closing here
doesn't seem to be a problem because the "registry tables" will be reloaded, if counter is no more present a not found exception will raise.
Again is a borderline scenario(parallel register/unregister).
| categoryType = PerformanceCounterCategoryType.Unknown; | ||
|
|
||
| if (_customCategoryTable == null) | ||
| //race with CloseTables |
There was a problem hiding this comment.
This is a lot of code to put under a lock; this is not only synchronizing with CloseTables, but with other calls to FindCustomCategory. Might that be a performance/scalability problem?
I would think a better way to deal with this would be to instead just change this method to not refer to _customCategoryTable other than for the initial checking/setting of the field, e.g.
Hashtable table =
_customCategoryTable ??
Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null) ??
_customCategoryTable;and then use table for the rest of the method rather than using _customCategoryTable, only taking a lock (table) when we actually modify table, since Hashtable is thread-safe for any number of readers with one writer.
There was a problem hiding this comment.
Might that be a performance/scalability problem?
i followed internal FindCustomCategory() and is used only in init path today, no hot path.
However your solution is better, so i agree with you. Thank's a lot.
|
@dotnet-bot test OSX x64 Debug Build |
|
@dotnet-bot test Windows x86 Release Build |
|
Looks like this one is ready to merge. @stephentoub do you have any other feedback? |
| lock (_customCategoryTableLock) | ||
| { | ||
| _customCategoryTable = null; | ||
| } |
There was a problem hiding this comment.
Why is this lock still needed?
There was a problem hiding this comment.
i understand, i'll remove it. Thanks!
| // | ||
| categoryType = PerformanceCounterCategoryType.Unknown; | ||
| _customCategoryTable[category] = categoryType; | ||
| lock (_customCategoryTableLock) |
There was a problem hiding this comment.
You shouldn't need a separate _customCategoryTableLock... just lock on table.
There was a problem hiding this comment.
you'are right. Thanks!
| { | ||
| RegisterFiles(categoryName, true); | ||
| DeleteRegistryEntry(categoryName); | ||
| CloseAllTables(); |
There was a problem hiding this comment.
Why is CloseAllTables no longer needed?
There was a problem hiding this comment.
CloseAllLibraries() call Close() on every PerformanceCounterLib instance that call CloseTables(). Isn't this redundant?
There was a problem hiding this comment.
I don't know. I'm not familiar with the codebase.
There was a problem hiding this comment.
I'm inclined to say that we should leave CloseAllTables in place. I agree that the work is done implicitly by calling CloseAllLibraries, but some of this work is done outside of locks and I'd hate to change the characteristics of a race condition and break user code. Given that this functionality is here for compatibility, I'd leave it.
|
@dotnet-bot test Linux x64 Release Build please |
|
@dotnet-bot test OSX x64 Debug Build please |
1 similar comment
|
@dotnet-bot test OSX x64 Debug Build please |
|
@brianrob do you have further feedback? I'd like to get this committed to see whether it helps tests stop being flaky. |
|
@danmosemsft, I just replied back to the last comment. Once the requested changes are made as long as @stephentoub is OK with the changes, I am too. |
|
@danmosemsft @brianrob @stephentoub restored CloseAllTables(). |
|
@dotnet-bot test NETFX x86 Release Build |
* sync PerformanceCounterLib * add enclosing braces * address PR feedback * address PR feedback * address PR feedback * nit: trim spaces * address PR feedback * address stephentoub PR feedback * nit: tab * nit: tab * address PR feedback Commit migrated from dotnet/corefx@41c8ba2

Closes #25403
I tried to understand where is the problem and after tests:
the new reloaded CategoryTable sometimets doesn't contains new counter as if all modification to registry are not yet published to all threads
(by design read guide note https://msdn.microsoft.com/en-us/library/cs38wsc4(v=vs.110).aspx "If the list has not been refreshed, the attempt to use the category will fail").
So i added a "retry" strategy to wait new counter publication IsPublished() to mitigate the issue.
(the other table are re-loaded from registry)
race with CloseAllLibraries() on s_libraryTable
My idea is that this namespace wasn't built to handle special "concurrent" scenarios where Create()/Delete()/GetXXXX() are used heavily at same time,
and it make sense, usually i "install a counter" and after i use it, so concurrent modification are rare.
Parallel unit testing is borderline.
Maybe if we want to have a fully concurrent api we've to change more code and have more "coarse" locks also to protect concurrent registry access, or we can
only sync internal structure of PerformanceCounterLib and keep attribute.
I think that today [assembly: CollectionBehavior(DisableTestParallelization = true)] is necessary with this implementation to mitigate the issue.
After these changes sometimes(after heavy loaded minutes with 64 parallel thread in loop Exists/Delete/Create/GetCounter/ReadValue) i get corrupted registry error like(no more null reference for race):
This is WIP code for discussion(tests/debug code).
cc: @joperezr @danmosemsft @adiaaida