Sync PerformanceCounterLib by MarcoRossignoli · Pull Request #26475 · dotnet/corefx

MarcoRossignoli · 2018-01-20T07:58:54Z

Closes #25403

I tried to understand where the problem is. After some testing, this is what I found:

At the end of RegisterCategory(), CloseAllLibraries() clears the PerformanceCounterLib instances, but if PerformanceCounterLib.GetXXXX() is called at the same time,
the newly reloaded CategoryTable sometimes doesn't contain the new counter, as if the registry modifications had not yet been published to all threads
(this is by design; see the note in the guide https://msdn.microsoft.com/en-us/library/cs38wsc4(v=vs.110).aspx "If the list has not been refreshed, the attempt to use the category will fail").
So I added a "retry" strategy, IsPublished(), that waits for the new counter to be published, to mitigate the issue.
In DeleteCategory() the call to PerformanceCounterLib.CloseAllLibraries() is redundant because it is already called in PerformanceCounterLib.UnregisterCategory().
In CloseAllLibraries() I added a lock because a concurrent GetPerformanceCounterLib() raises a NullReferenceException on s_libraryTable.
I added a lock to protect _customCategoryTable, including in CloseTables(), because FindCustomCategory() raises a NullReferenceException
(the other tables are reloaded from the registry).
I added a more "coarse" lock in GetPerformanceCounterLib() for two reasons: possible multiple instantiation of PerformanceCounterLib, and
a race with CloseAllLibraries() on s_libraryTable.

My impression is that this namespace wasn't built to handle special "concurrent" scenarios where Create()/Delete()/GetXXXX() are used heavily at the same time,
which makes sense: usually I "install a counter" and use it afterwards, so concurrent modifications are rare.
Parallel unit testing is a borderline case.
If we want a fully concurrent API, we may have to change more code and add more "coarse" locks to also protect concurrent registry access; alternatively, we can
sync only the internal structures of PerformanceCounterLib and keep the attribute.
I think that today [assembly: CollectionBehavior(DisableTestParallelization = true)] is necessary with this implementation to mitigate the issue.
After these changes I sometimes (after several minutes of heavy load, with 64 parallel threads looping over Exists/Delete/Create/GetCounter/ReadValue) get corrupted-registry errors like the following (but no more null references from races):

The Counter layout for the Category specified is invalid, a counter of the type:  AverageCount64, AverageTimer32, CounterMultiTimer, CounterMultiTimerInverse, CounterMultiTimer100Ns, CounterMultiTimer100NsInverse, RawFraction, or SampleFraction has to be immediately followed by any of the base counter types: AverageBase, CounterMultiBase, RawBase or SampleBase.
Cannot load Counter Name data because an invalid index '' was read from the registry.

This is WIP code for discussion (tests/debug code).

cc: @joperezr @danmosemsft @adiaaida

danmoseley · 2018-01-20T08:51:48Z

@adiaaida

MarcoRossignoli · 2018-02-16T17:14:48Z

ping

danmoseley · 2018-02-16T17:34:59Z

@adiaaida @brianrob ?

brianrob

Sorry for the delay on this. Some comments and questions.

brianrob · 2018-02-21T19:32:21Z

        private Hashtable _categoryTable;
        private Hashtable _nameTable;
        private Hashtable _helpTable;
+        private readonly object _customCategoryTableLock = new Object();


I would recommend only using InternalSyncObject. Introducing a new lock here can cause deadlocks if we don't get it right. I've seen that sort of thing in this area before and we've removed locks for this reason.

Understood, but InternalSyncObject is static and protects against races in the PerformanceCounterLib static methods (creating/destroying machine/lcid instances), whereas _customCategoryTableLock, _categoryTableLock, _nameTableLock and _helpTableLock protect against races inside a machineName/lcid PerformanceCounterLib instance (GetCounters etc.). Is it correct to mix "static locks" with "instance locks"?

There is nothing fundamentally wrong with this. For me the question is whether or not the number of instances and actions upon them in a multi-threaded fashion will result in significant performance degradation if we use a single lock. My suspicion is that this won't be an issue because this is all about creation of counters and not use of counters. I would be much more concerned if we did something like this when reading counters.

I agree, seems OK. I checked with the call hierarchy tool; it is called only in init paths (_initialized).

brianrob · 2018-02-21T19:32:36Z

            _categoryTable = null;
-            _customCategoryTable = null;
+            //race with FindCustomCategory
+            lock (_customCategoryTableLock)


Nit: Please add enclosing braces.

brianrob · 2018-02-21T19:32:45Z

-                Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null);
-            }
+                if (_customCategoryTable == null)
+                    _customCategoryTable = new Hashtable(StringComparer.OrdinalIgnoreCase);


Nit: Please add enclosing braces.

brianrob · 2018-02-21T19:33:38Z

+                DateTime now = DateTime.UtcNow;
+                while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10))
+                {
+                    CloseAllLibraries();


It looks like CloseAllTables got removed. Is that intentional? I'd expect that you'd want completely clean state.

CloseAllLibraries() calls library.Close(), which calls CloseTables(); CloseAllTables() calls CloseTables() for every library.

Ok, thank you.

brianrob · 2018-02-21T19:36:39Z

+                 */
+                bool isPublished = false;
+                DateTime now = DateTime.UtcNow;
+                while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10))


I feel reasonably good about adding locks here, but I don't think that we should add a 10 second retry here. If you want to add a couple of retries, that might be OK. I also don't think that I would throw an exception either. The only real thing that can be done in the case of the thrown exception is to wait longer, and so you'd expect the user of performance counters to do that when attempting to get the counter - users already have to do this, right?

As I wrote above, the idea here is to "wait for publication", not to protect access to shared resources, so help me understand why a lock is useful. If I understood correctly, the issue is not a race but a "random" registry publication delay (as described in the guide), so after some time (10 seconds or less? or maybe n retries is better?), if the performance counter does not exist, something went wrong or someone removed the same counter. What I mean is that it is not certain that "wait longer" is the only solution. Do you agree?

I see your point - It's just not clear to me what the right number of retries or the correct amount of time is here. That's why I would guess that the desktop implementation does not have this type of logic and instead depends upon the caller to decide what to do (not necessarily better, but more flexible).

@weshaggard, do you know if there is any precedent in managed APIs that wrap Windows OS APIs where delay is an issue (such as this registry publication delay)? Do we generally take the approach that callers need to do retries, or do we have a pattern that we've used before to attempt to solve this problem?

@stephentoub could answer this. I thought there are some examples (maybe in IO?) but I can't find one specifically so that may not be the case.

Thanks @danmosemsft. @stephentoub do you have any thoughts on this?

Just to be clear, this is not retries, this is 'wait for publication'. Previously the API was asynchronous (you set things in motion, but the state update may not have happened before return), and you have made it synchronous (which gets you into the business of deciding how long to wait, and what to do when that fails).

We ARE changing the contract (it can throw an exception now where it definitely did NOT before), so this is a compatibility break (how much do we care?)

This is also a 'compatibility' library (it is not part of the cross platform core), and my expectation is that we are not trying to innovate here (just make existing things work).

This suggests that we should not change things here. What are the disadvantages of simply leaving the old behavior?

@vancem I agree with you that this is 'wait for publication', and the idea is to mitigate the issue. The first goal of this PR is to resolve some race issues, and that now seems OK. The second is to try to remove [assembly:CollectionBehavior(DisableTestParallelization = true)] so that tests can run concurrently, but maybe this is not fundamental. @danmosemsft added this attribute after some tests failed due to the "wait for publication" side effect and the internal cleanup/load of cached tables. If we don't want to change too much in this namespace we can keep the attribute.

My recommendation is to pull out the wait loop and do that separately. If the change was only motivated by tests, it is easy enough to simply do that waiting in the test rather than here. Which we choose really depends on whether our customers would be happier with or without it (but we are biased to doing nothing if we don't know). I am suspicious that we don't know, so I would keep the semantics the same (less delta between us and .NET Desktop).

So my recommendation is to pull it (we can put it back if we care in the future).
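As a rough illustration of doing the waiting in the test rather than in the library, a test-side polling helper might look like the sketch below. The `TestRetry` class and `WaitUntil` method are hypothetical names, not part of this PR or of corefx; the timeout and poll interval are left to the caller, which is the flexibility argued for above.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical test helper (not part of the PR): the test, not the library,
// decides how long to wait for a condition to become true.
public static class TestRetry
{
    // Polls 'condition' until it returns true or 'timeout' elapses.
    // Returns true if the condition was observed, false on timeout.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (true)
        {
            if (condition())
            {
                return true;
            }
            if (elapsed.Elapsed >= timeout)
            {
                return false;
            }
            Thread.Sleep(pollInterval);
        }
    }
}
```

A test could then wrap a check such as `PerformanceCounterCategory.Exists(name)` in `WaitUntil` and assert on the result, without changing the library's contract.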

OK @vancem, I'll remove the 'wait for publication' strategy and fix only the races, thanks! @danmosemsft I think we need to keep the attribute (or expose IsPublished() on the category and update all tests, if it's that important).

Thanks @MarcoRossignoli!

brianrob · 2018-02-21T19:37:06Z

    [SkipOnTargetFramework(TargetFrameworkMonikers.Uap)] // In appcontainer, cannot write to perf counters
    public static class PerformanceCounterCategoryTests
    {
+#if MyTrait


Looks like MyTrait was left in inadvertently.

As I wrote above, this is test/debug code and this test will be removed. Or do you think we need an outerloop test that runs for a few seconds?

I see. I do like the idea of your multi-threaded test making sure that we haven't introduced any deadlocks with this (or future) changes. So, if you were to modify your test just to make sure that it actually completes within a reasonable time period, I think that would be a good addition to the test suite.

OK, if the idea of some sort of "wait" is accepted, I'll refactor it into an [OuterLoop()] test (for instance, create/destroy n performance counters concurrently).

Makes sense.
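The "completes within a reasonable time period" check discussed in this thread could be sketched roughly as follows. `DeadlockGuard` and `CompletesWithin` are hypothetical names used only for illustration; the real test would pass an action that creates, reads and deletes performance counters, and would be marked [OuterLoop].

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper (not part of the PR): runs the same action on several
// threads and reports whether all of them finished before the timeout.
public static class DeadlockGuard
{
    public static bool CompletesWithin(Action action, int workers, TimeSpan timeout)
    {
        // LongRunning asks the scheduler for dedicated threads so the workers
        // really run in parallel instead of queueing on the thread pool.
        Task[] tasks = Enumerable.Range(0, workers)
            .Select(_ => Task.Factory.StartNew(action, TaskCreationOptions.LongRunning))
            .ToArray();

        // Task.WaitAll returns false on timeout and rethrows any worker
        // exception wrapped in an AggregateException.
        return Task.WaitAll(tasks, timeout);
    }
}
```

A false return would indicate a probable deadlock (or simply a too-short timeout), so the timeout should be generous relative to the expected run time.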

danmoseley · 2018-02-23T18:49:38Z

If you are successful, we can include a change to reverse this?
https://github.com/dotnet/corefx/pull/25401/files

Then repeatedly run CI to see whether we can still hit a problem.

MarcoRossignoli · 2018-03-07T10:47:16Z

@brianrob should be OK now, you can review. Thanks!

MarcoRossignoli · 2018-03-07T13:29:27Z

@dotnet-bot test NETFX x86 Release Build

MarcoRossignoli · 2018-03-08T11:49:33Z

@dotnet-bot test NETFX x86 Release Build

ahsonkhan · 2018-03-10T02:18:51Z

@brianrob, is this PR good to merge?

stephentoub · 2018-03-14T13:41:05Z

-                s_libraryTable = null;
-            }
-        }
+                //race with GetPerformanceCounterLib


Does this locking actually make a material difference in race conditions with GetPerformanceCounterLib? GetPerformanceCounterLib doesn't take the lock on InternalSyncObj if s_libraryTable is already initialized. And even if it did, it would only take the lock long enough to initialize the table, but it would be handing out a reference to a PerformanceCounterLib that we're closing here, so code could be using a PerformanceCounterLib concurrently with it being Close'd. Is that ok? Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?

> Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?

Yes @stephentoub. Testing with parallel Register/Unregister (CloseAllLibraries) and GetCounterXXX, I see a lot of new PerformanceCounterLib(machineName, lcidString); calls in GetPerformanceCounterLib because of s_libraryTable = null;. This lock is only there to protect the mutation of s_libraryTable and to avoid creating a lot of benign PerformanceCounterLib instances in parallel scenarios.

> but it would be handing out a reference to a PerformanceCounterLib that we're closing here

That doesn't seem to be a problem because the "registry tables" will be reloaded; if the counter is no longer present, a not-found exception will be raised.
Again, this is a borderline scenario (parallel register/unregister).

stephentoub · 2018-03-14T13:48:10Z

            categoryType = PerformanceCounterCategoryType.Unknown;

-            if (_customCategoryTable == null)
+            //race with CloseTables


This is a lot of code to put under a lock; this is not only synchronizing with CloseTables, but with other calls to FindCustomCategory. Might that be a performance/scalability problem?

I would think a better way to deal with this would be to instead just change this method to not refer to _customCategoryTable other than for the initial checking/setting of the field, e.g.

Hashtable table = _customCategoryTable ?? Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null) ?? _customCategoryTable;

and then use table for the rest of the method rather than using _customCategoryTable, only taking a lock (table) when we actually modify table, since Hashtable is thread-safe for any number of readers with one writer.
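A minimal sketch of the pattern suggested above: capture the field into a local once, read without a lock, and lock on the table only when writing. `CategoryCache`, `GetOrAdd` and `Close` are hypothetical stand-ins, not the actual PerformanceCounterLib members. The lazy initialization here is a slight variant of the one-liner: it falls back to the freshly created table so the local can never be null, even if the field is reset concurrently.

```csharp
using System;
using System.Collections;
using System.Threading;

// Hypothetical stand-in for the cache under discussion; not the PR's code.
public sealed class CategoryCache
{
    private Hashtable _customCategoryTable; // may be reset to null by Close()

    public object GetOrAdd(string category, Func<string, object> load)
    {
        // Read the field exactly once; everything below uses the local.
        Hashtable table = Volatile.Read(ref _customCategoryTable);
        if (table == null)
        {
            var created = new Hashtable(StringComparer.OrdinalIgnoreCase);
            // If another thread published a table first, use theirs.
            table = Interlocked.CompareExchange(ref _customCategoryTable, created, null) ?? created;
        }

        // Hashtable supports multiple concurrent readers with a single writer,
        // so the read needs no lock.
        object value = table[category];
        if (value == null)
        {
            value = load(category);
            lock (table) // serialize writers only
            {
                table[category] = value;
            }
        }
        return value;
    }

    // Analogous to resetting the cached table; no lock is needed for the reset.
    public void Close() => Volatile.Write(ref _customCategoryTable, null);
}
```

With this shape, a concurrent `Close()` can at worst cause a caller to populate a table that is about to be discarded; it cannot cause a NullReferenceException inside `GetOrAdd`.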

> Might that be a performance/scalability problem?

I followed the internal callers of FindCustomCategory() and it is used only in the init path today, not in a hot path.
However, your solution is better, so I agree with you. Thanks a lot.

MarcoRossignoli · 2018-03-14T16:56:11Z

@dotnet-bot test OSX x64 Debug Build

MarcoRossignoli · 2018-03-14T16:56:39Z

@dotnet-bot test Windows x86 Release Build

brianrob · 2018-03-14T23:33:58Z

Looks like this one is ready to merge. @stephentoub do you have any other feedback?

stephentoub · 2018-03-16T16:37:07Z

+            lock (_customCategoryTableLock)
+            {
+                _customCategoryTable = null;
+            }


Why is this lock still needed?

I understand, I'll remove it. Thanks!

stephentoub · 2018-03-16T16:37:24Z

                                //
                                categoryType = PerformanceCounterCategoryType.Unknown;
-                                _customCategoryTable[category] = categoryType;
+                                lock (_customCategoryTableLock)


You shouldn't need a separate _customCategoryTableLock... just lock on table.

You're right. Thanks!

@stephentoub should be ok now

stephentoub · 2018-03-16T16:38:33Z

        {
            RegisterFiles(categoryName, true);
            DeleteRegistryEntry(categoryName);
-            CloseAllTables();


Why is CloseAllTables no longer needed?

CloseAllLibraries() calls Close() on every PerformanceCounterLib instance, which in turn calls CloseTables(). Isn't this redundant?

I don't know. I'm not familiar with the codebase.

@brianrob Can you confirm?

I'm inclined to say that we should leave CloseAllTables in place. I agree that the work is done implicitly by calling CloseAllLibraries, but some of this work is done outside of locks and I'd hate to change the characteristics of a race condition and break user code. Given that this functionality is here for compatibility, I'd leave it.

@brianrob ok

MarcoRossignoli · 2018-03-17T18:01:14Z

@dotnet-bot test Linux x64 Release Build please
@dotnet-bot test OSX x64 Debug Build please

MarcoRossignoli · 2018-03-18T13:50:52Z

@dotnet-bot test OSX x64 Debug Build please

MarcoRossignoli · 2018-03-19T08:18:59Z

@dotnet-bot test OSX x64 Debug Build please

danmoseley · 2018-03-22T20:43:21Z

@brianrob do you have further feedback? I'd like to get this committed to see whether it helps tests stop being flaky.

brianrob · 2018-03-22T21:18:41Z

@danmosemsft, I just replied back to the last comment. Once the requested changes are made as long as @stephentoub is OK with the changes, I am too.

MarcoRossignoli · 2018-03-22T21:34:08Z

@danmosemsft @brianrob @stephentoub restored CloseAllTables().

brianrob · 2018-03-22T22:00:45Z

@dotnet-bot test NETFX x86 Release Build
@dotnet-bot test UWP CoreCLR x64 Debug Build

brianrob

LGTM. Thanks!

* sync PerformanceCounterLib
* add enclosing braces
* address PR feedback
* address PR feedback
* address PR feedback
* nit: trim spaces
* address PR feedback
* address stephentoub PR feedback
* nit: tab
* nit: tab
* address PR feedback

Commit migrated from dotnet/corefx@41c8ba2

sync PerformanceCounterLib

9c820bb

MarcoRossignoli changed the title ~~Sync PerformanceCounterLib~~ WIP Sync PerformanceCounterLib Jan 20, 2018

karelz added the area-System.Diagnostics.Tracing label Jan 25, 2018

karelz assigned MarcoRossignoli, brianrob, vancem and michellemcdaniel Jan 25, 2018

MarcoRossignoli changed the title ~~WIP Sync PerformanceCounterLib~~ [WIP] Sync PerformanceCounterLib Jan 25, 2018

MarcoRossignoli changed the title ~~[WIP] Sync PerformanceCounterLib~~ Sync PerformanceCounterLib Jan 31, 2018

danmoseley requested a review from michellemcdaniel February 16, 2018 17:34

brianrob reviewed Feb 21, 2018

add enclosing braces

6af6105

danmoseley requested a review from stephentoub February 23, 2018 18:46

MarcoRossignoli added 2 commits March 7, 2018 11:44

address PR feedback

f18a355

address PR feedback

33b961c

stephentoub reviewed Mar 14, 2018

MarcoRossignoli added 2 commits March 14, 2018 16:06

address PR feedback

b39a24d

nit: trim spaces

fe0b5e3

MarcoRossignoli changed the title ~~Sync PerformanceCounterLib~~ [WIP]Sync PerformanceCounterLib Mar 14, 2018

address PR feedback

a9b3372

MarcoRossignoli changed the title ~~[WIP]Sync PerformanceCounterLib~~ Sync PerformanceCounterLib Mar 14, 2018

MarcoRossignoli changed the title ~~Sync PerformanceCounterLib~~ WIP: Sync PerformanceCounterLib Mar 14, 2018

MarcoRossignoli changed the title ~~WIP: Sync PerformanceCounterLib~~ Sync PerformanceCounterLib Mar 14, 2018

stephentoub reviewed Mar 16, 2018

MarcoRossignoli added 3 commits March 16, 2018 19:06

address stephentoub PR feedback

b0c84b7

nit: tab

c24cf83

nit: tab

eec8624

address PR feedback

fd0fb85

brianrob approved these changes Mar 22, 2018

stephentoub merged commit 41c8ba2 into dotnet:master Mar 23, 2018

MarcoRossignoli deleted the perflib-test branch March 23, 2018 07:57

karelz added this to the 2.1.0 milestone Mar 27, 2018

MarcoRossignoli mentioned this pull request Jun 19, 2018

Add concurrent access detection tests to Dictionary<TKey, TValue> #30515

Merged

MarcoRossignoli mentioned this pull request Aug 17, 2018

Add System.Diagnostics.PerformanceData namespace #31474

Merged
