Revert PR #1839 (#1863)

Closed
snnn wants to merge 1 commit into master from snnn/revert_tp

Conversation

@snnn
Contributor

@snnn snnn commented Sep 17, 2019

This reverts commit 166b1f8.

Description:

This reverts commit 166b1f8 because our perf dashboard shows it caused a significant performance regression.

model                     NEW_CHANGE (with #1839)  OLD       DIFF (%)  EnableParallelExecutor
bvlc_googlenet            0.019563                 0.015691  24        0
inception_resnet_v2       0.101839                 0.08823   15        0
inception_v2              0.015011                 0.009312  61        0
mlperf_mobilenet          0.0041                   0.002606  57        0
mlperf_resnet             0.023533                 0.014486  62        0
mlperf_ssd_mobilenet_300  0.024972                 0.021661  15        0
prod_model11              0.002839                 0.002385  19        0
prod_model2               0.011392                 0.007258  56        0
resnet50                  0.023617                 0.014324  64        0
squeezenet                0.003154                 0.0022    43        0

Note:
The second column (NEW_CHANGE) was run with thread pool size = 4.
The third column (OLD) was run with thread pool size = 3.


@snnn snnn requested a review from a team as a code owner September 17, 2019 21:11
@skottmckay
Contributor

Do we understand why? I'm guessing by adding the +1 you forced lots of things into the 'total > NumThreads()' branch which doesn't re-use the current thread.

@snnn
Contributor Author

snnn commented Sep 17, 2019

Do we understand why? I'm guessing by adding the +1 you forced lots of things into the 'total > NumThreads()' branch which doesn't re-use the current thread.

I didn't add the +1, I removed it, so the logic stays the same: it should still re-use the current thread. However, that little trick (reusing the thread) only improved performance by about 10%, while the perf drop I'm seeing is much bigger.

@snnn
Contributor Author

snnn commented Sep 18, 2019

Do we understand why? I'm guessing by adding the +1 you forced lots of things into the 'total > NumThreads()' branch which doesn't re-use the current thread.

If the machine has 4 CPUs but 5 threads, the number of CPU migrations increases, all the math functions slow down, and instructions per cycle drop.

So,

  1. Either enable main-thread reuse and set the thread pool's thread count to the number of CPUs minus one,
  2. Or disable the reuse and set the thread pool's thread count to exactly the number of CPUs.

But the second option is about 10% slower than the first on the mlperf_resnet50 model.

@snnn
Contributor Author

snnn commented Sep 18, 2019

I wrote a standalone benchmark program that does not use onnxruntime (just Eigen).

Task:

int64_t sum = 0;  // int64_t: the sum (~5e9) would overflow a 32-bit int
for (int i = 0; i != 100000; ++i)
    sum += i;
total += sum;

4 tasks, on a 4-core CPU.
Run on (8 X 3408 MHz CPUs)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)

Benchmark                    Time      CPU       Iterations
BM_ParallelForReuse/3        42789 ns  42729 ns  301684
BM_ParallelForReuse/4        58199 ns  58086 ns  240214
BM_ParallelForReuseNoSpin/3  38812 ns  38750 ns  350000
BM_ParallelForReuseNoSpin/4  39656 ns  39656 ns  356972
BM_ParallelFor/4             67346 ns  47862 ns  291857
BM_ParallelForNoSpin/4       67017 ns  47811 ns  314386

The number suffix in each benchmark name is the thread count.
It shows:

  1. Reusing the calling thread is better.
  2. If the spin lock inside the thread pool is disabled, the thread count doesn't matter.
