Add steps to install multi-threaded OpenBLAS on Ubuntu by kloudkl · Pull Request #80 · BVLC/caffe

kloudkl · 2014-02-07T09:56:21Z

Multi-threaded OpenBLAS makes a huge performance difference. The benchmarks with and without it in comments to #16 demonstrated more than 5 times speed-up for boost-eigen and MKL on a machine with 4 Hyper-Threading CPU cores (supporting 8 threads).

This fixes #79.

Yangqing · 2014-02-07T16:04:27Z

Are you sure when using boost-eigen, you are compiling with multi-thread enabled? boost-eigen naturally comes with multithreaded gemm, which would probably account for most of the gain you are observing.

Yangqing · 2014-02-07T16:07:47Z

http://stackoverflow.com/questions/14783219/how-to-speed-up-eigen-librarys-matrix-product

kloudkl · 2014-02-07T18:39:30Z

To make it clear whether OpenBLAS or Eigen contributed to the performance improvements in the boost-eigen branch, three groups of benchmark experiments with different compilation flags are conducted using the lenet*.prototxt. For all the experiments, max iter is set to 3,000 and solver_mode is 0.

cf_id	compilation flags
1	-latlas -lcblas -fopenmp
2	-lopenblas
3	-lopenblas -fopenmp

To check the effects of threads number, three runtime environment variables combinations are tested.

rev_id	runtime environment variables
1	``
2	`OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4`
3	`OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8`

In all the experiments, max iter is set to 3,000 and solver_mode is set to 0 in lenet_solver.prototxt.

cf_id	rev_id	real time	user time	system time
1	1	500.638	500.559	0.328
1	2	501.15	501.37	0.26
2	1	99.787	230.694	166.238
2	2	99.42	228.74	166.25
2	3	100.56	232.78	166.66
3	1	99.915	231.802	165.206
3	2	99.34	229.79	165.15
3	3	99.73	232.86	163.89

Comparing the results of compilation flags 1 and 3, it is evident that the multi-threaded OpenBLAS runs about 5 times faster than the normal ATLAS. The similar performances of compilation flags 2 and 3 prove that enabling OpenMP for Eigen does not help at all in this setting.

Yangqing · 2014-02-07T18:45:24Z

I still do not think you are using the multithreaded version of eigen3.
With benchmarks as follows:

https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba

it would be extremely unlikely that eigen itself is bad in multithreading.
Could you double-check with a gemm call that your eigen version is using
multiple threads (by e.g. looking at top)?

Again, using lenet is not a good idea to benchmark things, use
net_speed_test instead, which fits real-world use cases better.

Yangqing

On Fri, Feb 7, 2014 at 10:39 AM, kloudkl notifications@github.com wrote:

To make it clear whether OpenBLAS or Eigen contributed to the performance
improvements in the boost-eigen branch, three groups of benchmark
experiments with different compilation flags are conducted using the
lenet*.prototxt. For all the experiments, max iter is set to 3,000 and
solver_mode is 0.
cf_id compilation flags 1 -latlas -lcblas -fopenmp 2 -lopenblas 3 -lopenblas
-fopenmp

To check the effects of threads number, three runtime environment
variables combinations are tested.
rev_id runtime environment variables 1 `` 2 OPENBLAS_NUM_THREADS=4
OMP_NUM_THREADS=4 3 OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8

In all the experiments, max iter is set to 3,000 and solver_mode is set to
0 in lenet_solver.prototxt.
cf_id rev_id real time user time system time 1 1 500.638 500.559 0.328
1 2 501.15 501.37 0.26 2 1 99.787 230.694 166.238 2 2 99.42 228.74
166.25 2 3 100.56 232.78 166.66 3 1 99.915 231.802 165.206 3 2 99.34
229.79 165.15 3 3 99.73 232.86 163.89

Comparing the results of compilation flags 1 and 3, it is evident that the
multi-threaded OpenBLAS runs about 5 times faster than the normal ATLAS.
The similar performances of compilation flags 2 and 3 prove that enabling
OpenMP for Eigen does not help at all in this setting.

Reply to this email directly or view it on GitHubhttps://github.com//pull/80#issuecomment-34486255
.

Yangqing · 2014-02-07T18:48:54Z

I'd like to make my arguments clear:

(1) I am not comparing ATLAS with OpenBLAS - it is known that ATLAS is
inherently single-threaded. Eigen compiled with Atlas backend is not what I
mean here - I mean native implementations of blas inside eigen.

(2) small datasets like MNIST does not reflect actual use cases such as
imagenet. In imagenet experiments, more than 80% computation time is spent
on gemm, so it really boils down to the point whether gemm is parallelized
or not - and I believe eigen does have that parallelized. Please provide a
more detailed analysis where the speedup comes from and why, rather than an
end-to-end run (honestly, that may not reveal too much information).

Yangqing

On Fri, Feb 7, 2014 at 10:44 AM, Yangqing Jia jiayq84@gmail.com wrote:

I still do not think you are using the multithreaded version of eigen3.
With benchmarks as follows:

https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba

it would be extremely unlikely that eigen itself is bad in multithreading.
Could you double-check with a gemm call that your eigen version is using
multiple threads (by e.g. looking at top)?

Again, using lenet is not a good idea to benchmark things, use
net_speed_test instead, which fits real-world use cases better.

Yangqing

On Fri, Feb 7, 2014 at 10:39 AM, kloudkl notifications@github.com wrote:

To make it clear whether OpenBLAS or Eigen contributed to the performance
improvements in the boost-eigen branch, three groups of benchmark
experiments with different compilation flags are conducted using the
lenet*.prototxt. For all the experiments, max iter is set to 3,000 and
solver_mode is 0.
cf_id compilation flags 1 -latlas -lcblas -fopenmp 2 -lopenblas 3 -lopenblas
-fopenmp

To check the effects of threads number, three runtime environment
variables combinations are tested.
rev_id runtime environment variables 1 `` 2 OPENBLAS_NUM_THREADS=4
OMP_NUM_THREADS=4 3 OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8

In all the experiments, max iter is set to 3,000 and solver_mode is set
to 0 in lenet_solver.prototxt.
cf_id rev_id real time user time system time 1 1 500.638 500.559 0.328
1 2 501.15 501.37 0.26 2 1 99.787 230.694 166.238 2 2 99.42 228.74
166.25 2 3 100.56 232.78 166.66 3 1 99.915 231.802 165.206 3 2 99.34
229.79 165.15 3 3 99.73 232.86 163.89

Comparing the results of compilation flags 1 and 3, it is evident that
the multi-threaded OpenBLAS runs about 5 times faster than the normal
ATLAS. The similar performances of compilation flags 2 and 3 prove that
enabling OpenMP for Eigen does not help at all in this setting.

Reply to this email directly or view it on GitHubhttps://github.com//pull/80#issuecomment-34486255
.

Yangqing · 2014-02-07T19:00:31Z

I looked at the code more closely and now I have a little clearer picture on what caused this. in caffe/util/math_functions.cpp the gemm calls are still made using cblas_gemm instead of the eigen function, making the framework effectively still using atlas rather than eigen to carry out gemm. I will close this issue and open a separate issue indicating this necessary change for boost-eigen. If you would like to do a more detailed comparison please feel free to. Thanks for finding this bug!

shelhamer · 2014-02-07T19:06:05Z

Thank you for all this benchmarking work!

shelhamer · 2014-02-10T23:47:01Z

INSTALL.md has been replaced with a pointer to the online installation documentation to avoid the overhead of duplication, so refer to #81.

jeffhammond · 2015-05-13T02:53:05Z

This statement is categorically false: "it is known that ATLAS is inherently single-threaded." ATLAS has been threaded for 5+ years

http://math-atlas.sourceforge.net/faq.html#tnum
http://math-atlas.sourceforge.net/timing/newThr395/index.html

Add cudnn v4 batch normalization integration

* Fix boost shared_ptr issue in python interface * Default output model name for bn convert style script * Fix bugs in generation bn inference model * Script to convert inner product to convolution * Script to do polyak averaging

standardize memory optimization configurations * yjxiong/fix/mem_config: take care of share data with excluded blob improvise memory opt configs fix cudnn conv legacy bug (BVLC#96) add TOC Update README.md Update README.md (BVLC#95) Update README.md Improve the python interface (BVLC#80) Update README.md

…caffe into imagenet_vid_2016 * 'imagenet_vid_2016' of https://github.com/myfavouritekk/caffe: take care of share data with excluded blob Revert "Fix a but when setting no_mem_opt: true for layers near in-place layers." improvise memory opt configs fix cudnn conv legacy bug (BVLC#96) add TOC Update README.md Update README.md (BVLC#95) Update README.md Improve the python interface (BVLC#80) Update README.md

Add steps to install multi-threaded OpenBLAS on Ubuntu

472064d

kloudkl mentioned this pull request Feb 7, 2014

Add steps to install multi-threaded OpenBLAS on Ubuntu for the gh-pages branch #81

Closed

Yangqing closed this Feb 7, 2014

Yangqing mentioned this pull request Feb 7, 2014

Make gemm fully dependent on eigen #84

Closed

kloudkl mentioned this pull request Feb 8, 2014

Replace atlas/cblas routines with Eigen in the math functions #85

Closed

kloudkl deleted the multi_threaded_blas branch February 11, 2014 05:47

shelhamer added the hardware/portability label Feb 12, 2014

thatguymike added a commit to thatguymike/caffe that referenced this pull request Dec 2, 2015

Merge pull request BVLC#80 from andrei-pokrovsky/caffe-0.14

e9f8357

Add cudnn v4 batch normalization integration

tkasarla mentioned this pull request Mar 11, 2017

matcaffe throws block of errors when trying to run code. #5388

Open

shuguang101 mentioned this pull request Jan 20, 2018

Segmentation Fault: 11 - OSX high sierra - please Help #6019

Open

GerHobbelt pushed a commit to GerHobbelt/caffe that referenced this pull request Sep 28, 2025

refactor: make logger torch-compatible (BVLC#80)

4ce5bff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Add steps to install multi-threaded OpenBLAS on Ubuntu#80

Add steps to install multi-threaded OpenBLAS on Ubuntu#80
kloudkl wants to merge 1 commit intoBVLC:masterfrom
kloudkl:multi_threaded_blas

kloudkl commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

kloudkl commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

shelhamer commented Feb 7, 2014

Uh oh!

shelhamer commented Feb 10, 2014

Uh oh!

jeffhammond commented May 13, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

kloudkl commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

kloudkl commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

Yangqing commented Feb 7, 2014

Uh oh!

shelhamer commented Feb 7, 2014

Uh oh!

shelhamer commented Feb 10, 2014

Uh oh!

jeffhammond commented May 13, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants