
AutoTVM x86 #1772

Merged: yzhliu merged 13 commits into apache:master from kevinthesun:AutoTVMx86 on Oct 18, 2018
Conversation

@kevinthesun (Contributor)

Apply an AutoTVM-style schedule to the x86 conv2d template.
Add a new dispatch context, "ApplyGraphBest", for AutoTVM to load graph-level best schedules. This is useful for the Graph Tuner.

@merrymercy @eqy @yzhliu

Comment thread topi/python/topi/x86/conv2d.py Outdated
return None

import ast
kernel_size = ast.literal_eval(attrs["kernel_size"])
Member:

can use attrs.get_int_tuple
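
A minimal sketch of the suggested change, assuming `attrs` here is NNVM's AttrDict (the `ast` version is the code under review; the last line is the proposed replacement):

```python
import ast

# Current approach: parse the stringified attribute by hand.
kernel_size = ast.literal_eval(attrs["kernel_size"])

# Suggested approach: let AttrDict parse it into a tuple of ints.
kernel_size = attrs.get_int_tuple("kernel_size")
```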


# Set number of threads used for tuning based on the number of
# physical cpu cores on your machine.
num_threads = 1
Member:

Unused variable?
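
For context, a minimal sketch of how such a variable is typically wired up in the x86 AutoTVM tutorial (the intent assumed here is to size TVM's runtime thread pool via the TVM_NUM_THREADS environment variable):

```python
import os

# Bind the TVM runtime thread pool to the number of physical CPU cores
# used for measurement; without this, the variable indeed has no effect.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
```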

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
'early_stopping': None,

'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(n_parallel=1),
Member:

Can we use a larger n_parallel?

Contributor Author:

The docstring says that if n_parallel is None, it will use all CPU cores. Does that mean that if we have 18 cores, 18 jobs will be launched in parallel, with each job using 1 CPU core? The number of cores used by each process can also be controlled by TVM_NUM_THREADS.

@merrymercy (Member), Sep 28, 2018:

Measurement = Build + Run.
Build jobs can be launched in parallel; run jobs are executed sequentially.
So it is safe to remove n_parallel=1 here.
Then the loop will compile 32 ConfigEntities in parallel and run them sequentially.
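
A minimal sketch of the suggested measure option, assuming the standard autotvm LocalBuilder/LocalRunner API (the runner arguments are illustrative):

```python
from tvm import autotvm

# Without n_parallel=1 the builder compiles candidate configs on all CPU
# cores in parallel; the runner then measures them one after another.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1),
)
```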

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

#tune_and_evaluate(tuning_option)
Member:

It's better to add a sample output here

@merrymercy (Member) commented Sep 26, 2018:

Some other points

  1. We should upload some pre-tuned parameters for resnet and delete the old records in x86/conv2d.py

  2. In terms of the graph-level tuner, we should think about the formalization introduced by the paper we discussed in [RFC][Graph Tuner] Graph level auto-tuning #1585.
    Currently we have different implementations for x86 CPU and ARM CPU. Specifically, we have spatial pack, NCHWc_common, NCHWc_1x1 and winograd. On both x86 and ARM, each of them has a chance to be the best for some input shapes. So the optimal solution is to merge the templates for x86 and ARM CPU, and let the graph tuner choose among them by considering both kernel time and layout-transform time.
    Your current graph tuner only supports the NCHWc template. Do you think it is easy to generalize it to support arbitrary templates? I think this can improve performance a lot, at least on ARM CPU. And the framework will be quite generalizable, since we can add many templates and let the graph tuner make the final optimal choice.

@kevinthesun (Contributor, Author) commented Sep 26, 2018:

The graph tuner supports general schedule templates. The key point is to define how to generate layout transformations given a schedule template. For now I just added a built-in layout transform generator for NCHWc, but it is not hard to add support for other formats.

I agree that we should eventually tune across different algorithms. The key part is still managing the possible layout transforms between given schedules, which might come from different algorithms.

In terms of problem formulation, I would like to treat it as more solver-specific. For a PBQP solver, we may want to formulate it as a graph plus cost matrices. For DP or RL methods, a states/actions representation might be more suitable. In the current design, the base class converts records from autotvm and manages the fundamental data structures; each specific solver then uses these data structures to formulate the problem in a way that suits it. What do you think about this approach? In the current graph tuner implementation there is some DP-solver-specific information in the base class; I'll refactor it and move it under DPExecutor.

@kevinthesun (Contributor, Author):

@merrymercy https://github.com/dmlc/tvm/blob/master/topi/tests/python/test_topi_conv2d_nchw.py#L79: it looks like this test creates the x86 conv2d declaration but gets the ARM conv2d schedule from tophub.

@merrymercy (Member):

Yes, I will fix this

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated

tuning_option = {
'log_filename': log_file,
'tuner': 'gridsearch',
Contributor:

Is this a good default here?

Contributor Author:

Usually it takes several hours to tune a model on an x86 machine, and grid search gives us the best possible schedules. I think it's worth getting the best possible performance here.
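
For reference, a minimal sketch of how the tuner choice plugs into the tuning loop, assuming the standard autotvm tuner classes; the helper function is illustrative:

```python
from tvm import autotvm

def create_tuner(task, tuner_name):
    # Exhaustive grid search finds the best schedule in the space but is slow;
    # random or model-based tuners trade a little quality for tuning time.
    if tuner_name == 'gridsearch':
        return autotvm.tuner.GridSearchTuner(task)
    if tuner_name == 'random':
        return autotvm.tuner.RandomTuner(task)
    return autotvm.tuner.XGBTuner(task)
```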

@eqy (Contributor) commented Oct 8, 2018 via email.

@kevinthesun (Contributor, Author):

Comments addressed.

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
# instead of plain conv2d.
#
# We will use local mode for tuning configuration. RPC tracker
# mode can be setup similarly to the approach in autotvm
Member:

"setup similar as that in ..." and add :ref: for the tutorial?

strides = strides if isinstance(strides, (tuple, list)) else (strides, strides)
if layout == 'NCHW':
_create_schedule_template(cfg, data, kernel, strides, padding, layout)
if cfg.is_fallback:
Member:

Just trying to learn: where does this is_fallback come from?
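
For background, a minimal sketch of where is_fallback typically comes from in AutoTVM, assuming the standard DispatchContext behavior (the `target` and `workload` variables are illustrative, not defined here):

```python
from tvm import autotvm

# When no tuned record matches the (target, workload) pair, the active
# DispatchContext returns a FallbackConfigEntity, whose is_fallback flag is
# True; configs loaded from a tuning log report False. Templates check the
# flag to install a hand-written default schedule instead.
cfg = autotvm.task.DispatchContext.current.query(target, workload)
if cfg.is_fallback:
    pass  # fill cfg with a reasonable default schedule here
```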


@yzhliu (Member) commented Oct 11, 2018:

Looks good to me. @merrymercy could you review again?

Comment thread topi/python/topi/x86/conv2d.py Outdated
global_dict_key = workload
dispatch_ctx.update_global_dict(global_dict_key, cfg)

if is_kernel_1x1:
@merrymercy (Member), Oct 15, 2018:

If the kernel is 1x1, these two layouts are the same in memory.

Comment thread topi/python/topi/x86/conv2d.py Outdated
_, _, kh, kw = get_const_tuple(kernel.shape)
is_kernel_1x1 = kh == 1 and kw == 1
return conv2d_avx_1x1._declaration_conv(*args) if is_kernel_1x1 else \
conv2d_avx_common._declaration_conv(*args)
@merrymercy (Member), Oct 15, 2018:

It seems these two declarations are the same. If the length of a reduction axis is 1, TVM can eliminate that axis.
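
A minimal sketch of that observation, using the current tvm.te API (the shapes are illustrative): a 1x1-kernel compute declaration only differs from the common case in reduction axes of length 1, which TVM can simplify away.

```python
import tvm
from tvm import te

# A 1x1-kernel convolution-like compute: the kh/kw reduction axes have
# length 1, so the declaration is equivalent to the common case.
data = te.placeholder((1, 64, 56, 56), name="data")
kernel = te.placeholder((64, 64, 1, 1), name="kernel")
rc = te.reduce_axis((0, 64), name="rc")
rkh = te.reduce_axis((0, 1), name="rkh")   # length-1 axis
rkw = te.reduce_axis((0, 1), name="rkw")   # length-1 axis
conv = te.compute(
    (1, 64, 56, 56),
    lambda n, f, y, x: te.sum(
        data[n, rc, y + rkh, x + rkw] * kernel[f, rc, rkh, rkw],
        axis=[rc, rkh, rkw]),
    name="conv")
```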

@merrymercy (Member) commented Oct 15, 2018:

I think the declarations of 1x1 and common are the same.

Is it easy to merge them into a single template, even though this may enlarge the search space and make tuning take longer? If it is not easy, we can just merge this as is.

@yzhliu (Member) commented Oct 15, 2018:

I agree with @merrymercy that the compute declarations are the same and can be merged. Since the schedules are slightly different, though, I guess we still need two templates?

@yzhliu self-assigned this on Oct 15, 2018
@kevinthesun (Contributor, Author):

Most parts of the declaration are the same except the kernel shape, so the computes are slightly different. For templates, I prefer keeping two; that makes auto-tuning easier.

@merrymercy (Member) commented Oct 15, 2018:

If we can build a search space to cover both cases, it will be better.

@kevinthesun (Contributor, Author):

If we want to use a single template, certain parameters won't be used in either the common or the 1x1 case. I feel this might cause a bit of confusion. Also, I have already completed tuning for Gluon CV models on avx512 and avx2 CPUs. How difficult would it be to directly transfer those records to the new template?

@merrymercy (Member) commented Oct 15, 2018:

We can keep two different schedules, but the compute declarations are equivalent: they only differ in kernel dimensions whose lengths are one, so their layouts in memory are the same.

@yzhliu merged commit dc996e4 into apache:master on Oct 18, 2018
@yzhliu (Member) commented Oct 18, 2018:

Thanks @kevinthesun @merrymercy @eqy

@merrymercy (Member) commented Oct 18, 2018:

I am afraid that after this PR, users will see too many warning messages "Cannot find a config for workload xxx, A fallback configuration is used..." when they try CPU demos.
We should not disable this warning because it is critical for people who care about performance.
How can we deal with this in the long run?

One quick fix is to upload some tuned results for common networks to TopHub.
Steps:

  1. Upstream a 'llvm_v0.01.log' to https://github.com/uwsampl/tvm-distro/tree/master/tophub
  2. Add the line 'llvm': "v0.01", to https://github.com/dmlc/tvm/blob/dc996e451a3fee9dffefc31b652b6e85a72cb041/python/tvm/autotvm/tophub.py#L22-L26

Then TVM will download the pre-tuned configs when compiling models.
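
A minimal sketch of how those pre-tuned configs get picked up at compile time, assuming the NNVM-era API used in this tutorial (the network and target string are illustrative):

```python
import nnvm.compiler
import nnvm.testing
import tvm
from tvm import autotvm

# Load a sample network (resnet-18) with random weights.
net, params = nnvm.testing.resnet.get_workload(num_layers=18, batch_size=1)

# tophub.context downloads the pre-tuned log registered for this target (if
# any) and applies it as a dispatch context, so compilation picks tuned
# schedules instead of fallback configs.
target = tvm.target.create('llvm -mcpu=skylake-avx512')
with autotvm.tophub.context(target):
    graph, lib, params = nnvm.compiler.build(
        net, target=target, shape={'data': (1, 3, 224, 224)}, params=params)
```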

@eqy (Contributor) commented Oct 18, 2018:

I think adding configs to TopHub for x86 is a good idea.
Do we currently have a way of specifying which models of x86 CPUs should share configs? There are likely to be x86 users with many different CPU models in the long run.

@kevinthesun (Contributor, Author):

I did some experiments on AWS EC2 machines. Currently the only flag differentiating x86 CPU models is the instruction set: C5 and M5, which have avx512, can share schedules, while C4 and M4, with avx2, can share schedules. Later we can identify more specs and provide a guideline for developers to choose pre-tuned schedules.

@merrymercy (Member):

Now we use -model in the target to match the log; the model is recorded in the pre-tuned log.

For example, for NVIDIA GPU, the targets in the log contain cuda -model=1080ti, cuda -model=titanx.
Then users specify model when they create the target.

https://github.com/dmlc/tvm/blob/4c13ee22bf8e9f693176418ee77f65890327ec3e/apps/benchmark/gpu_imagenet_bench.py#L41

@eqy (Contributor) commented Oct 18, 2018:

> Now we use -model in target to match the log. We record model in the pre-tuned log.
>
> For example, for NVIDIA GPU, the targets in the log contain cuda -model=1080ti, cuda -model=titanx.
> Then users specify model when they create the target.
>
> tvm/apps/benchmark/gpu_imagenet_bench.py (line 41 in 4c13ee2):
> target = tvm.target.create('%s -model=%s' % (args.target, args.model))

Model is interesting, as there are currently a few different ways of specifying the "model" of an x86 CPU: uarch (skylake-avx512 vs. avx2, ...), actual model (4790K vs. 8700K vs. 8180, ...), and cloud instances (c5 vs. m4, ...). But I think we can defer solving this problem at least for a little while, even if we have to rely on a few profile runs before execution; we can just do what cuDNN does here and try out a few schedules if the user does not know how, or does not want, to specify their exact model.
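
A minimal sketch of what that could look like for x86, following the -model convention above; the 'c5' tag is illustrative, not an established convention:

```python
import tvm

# -mcpu picks the instruction set (AVX-512 here); -model is a free-form tag
# that only needs to match the target string recorded in the pre-tuned log.
target = tvm.target.create('llvm -mcpu=skylake-avx512 -model=c5')
print(target)
```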

@masahi (Member) commented Oct 24, 2018:

I tried the pre-tuned avx512 schedule available on tophub on a Core i9-7940X. I assume this schedule was tuned on a Xeon, and I wondered whether the same schedule is equally good on a Core i9 chip.

This is the result I got:

resnet-50 : 9.54 ms
vgg-16: 25.35 ms

I can't compare with MKL-DNN at the moment, but this result looks very good. I'm looking forward to testing on more networks (esp. densenet-121) and graph tuner. @kevinthesun

@kevinthesun (Contributor, Author):

@masahi Thank you for testing. Applying pre-tuned schedules to similar hardware is an interesting topic, and we might want to investigate this area more. I'll update the graph tuner PR and upload more pre-tuned schedules soon.

@pengzhao-intel commented Oct 24, 2018:

@masahi what help do you need to test MKL-DNN?

At the model level, you can use MXNet with the subgraph feature to benchmark; resnet50 and vgg-16 are available :)
https://github.com/apache/incubator-mxnet/blob/master/MKLDNN_README.md#6
https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/benchmark_score.py

At the primitive level, you can use benchdnn to test:
https://github.com/intel/mkl-dnn/tree/master/tests/benchdnn

```
./benchdnn --conv --mode=PC -v1 --mb=128 --dir=FWD_B --cfg=f32 g1ic1280oc1152_ih8oh8kh1sh1dh0ph0_iw8ow8kw1sw1dw0pw0

unfused convs:
1280x320:  total perf: min(ms):1.08325 avg(ms):1.12108
1280x384:  total perf: min(ms):1.24365 avg(ms):1.36292
1280x448:  total perf: min(ms):1.50342 avg(ms):1.60265
sum:       min:3.83032 avg:4.08665

fused conv (320 + 384 + 448):
1280x1152: total perf: min(ms):3.75757 avg(ms):5.03734
```

@masahi (Member) commented Oct 24, 2018:

@pengzhao-intel Unfortunately, my avx-512-capable machine is Windows-only. As I mentioned in apache/mxnet#12891, MXNet + MKL-DNN performance on Windows is poor at the moment, due to the old MKL-DNN submodule.

I'm waiting for the submodule update (which I noticed just popped up in apache/mxnet#12953).

@pengzhao-intel commented Oct 24, 2018:

Got it. The update is in that PR :) But we don't test performance on Windows very much :(
If there's an issue, feel free to let me know.

FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018
* AutoTVM for x86 conv2d

* Add ApplyGraphBest dispatch context

* Fix tutorial

* Fix conv2d

* Improve tutorial

* Fix default schedule

* Fix 1x1 default schedule loading

* Fix workload type

* Change gridsearch to random

* Add reference to autotvm arm

* Merge conv2d common and 1x1 decl

* Fix lint

* Minor fix
@kevinthesun deleted the AutoTVMx86 branch on May 28, 2019