
AutoTVM x86 #1772

Merged: yzhliu merged 13 commits into apache:master from kevinthesun:AutoTVMx86 on Oct 18, 2018
Conversation

@kevinthesun (Contributor)

Apply an AutoTVM-style schedule to the x86 conv2d template.
Add a new dispatch context, "ApplyGraphBest", for AutoTVM to load graph-level best schedules. This is useful for the Graph Tuner.

@merrymercy @eqy @yzhliu

Comment thread topi/python/topi/x86/conv2d.py Outdated
return None

import ast
kernel_size = ast.literal_eval(attrs["kernel_size"])
Member:

can use attrs.get_int_tuple
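
A minimal sketch of the suggested change, assuming `attrs` here is NNVM's AttrDict (the `ast` version is the code under review; the last line is the proposed replacement):

```python
import ast

# Current approach: parse the stringified attribute by hand.
kernel_size = ast.literal_eval(attrs["kernel_size"])

# Suggested approach: let AttrDict parse it into a tuple of ints.
kernel_size = attrs.get_int_tuple("kernel_size")
```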


# Set number of threads used for tuning based on the number of
# physical cpu cores on your machine.
num_threads = 1
Member:

Unused variable?
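
For context, a minimal sketch of how such a variable is typically wired up in the x86 AutoTVM tutorial (the intent assumed here is to size TVM's runtime thread pool via the TVM_NUM_THREADS environment variable):

```python
import os

# Bind the TVM runtime thread pool to the number of physical CPU cores
# used for measurement; without this, the variable indeed has no effect.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
```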

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
'early_stopping': None,

'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(n_parallel=1),
Member:

Can we use a larger n_parallel?

Contributor Author:

The docstring says that if n_parallel is None, it will use all CPU cores. Does that mean that if we have 18 cores, 18 jobs will be launched in parallel, with each job using 1 CPU core? The number of cores used by each process can also be controlled by TVM_NUM_THREADS.

@merrymercy (Member), Sep 28, 2018:

Measurement = Build + Run.
Build jobs can be launched in parallel; run jobs are executed sequentially.
So it is safe to remove n_parallel=1 here.
Then the loop will compile 32 ConfigEntities in parallel and run them sequentially.
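
A minimal sketch of the suggested measure option, assuming the standard autotvm LocalBuilder/LocalRunner API (the runner arguments are illustrative):

```python
from tvm import autotvm

# Without n_parallel=1 the builder compiles candidate configs on all CPU
# cores in parallel; the runner then measures them one after another.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1),
)
```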

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

#tune_and_evaluate(tuning_option)
Member:

It's better to add a sample output here

@merrymercy (Member) commented Sep 26, 2018:

Some other points

  1. We should upload some pre-tuned parameters for resnet and delete the old records in x86/conv2d.py

  2. In terms of the graph-level tuner, we should think about the formalization introduced by the paper we discussed in [RFC][Graph Tuner] Graph level auto-tuning #1585.
    Currently we have different implementations for x86 CPU and ARM CPU. Specifically, we have spatial pack, NCHWc_common, NCHWc_1x1 and winograd. On both x86 and ARM, each of them has a chance to be the best for some input shapes. So the optimal solution is to merge the templates for x86 and ARM CPU, and let the graph tuner choose among them by considering both kernel time and layout-transform time.
    Your current graph tuner only supports the NCHWc template. Do you think it is easy to generalize it to support arbitrary templates? I think this can improve performance a lot, at least on ARM CPU. And the framework will be quite generalizable, since we can add many templates and let the graph tuner make the final optimal choice.

@kevinthesun (Contributor, Author) commented Sep 26, 2018:

The graph tuner supports general schedule templates. The key point is to define how to generate layout transformations given a schedule template. For now I just added a built-in layout transform generator for NCHWc, but it is not hard to add support for other formats.

I agree that we should eventually tune across different algorithms. The key part is still managing the possible layout transforms between given schedules, which might come from different algorithms.

In terms of problem formulation, I would like to treat it as more solver-specific. For a PBQP solver, we may want to formulate it as a graph plus cost matrices. For DP or RL methods, a states/actions representation might be more suitable. In the current design, the base class converts records from autotvm and manages the fundamental data structures; each specific solver then uses these data structures to formulate the problem in a way that suits it. What do you think about this approach? In the current graph tuner implementation there is some DP-solver-specific information in the base class; I'll refactor it and move it under DPExecutor.

@kevinthesun (Contributor, Author):

@merrymercy https://github.com/dmlc/tvm/blob/master/topi/tests/python/test_topi_conv2d_nchw.py#L79: it looks like this test creates the x86 conv2d declaration but gets the ARM conv2d schedule from tophub.

@merrymercy (Member):

Yes, I will fix this

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated

tuning_option = {
'log_filename': log_file,
'tuner': 'gridsearch',
Contributor:

Is this a good default here?

Contributor Author:

Usually it takes several hours to tune a model on an x86 machine, and grid search gives us the best possible schedules. I think it's worth getting the best possible performance here.
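
For reference, a minimal sketch of how the tuner choice plugs into the tuning loop, assuming the standard autotvm tuner classes; the helper function is illustrative:

```python
from tvm import autotvm

def create_tuner(task, tuner_name):
    # Exhaustive grid search finds the best schedule in the space but is slow;
    # random or model-based tuners trade a little quality for tuning time.
    if tuner_name == 'gridsearch':
        return autotvm.tuner.GridSearchTuner(task)
    if tuner_name == 'random':
        return autotvm.tuner.RandomTuner(task)
    return autotvm.tuner.XGBTuner(task)
```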

@eqy (Contributor) commented Oct 8, 2018 via email.

@kevinthesun (Contributor, Author):

Comments addressed.

Comment thread tutorials/autotvm/tune_nnvm_x86.py Outdated
# instead of plain conv2d.
#
# We will use local mode for tuning configuration. RPC tracker
# mode can be setup similarly to the approach in autotvm
Member:

"setup similar as that in ..." and add :ref: for the tutorial?

strides = strides if isinstance(strides, (tuple, list)) else (strides, strides)
if layout == 'NCHW':
_create_schedule_template(cfg, data, kernel, strides, padding, layout)
if cfg.is_fallback:
Member:

Just trying to learn: where does this is_fallback come from?
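
For background, a minimal sketch of where is_fallback typically comes from in AutoTVM, assuming the standard DispatchContext behavior (the `target` and `workload` variables are illustrative, not defined here):

```python
from tvm import autotvm

# When no tuned record matches the (target, workload) pair, the active
# DispatchContext returns a FallbackConfigEntity, whose is_fallback flag is
# True; configs loaded from a tuning log report False. Templates check the
# flag to install a hand-written default schedule instead.
cfg = autotvm.task.DispatchContext.current.query(target, workload)
if cfg.is_fallback:
    pass  # fill cfg with a reasonable default schedule here
```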


@yzhliu (Member) commented Oct 11, 2018:

Looks good to me. @merrymercy could you review again?

Comment thread topi/python/topi/x86/conv2d.py Outdated
global_dict_key = workload
dispatch_ctx.update_global_dict(global_dict_key, cfg)

if is_kernel_1x1:
@merrymercy (Member), Oct 15, 2018:

If the kernel is 1x1, these two layouts are the same in memory.

Comment thread topi/python/topi/x86/conv2d.py Outdated
_, _, kh, kw = get_const_tuple(kernel.shape)
is_kernel_1x1 = kh == 1 and kw == 1
return conv2d_avx_1x1._declaration_conv(*args) if is_kernel_1x1 else \
conv2d_avx_common._declaration_conv(*args)
@merrymercy (Member), Oct 15, 2018:

It seems these two declarations are the same. If the length of a reduction axis is 1, TVM can eliminate that axis.
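
A minimal sketch of that observation, using the current tvm.te API (the shapes are illustrative): a 1x1-kernel compute declaration only differs from the common case in reduction axes of length 1, which TVM can simplify away.

```python
import tvm
from tvm import te

# A 1x1-kernel convolution-like compute: the kh/kw reduction axes have
# length 1, so the declaration is equivalent to the common case.
data = te.placeholder((1, 64, 56, 56), name="data")
kernel = te.placeholder((64, 64, 1, 1), name="kernel")
rc = te.reduce_axis((0, 64), name="rc")
rkh = te.reduce_axis((0, 1), name="rkh")   # length-1 axis
rkw = te.reduce_axis((0, 1), name="rkw")   # length-1 axis
conv = te.compute(
    (1, 64, 56, 56),
    lambda n, f, y, x: te.sum(
        data[n, rc, y + rkh, x + rkw] * kernel[f, rc, rkh, rkw],
        axis=[rc, rkh, rkw]),
    name="conv")
```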

@merrymercy (Member) commented Oct 15, 2018:

I think the declarations of 1x1 and common are the same.

Is it easy to merge them into a single template, even though this may enlarge the search space and make tuning take longer? If it is not easy, we can just merge this as is.

@yzhliu (Member) commented Oct 15, 2018:

I agree with @merrymercy that the compute declarations are the same and can be merged. Since the schedules are slightly different, though, I guess we still need two templates?

@yzhliu self-assigned this on Oct 15, 2018
@kevinthesun (Contributor, Author):

Most parts of the declaration are the same except the kernel shape, so the computes are slightly different. For templates, I prefer keeping two; that makes auto-tuning easier.

@merrymercy (Member) commented Oct 15, 2018:

If we can build a search space to cover both cases, it will be better.

@kevinthesun (Contributor, Author):

If we want to use a single template, certain parameters won't be used in either the common or the 1x1 case. I feel this might cause a bit of confusion. Also, I have already completed tuning for Gluon CV models on avx512 and avx2 CPUs. How difficult would it be to directly transfer those records to the new template?

@merrymercy (Member) commented Oct 15, 2018:

We can keep two different schedules, but the compute declarations are equivalent: they only differ in kernel dimensions whose lengths are one, so their layouts in memory are the same.

@yzhliu merged commit dc996e4 into apache:master on Oct 18, 2018
@yzhliu (Member) commented Oct 18, 2018:

Thanks @kevinthesun @merrymercy @eqy

@merrymercy (Member) commented Oct 18, 2018:

I am afraid that after this PR, users will see too many warning messages "Cannot find a config for workload xxx, A fallback configuration is used..." when they try CPU demos.
We should not disable this warning because it is critical for people who care about performance.
How can we deal with this in the long run?

One quick fix is to upload some tuned results for common networks to TopHub.
Steps:

  1. Upstream a 'llvm_v0.01.log' to https://github.com/uwsampl/tvm-distro/tree/master/tophub
  2. Add the line 'llvm': "v0.01", to https://github.com/dmlc/tvm/blob/dc996e451a3fee9dffefc31b652b6e85a72cb041/python/tvm/autotvm/tophub.py#L22-L26

Then TVM will download the pre-tuned configs when compiling models.
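
A minimal sketch of how those pre-tuned configs get picked up at compile time, assuming the NNVM-era API used in this tutorial (the network and target string are illustrative):

```python
import nnvm.compiler
import nnvm.testing
import tvm
from tvm import autotvm

# Load a sample network (resnet-18) with random weights.
net, params = nnvm.testing.resnet.get_workload(num_layers=18, batch_size=1)

# tophub.context downloads the pre-tuned log registered for this target (if
# any) and applies it as a dispatch context, so compilation picks tuned
# schedules instead of fallback configs.
target = tvm.target.create('llvm -mcpu=skylake-avx512')
with autotvm.tophub.context(target):
    graph, lib, params = nnvm.compiler.build(
        net, target=target, shape={'data': (1, 3, 224, 224)}, params=params)
```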

@eqy (Contributor) commented Oct 18, 2018:

I think adding configs to TopHub for x86 is a good idea.
Do we currently have a way of specifying which models of x86 CPUs should share configs? There are likely to be x86 users with many different CPU models in the long run.

@kevinthesun (Contributor, Author):

I did some experiments on AWS EC2 machines. Currently the only flag differentiating x86 CPU models is the instruction set: C5 and M5, which have avx512, can share schedules, while C4 and M4, with avx2, can share schedules. Later we can identify more specs and provide a guideline for developers to choose pre-tuned schedules.

@merrymercy (Member):

Now we use -model in the target to match the log; the model is recorded in the pre-tuned log.

For example, for NVIDIA GPU, the targets in the log contain cuda -model=1080ti, cuda -model=titanx.
Then users specify model when they create the target.

https://github.com/dmlc/tvm/blob/4c13ee22bf8e9f693176418ee77f65890327ec3e/apps/benchmark/gpu_imagenet_bench.py#L41

@eqy (Contributor) commented Oct 18, 2018:

> Now we use -model in target to match the log. We record model in the pre-tuned log.
>
> For example, for NVIDIA GPU, the targets in the log contain cuda -model=1080ti, cuda -model=titanx.
> Then users specify model when they create the target.
>
> tvm/apps/benchmark/gpu_imagenet_bench.py (line 41 in 4c13ee2):
> target = tvm.target.create('%s -model=%s' % (args.target, args.model))

Model is interesting, as there are currently a few different ways of specifying the "model" of an x86 CPU: uarch (skylake-avx512 vs. avx2, ...), actual model (4790K vs. 8700K vs. 8180, ...), and cloud instances (c5 vs. m4, ...). But I think we can defer solving this problem at least for a little while, even if we have to rely on a few profile runs before execution; we can just do what cuDNN does here and try out a few schedules if the user does not know how, or does not want, to specify their exact model.
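
A minimal sketch of what that could look like for x86, following the -model convention above; the 'c5' tag is illustrative, not an established convention:

```python
import tvm

# -mcpu picks the instruction set (AVX-512 here); -model is a free-form tag
# that only needs to match the target string recorded in the pre-tuned log.
target = tvm.target.create('llvm -mcpu=skylake-avx512 -model=c5')
print(target)
```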

@masahi (Member) commented Oct 24, 2018:

I tried the pre-tuned avx512 schedule available on tophub on a Core i9-7940X. I assume this schedule was tuned on a Xeon, and I wondered whether the same schedule is equally good on a Core i9 chip.

This is the result I got:

resnet-50 : 9.54 ms
vgg-16: 25.35 ms

I can't compare with MKL-DNN at the moment, but this result looks very good. I'm looking forward to testing on more networks (esp. densenet-121) and graph tuner. @kevinthesun

@kevinthesun (Contributor, Author):

@masahi Thank you for testing. Applying pre-tuned schedules to similar hardware is an interesting topic, and we might want to investigate this area more. I'll update the graph tuner PR and upload more pre-tuned schedules soon.

@pengzhao-intel commented Oct 24, 2018:

@masahi what help do you need to test MKL-DNN?

At the model level, you can use MXNet with the subgraph feature to benchmark; resnet50 and vgg-16 are available :)
https://github.com/apache/incubator-mxnet/blob/master/MKLDNN_README.md#6
https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/benchmark_score.py

At the primitive level, you can use benchdnn to test:
https://github.com/intel/mkl-dnn/tree/master/tests/benchdnn

```
./benchdnn --conv --mode=PC -v1 --mb=128 --dir=FWD_B --cfg=f32 g1ic1280oc1152_ih8oh8kh1sh1dh0ph0_iw8ow8kw1sw1dw0pw0

unfused convs:
1280x320:  total perf: min(ms):1.08325 avg(ms):1.12108
1280x384:  total perf: min(ms):1.24365 avg(ms):1.36292
1280x448:  total perf: min(ms):1.50342 avg(ms):1.60265
sum:       min:3.83032 avg:4.08665

fused conv (320 + 384 + 448):
1280x1152: total perf: min(ms):3.75757 avg(ms):5.03734
```

@masahi (Member) commented Oct 24, 2018:

@pengzhao-intel Unfortunately, my avx-512-capable machine is Windows-only. As I mentioned in apache/mxnet#12891, MXNet + MKL-DNN performance on Windows is poor at the moment, due to the old MKL-DNN submodule.

I'm waiting for the submodule update (which I noticed just popped up in apache/mxnet#12953).

@pengzhao-intel commented Oct 24, 2018:

Got it. The update is in that PR :) But we don't test performance on Windows very much :(
If there's an issue, feel free to let me know.

FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018
* AutoTVM for x86 conv2d

* Add ApplyGraphBest dispatch context

* Fix tutorial

* Fix conv2d

* Improve tutorial

* Fix default schedule

* Fix 1x1 default schedule loading

* Fix workload type

* Change gridsearch to random

* Add reference to autotvm arm

* Merge conv2d common and 1x1 decl

* Fix lint

* Minor fix
@kevinthesun deleted the AutoTVMx86 branch on May 28, 2019