AutoTVM x86 #1772
Conversation
return None

import ast
kernel_size = ast.literal_eval(attrs["kernel_size"])

# Set number of threads used for tuning based on the number of
# physical cpu cores on your machine.
num_threads = 1
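For context, these tutorials typically apply num_threads through an environment variable before tuning. A minimal sketch of the usual pattern, assuming the TVM_NUM_THREADS variable discussed later in this thread:

```python
import os

# Pin the number of threads TVM's runtime uses in each measurement process.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
```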
'early_stopping': None,

'measure_option': autotvm.measure_option(
    builder=autotvm.LocalBuilder(n_parallel=1),
Can we use larger n_parallel?
The doc string says that if n_parallel is None, it will use all CPU cores. Does that mean that if we have 18 cores, 18 jobs will be launched in parallel, with each job using one CPU core? The number of cores used by each process can also be controlled by TVM_NUM_THREADS.
Measurement = Build + Run.
Build jobs can be launched in parallel; run jobs are executed sequentially.
So it is safe to remove n_parallel=1 here.
The loop will then compile 32 ConfigEntities in parallel and run them sequentially.
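A minimal sketch of the suggested change, assuming the tutorial's local tuning setup (the runner parameters below are illustrative, not taken from the PR):

```python
from tvm import autotvm

# Dropping n_parallel lets LocalBuilder compile candidate configs on all
# CPU cores; LocalRunner still measures the built configs one at a time.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=4),
)
```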
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

#tune_and_evaluate(tuning_option)
It's better to add a sample output here
Some other points
Graph tuner supports general schedule templates. The key point is to define how to generate layout transformations given a schedule template. For now I just added a built-in layout transform generator for NCHWc, but it is not hard to add support for other formats. I agree that we should eventually tune different algorithms; the key part is still managing the possible layout transforms between given schedules, which might come from different algorithms. In terms of problem formulation, I would treat it as more of a solver problem. For a PBQP solver, we may want to formulate it as a graph with cost matrices. For DP or RL methods, a states/actions representation might be more suitable. The current design is that the base class converts records from autotvm and manages the fundamental data structures; each specific solver then uses these data structures to formulate the problem in a way suitable for that solver. What do you think about this approach? In the current graph tuner implementation, there is some information specific to the DP solver in the base class. I'll refactor and move it under DPExecutor.
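A hypothetical sketch of that split (class and method names here are illustrative, not the actual graph tuner API):

```python
class BaseGraphTuner:
    """Solver-agnostic part: loads autotvm records and manages the
    fundamental data structures shared by all solvers."""

    def __init__(self, records):
        # Per-node candidate schedules parsed from autotvm records.
        self.node_candidates = self._load_records(records)
        # Possible layout transforms between neighboring nodes' schedules.
        self.layout_transforms = self._gen_layout_transforms()

    def _load_records(self, records):
        return {}  # placeholder: {node: [candidate configs]}

    def _gen_layout_transforms(self):
        return {}  # placeholder: {(node, node): [transform costs]}


class DPExecutor(BaseGraphTuner):
    def run(self):
        # DP formulation: states are (node, schedule) pairs; transitions
        # are weighted by layout-transform cost.
        pass


class PBQPExecutor(BaseGraphTuner):
    def run(self):
        # PBQP formulation: a graph with per-node cost vectors and
        # per-edge cost matrices.
        pass
```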
@merrymercy https://github.com/dmlc/tvm/blob/master/topi/tests/python/test_topi_conv2d_nchw.py#L79 It looks like this test creates the x86 conv2d declaration but gets the arm conv2d schedule from TopHub.
Yes, I will fix this.
tuning_option = {
    'log_filename': log_file,
    'tuner': 'gridsearch',
Usually it takes several hours to tune a model on an x86 machine, and grid search gives us the best possible schedules. I think it's worth getting the best possible performance here.
Can we switch to random? It will also cover the search space without much overhead.
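A minimal sketch of what that switch would look like, assuming a create_tuner helper in the style of the autotvm tutorials (the helper itself is hypothetical):

```python
from tvm.autotvm.tuner import GridSearchTuner, RandomTuner

def create_tuner(task, tuner_name):
    # RandomTuner samples the config space uniformly, so it covers the
    # space without grid search's exhaustive-enumeration cost.
    if tuner_name == 'gridsearch':
        return GridSearchTuner(task)
    return RandomTuner(task)
```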
Comments addressed.
# instead of plain conv2d.
#
# We will use local mode for tuning configuration. RPC tracker
# mode can be setup similarly to the approach in autotvm
"setup similar as that in ..." and add :ref: for the tutorial?
strides = strides if isinstance(strides, (tuple, list)) else (strides, strides)
if layout == 'NCHW':
    _create_schedule_template(cfg, data, kernel, strides, padding, layout)
    if cfg.is_fallback:
Just trying to learn: where does this is_fallback come from?
https://github.com/dmlc/tvm/blob/master/python/tvm/autotvm/task/space.py#L909 This is an attribute of ConfigSpace
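A minimal sketch of the usual fallback pattern in autotvm templates (the default-filling helper below is hypothetical):

```python
def _fill_default_schedule(cfg, workload):
    # Hypothetical helper: populate cfg with reasonable default knobs.
    pass

def declaration_conv(cfg, workload):
    # autotvm hands the template a fallback config (cfg.is_fallback == True)
    # when no tuned record matches the workload, so the template can fill
    # in defaults instead of relying on tuned values.
    if cfg.is_fallback:
        _fill_default_schedule(cfg, workload)
```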
Looks good to me. @merrymercy could you review again?
global_dict_key = workload
dispatch_ctx.update_global_dict(global_dict_key, cfg)

if is_kernel_1x1:
If the kernel is 1x1, these two layouts are the same in memory.
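An illustrative check of that claim (not from the PR): dropping the unit kh/kw axes of a 1x1 kernel leaves its bytes in memory unchanged.

```python
import numpy as np

kernel_oihw = np.arange(64 * 3, dtype="float32").reshape(64, 3, 1, 1)
kernel_oi = kernel_oihw.reshape(64, 3)  # eliminate the unit axes
assert kernel_oihw.tobytes() == kernel_oi.tobytes()  # identical memory layout
```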
_, _, kh, kw = get_const_tuple(kernel.shape)
is_kernel_1x1 = kh == 1 and kw == 1
return conv2d_avx_1x1._declaration_conv(*args) if is_kernel_1x1 else \
    conv2d_avx_common._declaration_conv(*args)
It seems these two declarations are the same. If the length of a reduction axis is 1, TVM can eliminate that axis.
I think the declarations of 1x1 and common are the same. Is it easy to merge them into a single template? Although this may enlarge the search space and make tuning take longer.
Agree with @merrymercy that the compute declarations are the same and can be merged. Since the schedules are slightly different, I guess we still need two templates?
Most parts of the declarations are the same except the kernel shape, so the computes are slightly different. For templates, I prefer keeping two; this makes auto-tuning easier.
If we can build a search space that covers both cases, that would be better.
If we use a single template, certain parameters won't be used in either the common or the 1x1 case. I feel this might cause a bit of confusion. Also, I have already completed tuning for Gluon CV models on AVX-512 and AVX2 CPUs. How difficult would it be to directly transfer those records to new templates?
We can keep two different schedules, but the compute declarations are equivalent: they only differ in kernel dimensions whose lengths are one, so their layouts in memory are the same.
Thanks @kevinthesun @merrymercy @eqy
I am afraid that after this PR, users will see too many warning messages ("Cannot find a config for workload xxx. A fallback configuration is used...") when they try CPU demos. One quick fix is to upload some tuned results for common networks to TopHub and add an entry like 'llvm': "v0.01" to the TopHub package version list. Then TVM will download pre-tuned configs when compiling models.
I think adding configs to TopHub for x86 is a good idea.
I did some experiments on AWS EC2 machines. Currently the only flag differentiating x86 CPU models is the instruction set: C5 and M5, which have AVX-512, can share schedules, while C4 and M4 with AVX2 can share schedules. Later we can identify more specs and provide a guideline for developers to choose pre-tuned schedules.
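For illustration, the two instruction-set families mentioned above might be keyed by target strings like these (an assumption, not code from the PR):

```python
# Illustrative mapping from EC2 instance family to an LLVM target string.
SHARED_SCHEDULE_TARGETS = {
    "c5/m5 (AVX-512)": "llvm -mcpu=skylake-avx512",
    "c4/m4 (AVX2)": "llvm -mcpu=core-avx2",
}
```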
Now we use the model of the device in the target to distinguish hardware. For example, for NVIDIA GPU, the targets in the log contain a model field.
Model is interesting, as there are currently a few different ways of specifying the "model" of an x86 chip: uarch (skylake-avx512 vs. avx2, ...), actual model (4790K vs. 8700K vs. 8180, ...), and instances (c5 vs. m4, ...). But I think we can defer solving this problem at least for a little while, even if we have to rely on a few profile runs before execution; we can just do what cuDNN does here and try out a few schedules if the user does not know how, or does not want, to specify their exact model.
I tried the pre-tuned AVX-512 schedule available on TopHub on a Core i9-7940X. I assume this schedule was tuned on a Xeon, and I wondered if the same schedule is equally good on a Core i9 chip. This is the result I got:

resnet-50: 9.54 ms

I can't compare with MKL-DNN at the moment, but this result looks very good. I'm looking forward to testing on more networks (esp. densenet-121) and the graph tuner. @kevinthesun
@masahi Thank you for testing. Applying pre-tuned schedules to similar hardware is an interesting topic; we might want to investigate more in this area. I'll update the graph tuner PR and upload more pre-tuned schedules soon.
@masahi What help do you need to test MKL-DNN? At the model level, you can use MXNet with the subgraph feature to benchmark; ResNet-50 and VGG-16 are available :) At the primitive level, you can use benchdnn to test.
@pengzhao-intel My AVX-512-capable machine is Windows-only, unfortunately. As I mentioned in apache/mxnet#12891, MXNet + MKL-DNN performance on Windows is poor at the moment due to the old MKL-DNN submodule. I'm waiting for the submodule update (which I noticed just popped up in apache/mxnet#12953).
Got it. The update is in the PR :) But we don't test performance on Windows much :(
* AutoTVM for x86 conv2d
* Add ApplyGraphBest dispatch context
* Fix tutorial
* Fix conv2d
* Improve tutorial
* Fix default schedule
* Fix 1x1 default schedule loading
* Fix workload type
* Change gridsearch to random
* Add reference to autotvm arm
* Merge conv2d common and 1x1 decl
* Fix lint
* Minor fix
Apply an AutoTVM-style schedule to the x86 conv2d template.
Add a new dispatch context "ApplyGraphBest" for AutoTVM to load graph-level best schedules. This is useful for the Graph Tuner.
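A minimal sketch of how the new dispatch context would be consumed, mirroring the existing apply_history_best pattern (the top-level helper name and log filename are assumptions):

```python
from tvm import autotvm

# 'graph_best.log' would hold graph-level best records from the graph tuner.
with autotvm.apply_graph_best("graph_best.log"):
    # Compile the model here; conv2d workloads are dispatched to the
    # graph-level best configs recorded in the log.
    pass
```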
@merrymercy @eqy @yzhliu