[MetaSchedule] Tuning API cleanup & ergonomics #12895
I like this change of decoupling compilation and tuning; the changes to default class usage also make sense. Please let me know when the PR is ready for review.
Hey @masahi, I added On the other hand, as a high-level API, I would prefer not to tweak
Yes, I agree with this. A part of the reason I didn't want to change
@masahi I updated the PR with my latest understanding of the Hexagon pipeline. Would you mind taking another look? Thanks a lot!
@junrushao I made one comment, but otherwise the Hexagon change looks good to me. It didn't occur to me before that we can do
This PR refactors the tuning APIs to improve developer ergonomics and enable new use cases.
## Introduction
**📅 Original behavior.** The original monolithic tuning API assumes that tuning is an end-to-end process that transforms an IRModule into a runtime Module. For example, the API below is designed for Relay end-to-end tuning:
```python
from tvm import meta_schedule as ms

ms.tune_relay(
    mod: IRModule,                 # The Relay program
    params: Dict[str, NDArray],    # Parameters used in the Relay program
    target: Union[str, Target],    # Compilation target
    config: TuneConfig,            # Configuration, e.g. number of trials
    work_dir: str,                 # Working directory
    ...
) -> runtime.Module: ...
```
**🤔 The challenge.** While striving to be "the" API that controls end-to-end tuning, this design ignores the fact that many users want to compile a neural network without going through the tuning process, and that MetaSchedule is capable of doing so when supplied with a pre-tuned database.
**🆕 Our refactoring.** This PR caters to those needs by splitting the monolithic API into 2 or 3 stages, depending on how it is used. Take `tune_relay` as an example: it is now refactored into 2 separate APIs, the first of which performs the (slower) tuning and returns a database, while the second takes a pre-tuned database and performs fast Relay compilation.
```python
ms.relay_integration.tune_relay(
    mod: IRModule,
    params: Dict[str, NDArray],
    target: Union[str, Target],
    work_dir: str,
    max_trials_global: int,
    ...
) -> Database: ...

ms.relay_integration.compile_relay(
    database: Database,
    mod: IRModule,
    target: Union[Target, str],
    params: Optional[Dict[str, NDArray]],
    ...
) -> runtime.Module: ...
```
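In particular, compilation no longer requires a tuning run at all. As a minimal sketch (the `JSONDatabase` file paths below are hypothetical and assume a database produced by an earlier tuning run), reusing a pre-tuned database could look like:
```python
# Sketch: compiling with a pre-tuned database, skipping the tuning stage entirely.
# The database paths are hypothetical; they assume an earlier `tune_relay` run
# wrote its results into `work_dir`.
database = ms.database.JSONDatabase(
    path_workload="work_dir/database_workload.json",
    path_tuning_record="work_dir/database_tuning_record.json",
)
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=target,
    params=params,
)
```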
## Upgrade guide
### If you are using `ms.tune_relay`
The original monolithic API was used as follows:
```python
lib = ms.tune_relay(
    mod=mod,
    target=ARGS.target,
    config=ms.TuneConfig(
        strategy="evolutionary",
        num_trials_per_iter=64,
        max_trials_per_task=ARGS.num_trials,
        max_trials_global=ARGS.num_trials,
        adaptive_training=ARGS.adaptive_training,
    ),
    runner=runner,
    work_dir=ARGS.work_dir,
    params=params,
    backend=ARGS.backend,
)
```
The new design is very similar, with 2 notable differences:
- The monolithic API is split into 2 separate APIs.
- It no longer requires a second-level configuration object, i.e. `TuneConfig`.
As a concrete example, the API above should be written as:
```python
database = ms.relay_integration.tune_relay(
    mod=mod,
    target=ARGS.target,
    work_dir=ARGS.work_dir,
    max_trials_global=ARGS.num_trials,
    num_trials_per_iter=64,
    params=params,
    runner=runner,
    strategy="evolutionary",
)
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=ARGS.target,
    params=params,
    backend=ARGS.backend,
)
```
Please refer to changes in `python/tvm/meta_schedule/testing/tune_relay.py` as a practical case.
### If you are using `ms.tune_extracted_tasks`
As a classic use case, experienced TVM users may want to extract tasks from Relay first and filter them themselves before sending them to the tuning system. This usually involves 3 APIs:
```python
from tvm import meta_schedule as ms

# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.extract_task_from_relay(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Tuning
database = ms.tune_extracted_tasks(
    extracted_tasks,
    ms.TuneConfig(...),
    work_dir=work_dir,
    num_threads=32,
    ...,
)

# API 3. Relay compilation
with database, tvm.transform.PassContext(
    opt_level=3,
    config={"relay.backend.use_meta_schedule": True},
):
    lib = relay.build(relay_mod, target=target, params=params)
```
To provide finer-grained control over the tuning system, we add an extra API that allows customizing the conversion from `ms.ExtractedTask` to `ms.TuneContext`. After this refactoring, the APIs become:
```python
# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.relay_integration.extract_tasks(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Convert `ms.ExtractedTask` to `ms.TuneContext`
tasks: List[TuneContext]
task_weights: List[float]
tasks, task_weights = ms.relay_integration.extracted_tasks_to_tune_contexts(
    extracted_tasks=extracted_tasks,
    work_dir=work_dir,
    space="post-order-apply",  # gives the flexibility to customize the per-task search space
    num_threads=32,
)

# API 3. Tuning
database = ms.tune.tune_tasks(
    tasks=tasks,
    task_weights=task_weights,
    work_dir=work_dir,
    max_trials_global=20000,
)

# API 4. Relay compilation
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=relay_mod,
    target=target,
    params=params,
    backend="graph",
)
```
Please refer to changes in `tests/python/integration/test_meta_schedule_auto_tensorize.py` as a practical case.
### Misc changes
- `blocks` in `tune_tir` is moved to `ms.space.PostOrderApply(f_block_filter=...)`.
- `adaptive_training` in `tune_{relay}/{tir}/{extracted_tasks}` is moved to `ms.cost_model.XGBModel(adaptive_training=...)`.
- `sch_rules`/`postprocs`/`mutators` in `tune_{relay}/{tir}/{extracted_tasks}` are moved to `ms.space.PostOrderApply(...)`; when unspecified, a target-specific default is used (see the sketch after this list).
- `default_config.py` is broken down into `tvm::meta_schedule::{ScheduleRule}/{Mutator}/{Postproc}::Default{LLVM}/{CPU}/{CUDA}`.
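As a rough illustration of where these relocated options now go, here is a hedged sketch (keyword names follow the bullets above and the examples later in this thread; exact spellings, e.g. `ms.space_generator` vs. `ms.space`, and defaults may differ):
```python
# Sketch: passing the relocated options explicitly. This is an illustration under
# the assumptions stated above, not the canonical form.
database = ms.relay_integration.tune_relay(
    mod=mod,
    params=params,
    target=target,
    work_dir=work_dir,
    max_trials_global=20000,
    # Search-space knobs now live on the space generator.
    space=ms.space_generator.PostOrderApply(
        f_block_filter=None,          # formerly `blocks` in `tune_tir`
        sch_rules="from-target",      # formerly `sch_rules=...`
        postprocs="from-target",      # formerly `postprocs=...`
        mutator_probs="from-target",  # formerly `mutators=...`
    ),
    # `adaptive_training` now lives on the cost model.
    cost_model=ms.cost_model.XGBModel(adaptive_training=True),
)
```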
```python
@pytest.mark.skip("Requires cascadelake")
def test_vnni_schedule_fn_tune():
```
This test is broken with the error:
```
        space=ms.space_generator.PostOrderApply(
            f_block_filter=None,
            sch_rules=None,
            postprocs=[],
            mutator_probs=None,
        ),
    )
test_meta_schedule_vnni_integration.py:213:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../python/tvm/meta_schedule/space_generator/post_order_apply.py:53: in __init__
    sch_rules, postprocs, mutator_probs = _normalize_rules(sch_rules, postprocs, mutator_probs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sch_rules = None, postprocs = [], mutator_probs = None

    def _normalize_rules(
        sch_rules: ScheduleRuleType,
        postprocs: PostprocType,
        mutator_probs: MutatorProbType,
    ) -> Tuple[
        Optional[List["ScheduleRule"]],
        Optional[List["Postproc"]],
        Optional[Dict["Mutator", float]],
    ]:
        # pylint: disable=import-outside-toplevel
        from ..mutator import Mutator
        from ..postproc import Postproc
        from ..schedule_rule import ScheduleRule
        # pylint: enable=import-outside-toplevel
>       assert sch_rules is not None
E       AssertionError
```
will send a fix
this should work:
```python
    space=ms.space_generator.PostOrderApply(
        f_block_filter=None,
        sch_rules="from-target",
        postprocs=[],
        mutator_probs="from-target",
    ),
)
```
```python
target = get_hexagon_target("v68")
database = ms.tir_integration.tune_tir(
```
Two uses of `tune_tir` in this file have incorrect signatures. I got the following errors:
```
E   TypeError: tune_tir() got an unexpected keyword argument 'sch_rules'
E   Check failed: (!checked_type.defined()) is false: Expected Map[meta_schedule.Mutator, FloatImm], but got Array
```
this should work:
```python
target = get_hexagon_target("v68")
database = ms.tir_integration.tune_tir(
    mod=workload,
    target=target,
    max_trials_global=8,
    num_trials_per_iter=8,
    max_trials_per_task=8,
    work_dir=work_dir,
    space=ms.space_generator.PostOrderApply(
        f_block_filter=None,
        sch_rules=sch_rules,
        postprocs=postprocs,
        mutator_probs={},
    ),
    builder=get_hexagon_local_builder(),
    runner=get_hexagon_rpc_runner(hexagon_launcher, number=10),
)
sch = ms.tir_integration.compile_tir(database, workload, target)
```
```python
def schedule_rule_dense_vnni(sch: Schedule, dense_block: BlockRV):
    _schedule_dense(m=None, do_tune=True)(sch, dense_block)
    return [sch]


register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
```
@junrushao @masahi or others, may I ask what the difference is between using the TE annotation as described (e.g. `attrs={"schedule_rule": "meta_schedule.dense_vnni"}`) together with a corresponding packed func defining the schedule to use, as opposed to just generating the space via
```python
space=ms.space_generator.ScheduleFn(
    _schedule_dense,
    ...
),
```
?
Is it that in this test case we allow auto-scheduling for all ops but apply special manual scheduling for certain ops (dense in this case), whereas if we use the ScheduleFn technique for generating a search space we do not allow other operators to be auto-scheduled? Thanks!
I think ScheduleFnDatabase is for a completely manual schedule, while the register_func way allows AutoTVM-style, template-based tuning. At least that's what I wanted to demonstrate before this PR, or before ScheduleFnDatabase was introduced.
In this case I'm not referring to `ScheduleFnDatabase` as used in `test_vnni_schedule_fn_database`. I'm referring to what is done in the test `test_vnni_schedule_fn_tune`, which utilizes the TE compute `schedule_rule` attr annotation along with a global packed function for the schedule that matches the annotation value `meta_schedule.dense_vnni`. I'm wondering if there is any difference or advantage between using the TE attr annotation and packed func as opposed to specifying an alternate search space with `ScheduleFn`.
Hi Chris, the ScheduleFn space generator is designed to schedule all blocks in the whole Schedule; it is not block-specific. The annotation-based packed-func scheduling only works with the PostOrderApply space generator, which essentially applies the annotated rule to that specific block and applies the default schedule rules (or the schedule rules given in the user interface) to the other, non-annotated blocks.
Therefore, creating a ScheduleFn takes more effort, and using the annotation-based scheduling is easier because you don't need to worry about scheduling the other blocks.
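To make the contrast concrete, here is a minimal, hedged sketch of the ScheduleFn route; the function body and the block name "compute" are assumptions for illustration, reusing `_schedule_dense` from the test above:
```python
# Sketch: with ScheduleFn, one function is responsible for the whole module.
# Anything it does not schedule stays unscheduled; there is no per-block fallback
# to default rules as with PostOrderApply.
def schedule_whole_module(sch: Schedule):
    dense_block = sch.get_block("compute")  # hypothetical block name; must be located manually
    _schedule_dense(m=None, do_tune=True)(sch, dense_block)
    # ...every other block would also need to be handled here
    return [sch]

space = ms.space_generator.ScheduleFn(schedule_whole_module)
```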
## Performance Numbers
The PR is tested end-to-end on a subset of representative models to guard against potential regressions:
- Performance comparison on V100 (AWS P3.2xlarge)
- Performance comparison on Intel Skylake (AWS C5.9xlarge)
In summary, no performance regression is observed after this refactoring.