
Conversation

@ganler (Contributor) commented Nov 24, 2021

Environment & Flags

  • Model: [screenshot of the generated model]

  • Target: llvm
  • TVM git hash: 0cb6337
  • Compiler: clang version 13.0.0
  • uname -a: Linux ise-manjaro 5.10.70-1-MANJARO #1 SMP PREEMPT Thu Sep 30 15:29:01 UTC 2021 x86_64 GNU/Linux

Bug Description

When compiling this model, TVM performs some type (shape) inference related to the NCHWc layout. During NCHWc->NCHW shape inference, the expression c + C is evaluated.

[screenshot of the NCHWc shape-inference code]

However, the model's shape axis values are int64, while some other values are initialized as int32 by default. This data type mismatch causes TVM to fail in ExprMutator, since binary operators require matched operand types.

[screenshot of the failing dtype check]
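
To make the failure concrete, here is a minimal sketch (illustrative names, not code from the PR) that trips the same dtype check the traceback below ends in:

import tvm
from tvm import tir

c = tir.Var("c", "int64")   # shape axis carried over from the model: int64
C = tir.const(0, "int32")   # a default-initialized constant: int32

# The raw Add constructor does not cast its operands; it requires
# a.dtype == b.dtype and raises "mismatched types. int64 vs. int32".
expr = tir.Add(c, C)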

Fix

Initialize those artificial constants as int64 instead of int32 (a bare PrimExpr(0) is treated as int32).
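
As a quick illustration from the Python side (hedged; the actual change is in the C++ shape-inference code), integer constants default to int32 unless a dtype is given explicitly:

import tvm
from tvm import tir

print(tir.const(0).dtype)           # int32 -- the implicit default
print(tir.const(0, "int64").dtype)  # int64 -- explicit dtype matching the shape axes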

Full failure log:
> INIT:: # Edge in this DSO: 1244181; # Edge total: 1244181
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Traceback (most recent call last):
  File "nnsmith/backend_executor.py", line 60, in <module>
    run_backend_same_proc(args.model, args.input, bknd)
  File "/home/ganler/Documents/nnsmith/nnsmith/difftest.py", line 62, in run_backend_same_proc
    outputs = backend.predict(model_path, inputs)
  File "/home/ganler/Documents/nnsmith/nnsmith/backends/tvm_graph.py", line 74, in predict
    self.load_model(model)
  File "/home/ganler/Documents/nnsmith/nnsmith/backends/tvm_graph.py", line 68, in load_model
    executor = relay.build_module.create_executor(
  File "/home/ganler/Documents/tvm/python/tvm/relay/backend/interpreter.py", line 171, in evaluate
    return self._make_executor()
  File "/home/ganler/Documents/tvm/python/tvm/relay/build_module.py", line 591, in _make_executor
    mod = build(self.mod, target=self.target)
  File "/home/ganler/Documents/tvm/python/tvm/relay/build_module.py", line 449, in build
    graph_json, runtime_mod, params = bld_mod.build(
  File "/home/ganler/Documents/tvm/python/tvm/relay/build_module.py", line 189, in build
    self._build(mod, target, target_host, executor, runtime, mod_name)
  File "/home/ganler/Documents/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  16: TVMFuncCall
  15: _ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValue
  14: tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  13: tvm::relay::backend::RelayBuildModule::Build(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, tvm::Target const&, tvm::relay::Executor const&, tvm::relay::Runtime const&, tvm::runtime::String)
  12: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, tvm::runtime::String const&)
  11: tvm::relay::backend::RelayBuildModule::OptimizeImpl(tvm::IRModule)
  10: tvm::transform::Pass::operator()(tvm::IRModule) const
  9: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  8: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  7: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  6: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  5: tvm::transform::Pass::operator()(tvm::IRModule) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::InferType()::$_1>(tvm::relay::transform::InferType()::$_1)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  1: tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)
  0: tvm::relay::TypeSolver::Solve()
  28: TVMFuncCall
  27: _ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValue
  26: tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
  25: tvm::relay::backend::RelayBuildModule::Build(tvm::IRModule, tvm::runtime::Map<tvm::Integer, tvm::Target, void, void> const&, tvm::Target const&, tvm::relay::Executor const&, tvm::relay::Runtime const&, tvm::runtime::String)
  24: tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::IRModule, tvm::runtime::String const&)
  23: tvm::relay::backend::RelayBuildModule::OptimizeImpl(tvm::IRModule)
  22: tvm::transform::Pass::operator()(tvm::IRModule) const
  21: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  20: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  19: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  18: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  17: tvm::transform::Pass::operator()(tvm::IRModule) const
  16: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  15: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  14: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::InferType()::$_1>(tvm::relay::transform::InferType()::$_1)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  13: tvm::relay::TypeInferencer::Infer(tvm::GlobalVar, tvm::relay::Function)
  12: tvm::relay::TypeSolver::Solve()
  11: tvm::TypedEnvFunc<bool (tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>::operator()(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&) const
  10: tvm::runtime::TypedPackedFunc<bool (tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>::AssignTypedLambda<bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)>(bool (*)(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  9: bool tvm::relay::Conv2DWinogradRel<tvm::relay::Conv2DAttrs>(tvm::runtime::Array<tvm::Type, void> const&, int, tvm::Attrs const&, tvm::TypeReporter const&)
  8: tvm::tir::BijectiveLayout::ForwardShape(tvm::runtime::Array<tvm::PrimExpr, void> const&) const
  7: tvm::tir::TransformShape(tvm::runtime::Array<tvm::PrimExpr, void> const&, tvm::runtime::Array<tvm::tir::IterVar, void> const&, tvm::runtime::Array<tvm::tir::IterVar, void> const&, tvm::runtime::Array<tvm::PrimExpr, void> const&)
  6: tvm::PrimExpr tvm::tir::Substitute<tvm::PrimExpr>(tvm::PrimExpr, std::unordered_map<tvm::tir::VarNode const*, tvm::PrimExpr, std::hash<tvm::tir::VarNode const*>, std::equal_to<tvm::tir::VarNode const*>, std::allocator<std::pair<tvm::tir::VarNode const* const, tvm::PrimExpr> > > const&)
  5: tvm::tir::Substitute(tvm::PrimExpr, std::function<tvm::runtime::Optional<tvm::PrimExpr> (tvm::tir::Var const&)>)
  4: non-virtual thunk to tvm::tir::StmtExprMutator::VisitExpr(tvm::PrimExpr const&)
  3: tvm::NodeFunctor<tvm::PrimExpr (tvm::runtime::ObjectRef const&, tvm::tir::ExprFunctor<tvm::PrimExpr (tvm::PrimExpr const&)>*)>::operator()(tvm::runtime::ObjectRef const&, tvm::tir::ExprFunctor<tvm::PrimExpr (tvm::PrimExpr const&)>*) const
  2: _ZZN3tvm3tir11ExprFunctorIFNS_8PrimExprERKS2_EE10I
  1: tvm::tir::ExprMutator::VisitExpr_(tvm::tir::AddNode const*)
  0: tvm::tir::Add::Add(tvm::PrimExpr, tvm::PrimExpr, tvm::Span)
  File "/home/ganler/Documents/tvm/src/relay/analysis/type_solver.cc", line 622
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (false) is false: [16:20:42] /home/ganler/Documents/tvm/src/tir/ir/expr.cc:226: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (a.dtype() == b.dtype()) is false: TypeError: mismatched types. int64 vs. int32

cc: @YuchenJin @junrushao1994

@Mousius (Member) left a comment

Hi @ganler,

Could you please construct a test case representing this case?

@ganler (Contributor, Author) commented Nov 27, 2021

It is a bit strange that this change seems to influence the meta scheduler's results (according to CI). I am not sure whether those test oracles are related to correctness.

@junrushao (Member) commented

TVM has a pretty fragile system of using i32 vs i64; I personally experienced it a few times before...

The meta schedule unittest is about extracting tasks from Relay. If it fails, it's probably because the number of graph partitions in the Relay lowering pipeline comes out wrong.

@ganler ganler changed the title [BugFix] integer mismatch in type inference by lifting constant int32 to int64 [BugFix] resolve integer 32. ~ 64. mismatch by casting Nov 27, 2021
@ganler (Contributor, Author) commented Nov 27, 2021

Sorry, I think my prior idea was bad. I came up with a more compatible fix that simply matches key-value types in IR substitution.

Thanks @junrushao1994 for the suggestions.

@ganler (Contributor, Author) commented Nov 27, 2021

> TVM has a pretty fragile system of using i32 vs i64; I personally experienced it a few times before...

In most cases I think it's fine: when we call a + b, we are actually calling add (in op.cc), which simplifies the expression and does the necessary casting. Here the i32 and i64 "fight with each other" because of a type mismatch in IR substitution.
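
For example (a quick illustration, not code from the PR), the Python operator routes through the promoting add():

import tvm
from tvm import tir

a = tir.Var("a", "int64")
b = tir.const(1, "int32")
print((a + b).dtype)  # int64 -- operator + calls add() in op.cc, which casts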

That said, in IR substitution it is ill-formed to have vmap[int32_var] = int64_var. So we can do vmap[key] = cast<key.dtype>(value) during substitution-map construction, as sketched below. I am not sure if there are other cases like this.
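
A minimal Python sketch of that idea (illustrative; the actual fix lands in the C++ substitution utility): cast each value to its key's dtype when building the map.

import tvm
from tvm import tir

x = tir.Var("x", "int64")
val = tir.const(7, "int32")  # value whose dtype differs from the key's

# Casting the value to the key's dtype keeps the map well-typed before substitution.
vmap = {x: val.astype(x.dtype)}
result = tir.stmt_functor.substitute(x + tir.const(1, "int64"), vmap)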

@ganler (Contributor, Author) commented Nov 30, 2021

> Hi @ganler,
>
> Could you please construct a test case representing this case?

@Mousius The test case has been added. :-)

@ganler (Contributor, Author) commented Dec 22, 2021

@YuchenJin @junrushao1994 Would you help review this PR if you are available? (The CI finally works...)

@YuchenJin (Contributor) commented

Hi @ganler, thanks for the work! Is this bug found by your fuzzer? :)

For the regression test case, it is currently implemented as compiling an e2e Relay model. If this bug is specific to a pass (the layout transformation pass in this case), it might be clearer and more effective to write a unit test for that pass alone.

@ganler (Contributor, Author) commented Jan 2, 2022

This bug was found using an e2e model generated by PyTorch; for simplicity I converted it to Relay. I think I have located the root cause: the implementation of expression substitution did not ensure that the types match. :-)

@ganler (Contributor, Author) commented Jan 2, 2022

I will try using only one specific pass to minimize the unit test. Thank you for the suggestion! @YuchenJin

@ganler (Contributor, Author) commented Jan 3, 2022

@YuchenJin It seems that this bug is triggered by the combination of two passes after my minimization (the de facto order in O3 optimization):

with tvm.transform.PassContext(opt_level=3):
    with tvm.target.Target('llvm'):
        mod = relay.transform.CanonicalizeOps()(mod)
        mod = relay.transform.AlterOpLayout()(mod)
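
For reference, a hedged sketch of how these two passes might be exercised in a standalone regression test (the conv2d model below is illustrative, not the PR's actual test case):

import tvm
from tvm import relay

# Illustrative model: a conv2d whose layout AlterOpLayout will rewrite on llvm.
x = relay.var("x", shape=(1, 3, 32, 32), dtype="float32")
w = relay.var("w", shape=(8, 3, 3, 3), dtype="float32")
y = relay.nn.conv2d(x, w, kernel_size=(3, 3), channels=8, padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))

with tvm.transform.PassContext(opt_level=3):
    with tvm.target.Target("llvm"):
        mod = relay.transform.CanonicalizeOps()(mod)
        mod = relay.transform.AlterOpLayout()(mod)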

@YuchenJin (Contributor) left a comment

Hi @ganler, thanks for narrowing down the cause of this bug. LGTM and thanks for digging into it!
