Description
This issue came up for this PR to add TRT BYOC support: #6395 (comment)
Example failed CI run: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-6395/30/pipeline
It seems that enabling TRT BYOC codegen (set(USE_TENSORRT_CODEGEN ON)) exposed an unrelated bug, found by apps/bundle_deploy/bundle_deploy.py during tests/scripts/task_cpp_unittest.sh. The Python program segfaults when run. We believe the issue is not with this test itself; it just happens to be the first thing that runs a TVM Python session and quits after building TVM.
I reproduced the error inside of GDB to get the backtrace.
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x6) at malloc.c:2958
2958 malloc.c: No such file or directory.
(gdb) bt
#0 __GI___libc_free (mem=0x6) at malloc.c:2958
#1 0x00007fffde4937f4 in dmlc::parameter::FieldAccessEntry::~FieldAccessEntry() () from /workspace/build/libtvm.so
#2 0x00007fff9702a4af in dmlc::parameter::FieldEntry<std::string>::~FieldEntry() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#3 0x00007fff97037267 in dmlc::parameter::ParamManager::~ParamManager() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#4 0x00007ffff6cd7008 in __run_exit_handlers (status=0, listp=0x7ffff70615f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#5 0x00007ffff6cd7055 in __GI_exit (status=<optimized out>) at exit.c:104
#6 0x00007ffff6cbd847 in __libc_start_main (main=0x4d1cb0 <main>, argc=5, argv=0x7fffffffe858, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe848) at ../csu/libc-start.c:325
#7 0x00000000005e8569 in _start ()
TVM's setup.py requires a minimum XGBoost version of 1.1.0, but I noticed that the CI docker image only has 1.0.2. I tried 1.1.0 and 1.2.0 and found that both fixed the issue.
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
Segmentation fault (core dumped)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3
Python 3.6.10 (default, Dec 19 2019, 23:04:32)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
>>> xgboost.__version__
'1.0.2'
>>> exit()
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.1.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.2.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
The issue looks similar to this one: https://discuss.xgboost.ai/t/segfault-during-code-cleanup/1365/6
I have encountered this exact error when using TVM with xgboost or Treelite in the same program when their dmlc-core commits do not match up. It looks like dmlc-core should maybe check for nullptr before deleting the entries?
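To check whether a given Python session has both libraries mapped (as in the backtrace above, where frames come from both libtvm.so and libxgboost.so), something like the following sketch may help. It is only a diagnostic under assumptions: it is Linux-only because it reads /proc/self/maps, the helper name `loaded_libraries` is made up for illustration, and finding both libraries does not by itself prove that their dmlc-core copies are mismatched.

```python
import os


def loaded_libraries(maps_text, needles=("libtvm", "libxgboost")):
    """Return sorted paths of mapped files whose file name contains a needle.

    `maps_text` is expected to be in the format of /proc/<pid>/maps:
    address perms offset dev inode [pathname]
    """
    found = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        if any(needle in name for needle in needles):
            found.add(path)
    return sorted(found)


if __name__ == "__main__":
    maps_path = "/proc/self/maps"  # Linux only
    if os.path.exists(maps_path):
        with open(maps_path) as f:
            for path in loaded_libraries(f.read()):
                print(path)
```

To be useful, tvm and xgboost would need to be imported in the same process before the maps file is read, for example by pasting the function into the failing session.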
So it looks like we can fix this by upgrading the xgboost version in the CI containers, which would also make it consistent with the minimum version of 1.1.0 in setup.py. However, dmlc-core itself seems to have some buggy behavior that this upgrade won't completely fix.
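As a stopgap, a small check along these lines could flag an environment whose xgboost is older than the minimum mentioned above. This is a rough sketch, not TVM's actual mechanism: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing only looks at the leading numeric components of the version string.

```python
import re

MIN_XGBOOST = "1.1.0"  # minimum version this issue attributes to TVM's setup.py


def version_tuple(version, width=3):
    """Turn '1.0.2' into (1, 0, 2); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad with zeros so that '1.1' compares equal to '1.1.0'.
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_XGBOOST):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import xgboost
    except ImportError:
        print("xgboost is not installed")
    else:
        ok = meets_minimum(xgboost.__version__)
        status = "OK" if ok else "older than " + MIN_XGBOOST
        print("xgboost {}: {}".format(xgboost.__version__, status))
```

With the versions tried in this issue, `meets_minimum("1.0.2")` is false while `meets_minimum("1.1.0")` and `meets_minimum("1.2.0")` are true.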