[CI] Segfault in ci-gpu image related to xgboost version/dmlc-core #6673

@trevor-m

Description

This issue came up in the PR that adds TRT BYOC support: #6395 (comment)
Example failed CI run: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-6395/30/pipeline

It seems that enabling TRT BYOC codegen (set(USE_TENSORRT_CODEGEN ON)) exposed an unrelated bug, which is hit by apps/bundle_deploy/bundle_deploy.py during tests/scripts/task_cpp_unittest.sh. The Python program segfaults when run. We believe the issue is not with this test itself; it just happens to be the first thing that starts a TVM Python session and exits after TVM is built.

I reproduced the error inside of GDB to get the backtrace.

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x6) at malloc.c:2958
2958    malloc.c: No such file or directory.
(gdb) bt
#0  __GI___libc_free (mem=0x6) at malloc.c:2958
#1  0x00007fffde4937f4 in dmlc::parameter::FieldAccessEntry::~FieldAccessEntry() () from /workspace/build/libtvm.so
#2  0x00007fff9702a4af in dmlc::parameter::FieldEntry<std::string>::~FieldEntry() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#3  0x00007fff97037267 in dmlc::parameter::ParamManager::~ParamManager() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#4  0x00007ffff6cd7008 in __run_exit_handlers (status=0, listp=0x7ffff70615f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#5  0x00007ffff6cd7055 in __GI_exit (status=<optimized out>) at exit.c:104
#6  0x00007ffff6cbd847 in __libc_start_main (main=0x4d1cb0 <main>, argc=5, argv=0x7fffffffe858, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe848) at ../csu/libc-start.c:325
#7  0x00000000005e8569 in _start ()

TVM's setup.py requires a minimum XGBoost version of 1.1.0, but I noticed that the CI docker image only has 1.0.2. I tried 1.1.0 and 1.2.0 and found that both fix the issue.

ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
Segmentation fault (core dumped)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3
Python 3.6.10 (default, Dec 19 2019, 23:04:32) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
>>> xgboost.__version__
'1.0.2'
>>> exit()
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.1.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test                                                                                                                                                                                                   
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.2.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test                                                                                                                                                                                                   
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)

The issue looks similar to this one: https://discuss.xgboost.ai/t/segfault-during-code-cleanup/1365/6
I have encountered this exact error when using TVM together with xgboost or Treelite in the same program when their dmlc-core commits do not match. Perhaps dmlc-core should check for nullptr before deleting the entries?
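To check whether a given Python process has more than one of these libraries loaded at once, something like the following could help. It is only a rough diagnostic sketch, assuming Linux (it reads /proc/self/maps); the helper names and the "libtreelite" name fragment are my own assumptions, while "libtvm" and "libxgboost" come from the backtrace above.

```python
from typing import Iterable, List

# Name fragments to look for. "libtvm" and "libxgboost" appear in the
# backtrace; "libtreelite" is an assumed file name for Treelite.
SUSPECT_LIBS = ("libtvm", "libxgboost", "libtreelite")


def loaded_suspects(maps_lines: Iterable[str], names=SUSPECT_LIBS) -> List[str]:
    """Return the sorted, unique mapped file paths whose basename contains
    one of `names`. `maps_lines` uses the /proc/<pid>/maps line format:
    address perms offset dev inode pathname
    """
    found = set()
    for line in maps_lines:
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no pathname column
        path = parts[5].strip()
        basename = path.rsplit("/", 1)[-1]
        if any(name in basename for name in names):
            found.add(path)
    return sorted(found)


def loaded_suspects_in_this_process() -> List[str]:
    """Linux only: inspect the shared objects mapped into the current process."""
    with open("/proc/self/maps") as maps:
        return loaded_suspects(maps)
```

Calling loaded_suspects_in_this_process() after importing both tvm and xgboost should list libtvm.so and libxgboost.so if both are mapped. That only shows that two libraries which may each carry their own dmlc-core build are sharing one process; it does not prove a mismatch.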

So it looks like we can fix this by upgrading the xgboost version in the CI containers. It would be good to make the CI xgboost version consistent with the minimum version 1.1.0 in setup.py. However, dmlc-core seems to have some buggy behavior that the upgrade won't completely fix.
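As a stopgap, the CI could fail early with a clear message instead of a segfault at exit. Below is a minimal sketch of such a check, assuming the 1.1.0 minimum quoted above; the function names are hypothetical and the version parsing is deliberately simple (it ignores pre-release suffixes), so this is not a replacement for a proper version parser.

```python
import re

# Minimum xgboost version as quoted in this issue from TVM's setup.py.
MIN_XGBOOST = "1.1.0"


def version_tuple(version: str):
    """Turn the leading numeric part of a version string into a tuple of ints,
    e.g. '1.0.2' -> (1, 0, 2). A suffix such as 'rc1' is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_XGBOOST) -> bool:
    """Return True if `installed` is at least `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.1' compares equal to '1.1.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

In a CI script this could be called as meets_minimum(xgboost.__version__) right after importing xgboost, raising an error when it returns False. With the versions from the session above, 1.0.2 would be rejected while 1.1.0 and 1.2.0 would pass.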

@areusch @comaniac @zhiics @tqchen
