[CI] Segfault in ci-gpu image related to xgboost version/dmlc-core

This issue came up for this PR to add TRT BYOC support: https://github.com/apache/incubator-tvm/pull/6395#issuecomment-707363920
Example failed CI run: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-6395/30/pipeline

It seems that enabling TRT BYOC codegen (`set(USE_TENSORRT_CODEGEN ON)`) exposed an unrelated bug, which is hit by `apps/bundle_deploy/bundle_deploy.py` during `tests/scripts/task_cpp_unittest.sh`. The Python program segfaults when run. We believe the issue is not with this test itself; it just happens to be the first thing after building TVM that starts a TVM Python session and then exits.

I reproduced the error inside of GDB to get the backtrace.
```
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x6) at malloc.c:2958
2958    malloc.c: No such file or directory.
(gdb) bt
#0  __GI___libc_free (mem=0x6) at malloc.c:2958
#1  0x00007fffde4937f4 in dmlc::parameter::FieldAccessEntry::~FieldAccessEntry() () from /workspace/build/libtvm.so
#2  0x00007fff9702a4af in dmlc::parameter::FieldEntry<std::string>::~FieldEntry() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#3  0x00007fff97037267 in dmlc::parameter::ParamManager::~ParamManager() () from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
#4  0x00007ffff6cd7008 in __run_exit_handlers (status=0, listp=0x7ffff70615f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#5  0x00007ffff6cd7055 in __GI_exit (status=<optimized out>) at exit.c:104
#6  0x00007ffff6cbd847 in __libc_start_main (main=0x4d1cb0 <main>, argc=5, argv=0x7fffffffe858, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe848) at ../csu/libc-start.c:325
#7  0x00000000005e8569 in _start ()
```

Since [TVM's setup.py requires a minimum XGBoost version of 1.1.0](https://github.com/apache/incubator-tvm/blob/main/python/setup.py#L172), I checked the CI docker image and found that it only has 1.0.2. I tried 1.1.0 and 1.2.0 and found that both fix the issue.
```
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
Segmentation fault (core dumped)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3
Python 3.6.10 (default, Dec 19 2019, 23:04:32) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
>>> xgboost.__version__
'1.0.2'
>>> exit()
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.1.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test                                                                                                                                                                                                   
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user xgboost==1.2.0
ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o build --test                                                                                                                                                                                                   
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
INFO:compile_engine:Using injective.cpu for add based on highest priority (10)
```

The issue looks similar to this one: https://discuss.xgboost.ai/t/segfault-during-code-cleanup/1365/6
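I have encountered this exact error when using TVM together with XGBoost or Treelite in the same program when their dmlc-core commits do not match. The backtrace above is consistent with that: a `FieldEntry` destructor in `libxgboost.so` ends up calling the `FieldAccessEntry` destructor from `libtvm.so`. Perhaps dmlc-core should check for nullptr before deleting the entries?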

So it looks like we can fix this by upgrading the xgboost version in the CI containers. It would be good to make that version consistent with the minimum version 1.1.0 in `setup.py`. However, dmlc-core itself seems to have some buggy behavior here, which the upgrade alone won't completely fix.
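To catch this kind of drift earlier, something like the following check could run inside the CI image. This is only a rough sketch, not existing TVM tooling: the names `MIN_XGBOOST`, `parse_version`, `meets_minimum` and `installed_xgboost_version` are hypothetical, the minimum is hard-coded from the `setup.py` requirement mentioned above, and it assumes Python 3.8+ for `importlib.metadata` (the image in this report has Python 3.6, so it would need adapting there).

```python
import re
from importlib import metadata  # assumption: Python 3.8+

# Minimum version taken from the setup.py requirement cited in this issue.
MIN_XGBOOST = "1.1.0"


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_XGBOOST):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_xgboost_version():
    """Read the installed xgboost version from package metadata, or None."""
    try:
        return metadata.version("xgboost")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_xgboost_version()
    if found is None:
        print("xgboost is not installed")
    elif meets_minimum(found):
        print(f"xgboost {found} satisfies >= {MIN_XGBOOST}")
    else:
        print(f"xgboost {found} is older than the required {MIN_XGBOOST}")
```

Reading the version from package metadata rather than `import xgboost` avoids loading `libxgboost.so` into the process, so the check itself should not run the exit-time destructors seen in the backtrace. It only guards against the version mismatch; it does not address the underlying dmlc-core behavior.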

@areusch @comaniac @zhiics @tqchen 


[CI] Segfault in ci-gpu image related to xgboost version/dmlc-core #6673

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[CI] Segfault in ci-gpu image related to xgboost version/dmlc-core #6673

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions