I am trying to enable the path MXNet/GluonNLP --> ONNX --> TensorRT.
There is a bug where, if I use a pretrained BERT model, running inference with TensorRT in fp16 mode produces NaNs.
Using pretrained weights:
import mxnet as mx
import gluonnlp as nlp

# model_name, dataset, and ctx are set as in the repro script linked at the bottom
bert, _ = nlp.model.get_model(
    name=model_name,
    ctx=ctx,
    dataset_name=dataset,
    pretrained=True,   # <-- the only difference between the two snippets
    use_pooler=True,
    use_decoder=False,
    num_layers=3,      # hardcode this as 3 layers since this is what the customer uses
    use_classifier=False,
    hparam_allow_override=True)
model = bert
Not using pretrained weights:
bert, _ = nlp.model.get_model(
    name=model_name,
    ctx=ctx,
    dataset_name=dataset,
    pretrained=False,  # <-- the only difference between the two snippets
    use_pooler=True,
    use_decoder=False,
    num_layers=3,      # hardcode this as 3 layers since this is what the customer uses
    use_classifier=False,
    hparam_allow_override=True)
model = bert
model.initialize(ctx=ctx)  # random initialization is needed when pretrained=False
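For context, the MXNet --> ONNX step looks roughly like the sketch below. This is a minimal sketch, not the exact repro: the input shapes, the `bert_3l*` file names, and the `mx.onnx.export_model` call (MXNet >= 1.9; older releases expose it as `mx.contrib.onnx.export_model`) are illustrative, and the exact commands are in the repro linked at the bottom.

```python
import mxnet as mx
import numpy as np

batch_size, seq_length = 1, 32  # placeholder values; seq_length > 16 is where fp16 breaks

# Dummy inputs matching the BERTModel forward signature (inputs, token_types, valid_length)
words = mx.nd.ones((batch_size, seq_length), ctx=ctx)
token_types = mx.nd.zeros((batch_size, seq_length), ctx=ctx)
valid_length = mx.nd.full((batch_size,), seq_length, ctx=ctx)

model.hybridize(static_alloc=True)
model(words, token_types, valid_length)  # one forward pass so the graph can be exported
model.export('bert_3l')                  # writes bert_3l-symbol.json / bert_3l-0000.params

mx.onnx.export_model(
    'bert_3l-symbol.json',
    'bert_3l-0000.params',
    in_shapes=[(batch_size, seq_length), (batch_size, seq_length), (batch_size,)],
    in_types=[np.float32, np.float32, np.float32],
    onnx_file_path='bert_3l.onnx')
```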
More specifically, WITHOUT pretrained weights, TensorRT produces reasonable outputs in both fp16 mode and regular fp32 mode. However, WITH pretrained weights, TensorRT produces NaN outputs in fp16 mode, while fp32 mode seems to work fine. Furthermore, the NaN issue seems to be triggered by the size of seq_length: when seq_length <= 16, even fp16 mode produces reasonable outputs; when seq_length > 17, fp16 mode starts to produce NaNs. Batch size does not seem to affect the NaN behavior.
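For clarity, the fp16/fp32 comparison above can be reproduced with the plain ONNX-parser path in the TensorRT Python API (7.x/8.x), sketched below. This is illustrative only, not the exact repro; the `bert_3l.onnx` file name is a placeholder, and the only difference between the two builds is the FP16 flag.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, use_fp16):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return None
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # fp16 build -> NaNs with pretrained weights
    return builder.build_engine(network, config)

fp32_engine = build_engine('bert_3l.onnx', use_fp16=False)  # finite outputs
fp16_engine = build_engine('bert_3l.onnx', use_fp16=True)   # NaNs once seq_length > 16
```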
Reproducible code and steps can be found in #19746. Because we have a customer requesting this feature, it would be great if our friends at Nvidia could help look into this issue. Please let me know how I can provide further info or help.
@sandeep-krishnamurthy @MoisesHer @Kh4L @chinakook