This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Gluon 2.0 Dataloader should support BERT training using GluonNLP #18672

@rondogency

Description

Currently we cannot use the Gluon 2.0 DataLoader to train BERT, because it is not flexible enough to support the data schema used by GluonNLP BERT. Specifically, when a nested list of variable-length NumPy arrays is passed in, constructing the dataset fails and throws NDArray conversion errors.

Here is a minimal reproducible example, which uses a data schema similar to the one in the BERT pre-training script:

import mxnet as mx
import numpy as np
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))   # same feature for a sequence of a different length
l1 = [a, b]  # similar to one feature across all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # the script errors out before this line prints
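
The following NumPy-only sketch illustrates one plausible reason a conversion like this can fail. It rests on an assumption not confirmed in this issue: that the error comes from trying to pack arrays of different lengths into a single dense array. It does not exercise the MXNet code path, and the helper `pad_ragged` is a hypothetical name introduced here only for illustration. It shows that stacking ragged arrays raises an error, and that padding to a common length is one possible way to obtain a dense array.

```python
import numpy as np


def pad_ragged(arrays, pad_value=0.0):
    """Pad 1-D arrays to a common length and stack them into one 2-D array.

    Hypothetical helper for illustration; not part of MXNet or GluonNLP.
    """
    max_len = max(len(x) for x in arrays)
    out = np.full((len(arrays), max_len), pad_value, dtype=np.float64)
    for i, x in enumerate(arrays):
        out[i, :len(x)] = x  # copy the real values, leave the rest padded
    return out


a = np.zeros(128)  # one feature of a long sequence
b = np.ones(19)    # the same feature for a shorter sequence

# Arrays with different shapes cannot be stacked into one dense array.
try:
    np.stack([a, b])
    stacked_ok = True
except ValueError:
    stacked_ok = False

# Padding to the longest length gives a dense (2, 128) array.
padded = pad_ragged([a, b])
```

Padding changes the data that the model sees, so whether it is an acceptable workaround depends on how the training script masks or ignores the padded positions.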

References

#17841
