
Training stops unexpectedly #6

@minhdc

Hi, I prepared the data for training insightface as described in your guide. However, when training started, the following error occurred:

```
$ CUDA_VISIBLE_DEVICES='' python -u train.py --network r100 --loss arcface --dataset ktnv
use cpu
prefix ./models/r100-arcface-ktnv/model
image_size [112, 112]
num_classes 37
Called with argument: Namespace(batch_size=128, ckpt=3, ctx_num=1, dataset='ktnv', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='r100', per_batch_size=128, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'E', 'net_multiplier': 1.0, 'val_targets': ['lfw'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'fresnet', 'num_layers': 100, 'dataset': 'ktnv', 'dataset_path': '../datasets/ktnv', 'num_classes': 37, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'r100', 'num_workers': 1, 'batch_size': 128, 'per_batch_size': 128}
0 1 E 3 prelu False
Network FLOPs: 24.2G
INFO:root:loading recordio ../datasets/ktnv/train.rec...
header0 label [38. 42.]
id2range 4
37
rand_mirror True
loading bin 0
(456, 3, 112, 112)
ver lfw
lr_steps [100000, 160000, 220000]
call reset()
/home/extreme45nm/anaconda3/lib/python3.7/site-packages/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  optimizer_params=optimizer_params)
Traceback (most recent call last):
  File "train.py", line 379, in <module>
    main()
  File "train.py", line 376, in main
    train_net(args)
  File "train.py", line 370, in train_net
    epoch_end_callback = epoch_cb )
  File "/home/extreme45nm/anaconda3/lib/python3.7/site-packages/mxnet/module/base_module.py", line 520, in fit
    next_data_batch = next(data_iter)
  File "/home/extreme45nm/anaconda3/lib/python3.7/site-packages/mxnet/io/io.py", line 230, in next
    return self.next()
  File "/home/extreme45nm/anaconda3/lib/python3.7/site-packages/mxnet/io/io.py", line 476, in next
    raise StopIteration
StopIteration
```

I am stuck at this point; please give me some hints on how to solve this.
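The values printed while loading `train.rec` can be cross-checked with a small sketch. It assumes (this is not confirmed in the issue itself) that insightface's record iterator stores image records at keys `1 .. label[0]-1` and identity records at keys `label[0] .. label[1]-1`, as described by record 0's header, and that it drops an incomplete final batch. `summarize_header0` is a hypothetical helper written for illustration, not part of insightface or MXNet.

```python
# Hypothetical diagnostic helper (not part of insightface or MXNet).
# Assumption: record 0's header.label is (first identity key, end of identity keys);
# image records occupy keys 1 .. label[0]-1, identity records label[0] .. label[1]-1.
# Assumption: the training iterator drops a final batch smaller than batch_size.

def summarize_header0(label0: float, label1: float, batch_size: int) -> dict:
    """Count images, identities and full batches implied by header0's label."""
    first_identity, identity_end = int(label0), int(label1)
    num_images = first_identity - 1              # image keys start at 1
    num_identities = identity_end - first_identity
    full_batches = num_images // batch_size      # incomplete last batch is dropped
    return {
        "num_images": num_images,
        "num_identities": num_identities,
        "full_batches": full_batches,
    }


if __name__ == "__main__":
    # Values from the log: "header0 label [38. 42.]" and batch_size=128.
    info = summarize_header0(38.0, 42.0, batch_size=128)
    print(info)
    if info["full_batches"] == 0:
        print("fewer images than one batch: the first next() call may raise StopIteration")
```

Under those assumptions, `header0 label [38. 42.]` works out to 37 images and 4 identities, which matches the logged `37` and `id2range 4`, so a `batch_size` of 128 would yield no complete batch at all. The logged `num_classes 37` also equals the image count rather than the identity count, which may be worth double-checking.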
