Environment info
Operating System: Ubuntu 16.04
Compiler: GCC 5.4
Package used (Python/R/Scala/Julia): Python
Installed from source; MXNet commit hash (git rev-parse HEAD): 0418aae16c2c6a01bf2e937d6e05596ec21e9087 (also tried 8713d25, the 0.10 release)
Python version and distribution: 2.7.12 (default, Nov 19 2016, 06:48:10) [GCC 5.4.0 20160609]
Error Message:
```
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
(previous line repeated 8 times in total)
[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
(same 9-entry stack trace as above)
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x31a) [0x7f819992b5ca]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
terminate called after throwing an instance of 'dmlc::Error'
what(): [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
(same 9-entry stack trace as above)
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 6 entries:
(same 6-entry stack trace as above)
```
Minimum reproducible example
```python
import argparse
import pprint
import mxnet as mx
import numpy as np
import glob
import sys
sys.path.append('/code/mxnet/example/rcnn/')
from rcnn.logger import logger
from rcnn.config import config, default, generate_config
from rcnn.symbol import *
from rcnn.core import callback, metric
from rcnn.core.loader import AnchorLoader
from rcnn.core.module import MutableModule
from rcnn.utils.load_data import load_gt_roidb, merge_roidb, filter_roidb
from rcnn.utils.load_model import load_param
from rcnn.dataset.imdb import IMDB
import xmltodict
from PIL import Image
import cPickle
import os

classes = ['human--person', 'human--rider--bicyclist', 'human--rider--motorcyclist',
           'human--rider--other-rider', 'object--pothole', 'object--street-light', 'object--traffic-light',
           'object--traffic-sign--back', 'object--traffic-sign--front', 'object--vehicle--bicycle',
           'object--vehicle--boat', 'object--vehicle--bus', 'object--vehicle--car',
           'object--vehicle--caravan', 'object--vehicle--motorcycle', 'object--vehicle--on-rails',
           'object--vehicle--other-vehicle', 'object--vehicle--trailer', 'object--vehicle--truck',
           'object--vehicle--wheeled-slow']


class mapillary(IMDB):
    def __init__(self, classes, image_set='training', root_path='./', data_path='./'):
        super(mapillary, self).__init__('mapillary', image_set, root_path, data_path)
        self.root_path = root_path
        self.image_set = image_set
        self.data_path = data_path
        self.classes = ['void'] + classes
        self.num_classes = len(self.classes)
        self.image_files = glob.glob(data_path + image_set + '/images/*')
        self.num_images = len(self.image_files)
        label_files = glob.glob(data_path + 'pre-processed-for-training/pascal_ssd/' + image_set + '/*')
        self.label_files = {}
        for lbl in label_files:
            self.label_files[os.path.splitext(os.path.basename(lbl))[0]] = lbl
        self.image_set_index = self.load_image_set_index()

    def load_image_set_index(self):
        """
        find out which indexes correspond to given image set (train or val)
        :return:
        """
        image_set_index = range(0, len(self.image_files))
        return image_set_index

    def image_path_from_index(self, index):
        """
        given image index, find out full path
        :param index: index of a specific image
        :return: full path of this image
        """
        image_file = self.image_files[index]
        assert os.path.exists(image_file), 'Path does not exist: {}'.format(image_file)
        return image_file

    def gt_roidb(self):
        """
        return ground truth image regions database
        :return: imdb[image_index]['boxes', 'gt_classes', 'gt_overlaps', 'flipped']
        """
        cache_file = os.path.join(self.cache_path, self.name + '_gt_roidb.pkl')
        if os.path.exists(cache_file):
            with open(cache_file, 'rb') as fid:
                roidb = cPickle.load(fid)
            logger.info('%s gt roidb loaded from %s' % (self.name, cache_file))
            return roidb
        gt_roidb = [self.load_pascal_annotation(index) for index in self.image_set_index]
        with open(cache_file, 'wb') as fid:
            cPickle.dump(gt_roidb, fid, cPickle.HIGHEST_PROTOCOL)
        logger.info('%s wrote gt roidb to %s' % (self.name, cache_file))
        return gt_roidb

    def load_pascal_annotation(self, image_index):
        image_path = self.image_files[image_index]
        name = os.path.splitext(os.path.basename(self.image_files[image_index]))[0]
        import xml.etree.ElementTree as ET
        roi_rec = dict()
        roi_rec['image'] = image_path
        im = Image.open(image_path)
        width, height = im.size
        roi_rec['height'] = height
        roi_rec['width'] = width
        tree = ET.parse(self.label_files[name])
        objs = tree.findall('object')
        num_objs = len(objs)
        boxes = np.zeros((num_objs, 4), dtype=np.uint16)
        gt_classes = np.zeros((num_objs), dtype=np.int32)
        overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)
        class_to_index = dict(zip(self.classes, range(self.num_classes)))
        # Load object bounding boxes into a data frame.
        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            x1 = float(bbox.find('xmin').text) - 1
            y1 = float(bbox.find('ymin').text) - 1
            x2 = float(bbox.find('xmax').text) - 1
            y2 = float(bbox.find('ymax').text) - 1
            cls = class_to_index[obj.find('name').text.lower().strip()]
            boxes[ix, :] = [x1, y1, x2, y2]
            gt_classes[ix] = cls
            overlaps[ix, cls] = 1.0
        roi_rec.update({'boxes': boxes,
                        'gt_classes': gt_classes,
                        'gt_overlaps': overlaps,
                        'max_classes': overlaps.argmax(axis=1),
                        'max_overlaps': overlaps.max(axis=1),
                        'flipped': False})
        return roi_rec


config.TRAIN.BATCH_IMAGES = 1
config.TRAIN.BATCH_ROIS = 128
config.TRAIN.END2END = True
config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = True
ctx = [mx.gpu(int(i)) for i in range(8)]
network = default.network
default.pretrained = '/mnt/network_data/mxnet/models/vgg16'
import time
date = time.strftime("%Y-%m-%d")
if not os.path.exists(date):
    os.makedirs(date)
prefix = date + '/rcnn-' + network
print(prefix)
lr = 0.001
lr_step = '5'
sym = eval('get_' + network + '_train')(num_classes=config.NUM_CLASSES, num_anchors=config.NUM_ANCHORS)
feat_sym = sym.get_internals()['rpn_cls_score_output']
batch_size = len(ctx)
input_batch_size = config.TRAIN.BATCH_IMAGES * batch_size
logger.info(pprint.pformat(config))
image_sets = [mapillary(classes), mapillary(classes, 'validation')]
roidbs = [image_set.gt_roidb() for image_set in image_sets]
roidb = merge_roidb(roidbs)
roidb = filter_roidb(roidb)
train_data = AnchorLoader(feat_sym, roidb, batch_size=input_batch_size, shuffle=True,
                          ctx=ctx, work_load_list=None,
                          feat_stride=config.RPN_FEAT_STRIDE, anchor_scales=config.ANCHOR_SCALES,
                          anchor_ratios=config.ANCHOR_RATIOS, aspect_grouping=config.TRAIN.ASPECT_GROUPING)
max_data_shape = [('data', (input_batch_size, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]
max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
max_data_shape.append(('gt_boxes', (input_batch_size, 100, 5)))
logger.info('providing maximum shape %s %s' % (max_data_shape, max_label_shape))
data_shape_dict = dict(train_data.provide_data + train_data.provide_label)
arg_shape, out_shape, aux_shape = sym.infer_shape(**data_shape_dict)
arg_shape_dict = dict(zip(sym.list_arguments(), arg_shape))
out_shape_dict = dict(zip(sym.list_outputs(), out_shape))
aux_shape_dict = dict(zip(sym.list_auxiliary_states(), aux_shape))
logger.info('output shape %s' % pprint.pformat(out_shape_dict))
begin_epoch = 0
end_epoch = default.e2e_epoch
arg_params, aux_params = load_param(default.pretrained, default.pretrained_epoch, convert=True)
arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_conv_3x3_weight'])
arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_conv_3x3_bias'])
arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_cls_score_weight'])
arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_cls_score_bias'])
arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_bbox_pred_weight'])
arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_bbox_pred_bias'])
arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['cls_score_weight'])
arg_params['cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['cls_score_bias'])
arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.001, shape=arg_shape_dict['bbox_pred_weight'])
arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['bbox_pred_bias'])
for k in sym.list_arguments():
    if k in data_shape_dict:
        continue
    assert k in arg_params, k + ' not initialized'
    assert arg_params[k].shape == arg_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(arg_shape_dict[k]) + ' provided ' + str(arg_params[k].shape)
for k in sym.list_auxiliary_states():
    assert k in aux_params, k + ' not initialized'
    assert aux_params[k].shape == aux_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(aux_shape_dict[k]) + ' provided ' + str(aux_params[k].shape)
fixed_param_prefix = config.FIXED_PARAMS
data_names = [k[0] for k in train_data.provide_data]
label_names = [k[0] for k in train_data.provide_label]
mod = MutableModule(sym, data_names=data_names, label_names=label_names,
                    logger=logger, context=ctx, work_load_list=None,
                    max_data_shapes=max_data_shape, max_label_shapes=max_label_shape,
                    fixed_param_prefix=fixed_param_prefix)
rpn_eval_metric = metric.RPNAccMetric()
rpn_cls_metric = metric.RPNLogLossMetric()
rpn_bbox_metric = metric.RPNL1LossMetric()
eval_metric = metric.RCNNAccMetric()
cls_metric = metric.RCNNLogLossMetric()
bbox_metric = metric.RCNNL1LossMetric()
eval_metrics = mx.metric.CompositeEvalMetric()
for child_metric in [rpn_eval_metric, rpn_cls_metric, rpn_bbox_metric, eval_metric, cls_metric, bbox_metric]:
    eval_metrics.add(child_metric)
batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=default.frequent)
means = np.tile(np.array(config.TRAIN.BBOX_MEANS), config.NUM_CLASSES)
stds = np.tile(np.array(config.TRAIN.BBOX_STDS), config.NUM_CLASSES)
epoch_end_callback = callback.do_checkpoint(prefix, means, stds)
base_lr = lr
lr_factor = 0.1
lr_epoch = [int(epoch) for epoch in lr_step.split(',')]
lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
logger.info('lr %f lr_epoch_diff %s lr_iters %s' % (lr, lr_epoch_diff, lr_iters))
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(lr_iters, lr_factor)
# optimizer
optimizer_params = {'momentum': 0.9,
                    'wd': 0.0005,
                    'learning_rate': lr,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': (1.0 / batch_size),
                    'clip_gradient': 5}
# train
mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
        batch_end_callback=batch_end_callback, kvstore=default.kvstore,
        optimizer='sgd', optimizer_params=optimizer_params,
        arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
```
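Note on the Bucketing warnings: the script pads `gt_boxes` to a fixed 100 objects, but some images in the dataset carry more (123 in the log), which forces the executor to re-allocate. A minimal sketch of deriving the pad from the dataset instead; the `roidb` records below are hypothetical stand-ins for the real annotation records:

```python
# Derive the gt_boxes padding from the largest annotation in the dataset,
# so the default bucket already covers the worst case and the executor
# never needs to re-allocate.
def max_gt_boxes(roidb):
    # each roidb record stores one box per ground-truth object
    return max(len(rec['boxes']) for rec in roidb)

# stand-in records with 7, 123 and 42 boxes respectively
roidb = [{'boxes': [[0, 0, 1, 1]] * n} for n in (7, 123, 42)]
input_batch_size = 1

pad = max_gt_boxes(roidb)
max_gt_shape = ('gt_boxes', (input_batch_size, pad, 5))
print(max_gt_shape)  # ('gt_boxes', (1, 123, 5))
```

This removes the re-allocation warnings, though it is unlikely to be the cause of the kernel-launch failure itself.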
What have you tried to solve it?
Rebuilt from the newest git pull.
Rebuilt from 8713d25 (the 0.10 release).
Changed mshadow::cuda::kMaxThreadsPerBlock to 256.
- This instead throws an error on MapRedKeepLowestKernel, because that kernel is still launched with 1024 threads, which fails the CheckLaunchParam check.
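For context on why shrinking kMaxThreadsPerBlock was worth trying (and why it only moved the failure): CUDA error 7 (too many resources requested for launch) is typically raised when a block's total resource demand, e.g. threads per block times registers per thread, exceeds what one SM can supply. A back-of-envelope sketch with assumed numbers; the 65536-registers-per-block budget and the per-thread register counts below are illustrative, as the real figures depend on the device and on how the kernel was compiled:

```python
# Rough feasibility check for a kernel launch: the block fits only if its
# total register demand stays within the per-block register budget.
def launch_fits(threads_per_block, regs_per_thread, regs_per_block=65536):
    return threads_per_block * regs_per_thread <= regs_per_block

print(launch_fits(1024, 64))   # True: 1024 * 64 is exactly the assumed budget
print(launch_fits(1024, 65))   # False: 1024 * 65 registers overflows it
print(launch_fits(256, 65))    # True: fewer threads per block fits again
```

This is consistent with what was observed: lowering the block size can rescue one kernel while another that hard-codes a 1024-thread launch still fails its launch-parameter check.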
Environment info
Operating System:
Ubuntu 16.04
Compiler:
GCC 5.4
Package used (Python/R/Scala/Julia):
Python
Or if installed from source:
MXNet commit hash (
git rev-parse HEAD):0418aae16c2c6a01bf2e937d6e05596ec21e9087
8713d25 (0.10 release)
Python version and distribution:
2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]
Error Message:
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x31a) [0x7f819992b5ca]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
terminate called after throwing an instance of 'dmlc::Error'
what(): [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x31a) [0x7f819992b5ca]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]
Minimum reproducible example
```python
import argparse
import pprint
import mxnet as mx
import numpy as np
import glob
import sys
sys.path.append('/code/mxnet/example/rcnn/')
from rcnn.logger import logger
from rcnn.config import config, default, generate_config
from rcnn.symbol import *
from rcnn.core import callback, metric
from rcnn.core.loader import AnchorLoader
from rcnn.core.module import MutableModule
from rcnn.utils.load_data import load_gt_roidb, merge_roidb, filter_roidb
from rcnn.utils.load_model import load_param
from rcnn.dataset.imdb import IMDB
import xmltodict
from PIL import Image
import cPickle
import os

classes = ['human--person', 'human--rider--bicyclist', 'human--rider--motorcyclist',
           'human--rider--other-rider', 'object--pothole', 'object--street-light', 'object--traffic-light',
           'object--traffic-sign--back', 'object--traffic-sign--front', 'object--vehicle--bicycle',
           'object--vehicle--boat', 'object--vehicle--bus', 'object--vehicle--car',
           'object--vehicle--caravan', 'object--vehicle--motorcycle', 'object--vehicle--on-rails',
           'object--vehicle--other-vehicle', 'object--vehicle--trailer', 'object--vehicle--truck',
           'object--vehicle--wheeled-slow']


class mapillary(IMDB):
    def __init__(self, classes, image_set='training', root_path='./', data_path='./'):
        super(mapillary, self).__init__('mapillary', image_set, root_path, data_path)
        self.root_path = root_path
        self.image_set = image_set
        self.data_path = data_path
        self.classes = ['void'] + classes
        self.num_classes = len(self.classes)
        self.image_files = glob.glob(data_path + image_set + '/images/*')
        self.num_images = len(self.image_files)
        label_files = glob.glob(data_path + 'pre-processed-for-training/pascal_ssd/' + image_set + '/*')
        self.label_files = {}
        for lbl in label_files:
            self.label_files[os.path.splitext(os.path.basename(lbl))[0]] = lbl
        self.image_set_index = self.load_image_set_index()

config.TRAIN.BATCH_IMAGES = 1
config.TRAIN.BATCH_ROIS = 128
config.TRAIN.END2END = True
config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = True
ctx = [mx.gpu(int(i)) for i in range(8)]
network = default.network
default.pretrained = '/mnt/network_data/mxnet/models/vgg16'
import time
date = time.strftime("%Y-%m-%d")
if not os.path.exists(date):
    os.makedirs(date)
prefix = date + '/rcnn-' + network
print(prefix)
lr = 0.001
lr_step = '5'
sym = eval('get_' + network + '_train')(num_classes=config.NUM_CLASSES, num_anchors=config.NUM_ANCHORS)
feat_sym = sym.get_internals()['rpn_cls_score_output']
batch_size = len(ctx)
input_batch_size = config.TRAIN.BATCH_IMAGES * batch_size
logger.info(pprint.pformat(config))
image_sets = [mapillary(classes), mapillary(classes, 'validation')]
roidbs = [image_set.gt_roidb() for image_set in image_sets]
roidb = merge_roidb(roidbs)
roidb = filter_roidb(roidb)
train_data = AnchorLoader(feat_sym, roidb, batch_size=input_batch_size, shuffle=True,
                          ctx=ctx, work_load_list=None,
                          feat_stride=config.RPN_FEAT_STRIDE, anchor_scales=config.ANCHOR_SCALES,
                          anchor_ratios=config.ANCHOR_RATIOS, aspect_grouping=config.TRAIN.ASPECT_GROUPING)
max_data_shape = [('data', (input_batch_size, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]
max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
max_data_shape.append(('gt_boxes', (input_batch_size, 100, 5)))
logger.info('providing maximum shape %s %s' % (max_data_shape, max_label_shape))
data_shape_dict = dict(train_data.provide_data + train_data.provide_label)
arg_shape, out_shape, aux_shape = sym.infer_shape(**data_shape_dict)
arg_shape_dict = dict(zip(sym.list_arguments(), arg_shape))
out_shape_dict = dict(zip(sym.list_outputs(), out_shape))
aux_shape_dict = dict(zip(sym.list_auxiliary_states(), aux_shape))
logger.info('output shape %s' % pprint.pformat(out_shape_dict))
begin_epoch = 0
end_epoch = default.e2e_epoch
arg_params, aux_params = load_param(default.pretrained, default.pretrained_epoch, convert=True)
arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_conv_3x3_weight'])
arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_conv_3x3_bias'])
arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_cls_score_weight'])
arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_cls_score_bias'])
arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_bbox_pred_weight'])
arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_bbox_pred_bias'])
arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['cls_score_weight'])
arg_params['cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['cls_score_bias'])
arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.001, shape=arg_shape_dict['bbox_pred_weight'])
arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['bbox_pred_bias'])
for k in sym.list_arguments():
    if k in data_shape_dict:
        continue
    assert k in arg_params, k + ' not initialized'
    assert arg_params[k].shape == arg_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(arg_shape_dict[k]) + ' provided ' + str(arg_params[k].shape)
for k in sym.list_auxiliary_states():
    assert k in aux_params, k + ' not initialized'
    assert aux_params[k].shape == aux_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(aux_shape_dict[k]) + ' provided ' + str(aux_params[k].shape)
fixed_param_prefix = config.FIXED_PARAMS
data_names = [k[0] for k in train_data.provide_data]
label_names = [k[0] for k in train_data.provide_label]
mod = MutableModule(sym, data_names=data_names, label_names=label_names,
                    logger=logger, context=ctx, work_load_list=None,
                    max_data_shapes=max_data_shape, max_label_shapes=max_label_shape,
                    fixed_param_prefix=fixed_param_prefix)
rpn_eval_metric = metric.RPNAccMetric()
rpn_cls_metric = metric.RPNLogLossMetric()
rpn_bbox_metric = metric.RPNL1LossMetric()
eval_metric = metric.RCNNAccMetric()
cls_metric = metric.RCNNLogLossMetric()
bbox_metric = metric.RCNNL1LossMetric()
eval_metrics = mx.metric.CompositeEvalMetric()
for child_metric in [rpn_eval_metric, rpn_cls_metric, rpn_bbox_metric, eval_metric, cls_metric, bbox_metric]:
    eval_metrics.add(child_metric)
batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=default.frequent)
means = np.tile(np.array(config.TRAIN.BBOX_MEANS), config.NUM_CLASSES)
stds = np.tile(np.array(config.TRAIN.BBOX_STDS), config.NUM_CLASSES)
epoch_end_callback = callback.do_checkpoint(prefix, means, stds)
base_lr = lr
lr_factor = 0.1
lr_epoch = [int(epoch) for epoch in lr_step.split(',')]
lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
logger.info('lr %f lr_epoch_diff %s lr_iters %s' % (lr, lr_epoch_diff, lr_iters))
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(lr_iters, lr_factor)
# optimizer
optimizer_params = {'momentum': 0.9,
                    'wd': 0.0005,
                    'learning_rate': lr,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': (1.0 / batch_size),
                    'clip_gradient': 5}
# train
mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
        batch_end_callback=batch_end_callback, kvstore=default.kvstore,
        optimizer='sgd', optimizer_params=optimizer_params,
        arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
```
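For reference, the bucketing warnings in the log trace back to the hard-coded `('gt_boxes', (input_batch_size, 100, 5))` cap above: any image with more than 100 ground-truth boxes (123 in the log) exceeds the pre-allocated buffer and forces a re-allocation. One possible workaround is to derive the cap from the data instead of hard-coding 100; this is only a sketch, `max_gt_boxes_shape` is a hypothetical helper, and it assumes each roidb entry exposes its annotations under a `'boxes'` key:

```python
# Hypothetical helper: size the gt_boxes buffer for the busiest image in the
# roidb, so bucketing never needs to re-allocate mid-training.
def max_gt_boxes_shape(roidb, batch_images, box_dim=5):
    most_boxes = max(len(entry['boxes']) for entry in roidb)
    return ('gt_boxes', (batch_images, most_boxes, box_dim))

# Toy roidb standing in for the output of load_gt_roidb/merge_roidb:
toy_roidb = [{'boxes': [[0, 0, 10, 10]] * 123},
             {'boxes': [[0, 0, 10, 10]] * 40}]
print(max_gt_boxes_shape(toy_roidb, batch_images=1))  # -> ('gt_boxes', (1, 123, 5))
```

That would silence the warnings, though it should be unrelated to the fatal engine error itself.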
What have you tried to solve it?
Rebuilt from the newest git pull.
Rebuilt from 8713d25 (0.10 release).
Changed mshadow::cuda::kMaxThreadsPerBlock to 256.