This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Pascal TitanX too many resources requested for launch #6775

@dtmoodie

Description

Environment info

Operating System:
Ubuntu 16.04

Compiler:
GCC 5.4

Package used (Python/R/Scala/Julia):
Python

Or if installed from source:

MXNet commit hash (git rev-parse HEAD):
0418aae16c2c6a01bf2e937d6e05596ec21e9087
8713d25 (0.10 release)

Python version and distribution:
2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]

Error Message:

[19:20:36] /code/mxnet/src/executor/graph_executor.cc:558: Bucketing: data gt_boxes has a shape (1,123,5), which is larger than already allocated shape (1,100,5). Need to re-allocate. Consider putting default bucket key to be the bucket taking the largest input for better memory sharing.
(the bucketing warning above is repeated eight times, once per executor)
[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]

[19:20:36] /code/mxnet/dmlc-core/include/dmlc/logging.h:304: [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
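
The NaiveEngine suggestion in the message above amounts to setting the variable before mxnet is first imported, since the engine is chosen once at library startup. A minimal sketch (the mxnet import is commented out here; uncomment it in a real debugging session):

```python
import os

# Must be set before `import mxnet` so the engine type is picked up at startup.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

# import mxnet as mx   # all operations now run synchronously; attach gdb here
# ... reproduce the failure to get a meaningful backtrace ...

# Unset afterwards so the default threaded engine is restored on the next run:
del os.environ['MXNET_ENGINE_TYPE']
```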

Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x31a) [0x7f819992b5ca]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]

terminate called after throwing an instance of 'dmlc::Error'
what(): [19:20:36] /code/mxnet/src/engine/./threaded_engine.h:329: [19:20:36] /code/mxnet/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (7 vs. 0) Name: MapPlanKernel ErrStr:too many resources requested for launch

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN7mshadow4cuda7MapPlanINS_2sv6plustoENS_6TensorINS_3gpuELi2EfEENS_4expr14Broadcast1DExpINS4_IS5_Li1EfEEfLi2ELi1EEEfEEvNS7_4PlanIT0_T2_EERKNSB_IT1_SD_EENS_5ShapeILi2EEEP11CUstream_st+0x1bc) [0x7f819a61351c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(ZN5mxnet2op16FullyConnectedOpIN7mshadow3gpuEfE7ForwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EERKS8_INS_9OpReqTypeESaISE_EESD_SD+0x972) [0x7f819a614062]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(+0x6f7c19) [0x7f8199949c19]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x87) [0x7f819992b337]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f81998b95dc]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x31a) [0x7f819992b5ca]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.10.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x78) [0x7f819992fab8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f81eb1d2c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f81ef2326ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f81eef6882d]

Minimum reproducible example

```python

import argparse
import pprint
import mxnet as mx
import numpy as np
import glob
import sys
sys.path.append('/code/mxnet/example/rcnn/')
from rcnn.logger import logger
from rcnn.config import config, default, generate_config
from rcnn.symbol import *
from rcnn.core import callback, metric
from rcnn.core.loader import AnchorLoader
from rcnn.core.module import MutableModule
from rcnn.utils.load_data import load_gt_roidb, merge_roidb, filter_roidb
from rcnn.utils.load_model import load_param
from rcnn.dataset.imdb import IMDB
import xmltodict
from PIL import Image
import cPickle
import os
classes = ['human--person', 'human--rider--bicyclist', 'human--rider--motorcyclist',
'human--rider--other-rider', 'object--pothole', 'object--street-light', 'object--traffic-light',
'object--traffic-sign--back', 'object--traffic-sign--front', 'object--vehicle--bicycle',
'object--vehicle--boat', 'object--vehicle--bus', 'object--vehicle--car',
'object--vehicle--caravan', 'object--vehicle--motorcycle', 'object--vehicle--on-rails',
'object--vehicle--other-vehicle', 'object--vehicle--trailer', 'object--vehicle--truck',
'object--vehicle--wheeled-slow']

class mapillary(IMDB):
    def __init__(self, classes, image_set='training', root_path='./', data_path='./'):
        super(mapillary, self).__init__('mapillary', image_set, root_path, data_path)
        self.root_path = root_path
        self.image_set = image_set
        self.data_path = data_path
        self.classes = ['void'] + classes
        self.num_classes = len(self.classes)
        self.image_files = glob.glob(data_path + image_set + '/images/*')
        self.num_images = len(self.image_files)
        label_files = glob.glob(data_path + 'pre-processed-for-training/pascal_ssd/' + image_set + '/*')
        self.label_files = {}
        for lbl in label_files:
            self.label_files[os.path.splitext(os.path.basename(lbl))[0]] = lbl
        self.image_set_index = self.load_image_set_index()

    def load_image_set_index(self):
        """
        find out which indexes correspond to given image set (train or val)
        :return:
        """
        image_set_index = range(0, len(self.image_files))
        return image_set_index

    def image_path_from_index(self, index):
        """
        given image index, find out full path
        :param index: index of a specific image
        :return: full path of this image
        """
        image_file = self.image_files[index]
        assert os.path.exists(image_file), 'Path does not exist: {}'.format(image_file)
        return image_file

    def gt_roidb(self):
        """
        return ground truth image regions database
        :return: imdb[image_index]['boxes', 'gt_classes', 'gt_overlaps', 'flipped']
        """
        cache_file = os.path.join(self.cache_path, self.name + '_gt_roidb.pkl')
        if os.path.exists(cache_file):
            with open(cache_file, 'rb') as fid:
                roidb = cPickle.load(fid)
            logger.info('%s gt roidb loaded from %s' % (self.name, cache_file))
            return roidb

        gt_roidb = [self.load_pascal_annotation(index) for index in self.image_set_index]
        with open(cache_file, 'wb') as fid:
            cPickle.dump(gt_roidb, fid, cPickle.HIGHEST_PROTOCOL)
        logger.info('%s wrote gt roidb to %s' % (self.name, cache_file))

        return gt_roidb

    def load_pascal_annotation(self, image_index):
        image_path = self.image_files[image_index]
        name = os.path.splitext(os.path.basename(self.image_files[image_index]))[0]
        import xml.etree.ElementTree as ET
        roi_rec = dict()
        roi_rec['image'] = image_path
        im = Image.open(image_path)
        width, height = im.size
        roi_rec['height'] = height
        roi_rec['width'] = width

        tree = ET.parse(self.label_files[name])
        objs = tree.findall('object')

        num_objs = len(objs)

        boxes = np.zeros((num_objs, 4), dtype=np.uint16)
        gt_classes = np.zeros((num_objs), dtype=np.int32)
        overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)

        class_to_index = dict(zip(self.classes, range(self.num_classes)))
        # Load object bounding boxes into a data frame.
        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            x1 = float(bbox.find('xmin').text) - 1
            y1 = float(bbox.find('ymin').text) - 1
            x2 = float(bbox.find('xmax').text) - 1
            y2 = float(bbox.find('ymax').text) - 1
            cls = class_to_index[obj.find('name').text.lower().strip()]
            boxes[ix, :] = [x1, y1, x2, y2]
            gt_classes[ix] = cls
            overlaps[ix, cls] = 1.0

        roi_rec.update({'boxes': boxes,
                        'gt_classes': gt_classes,
                        'gt_overlaps': overlaps,
                        'max_classes': overlaps.argmax(axis=1),
                        'max_overlaps': overlaps.max(axis=1),
                        'flipped': False})
        return roi_rec

config.TRAIN.BATCH_IMAGES = 1
config.TRAIN.BATCH_ROIS = 128
config.TRAIN.END2END = True
config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = True
ctx = [mx.gpu(int(i)) for i in range(8)]

network = default.network
default.pretrained = '/mnt/network_data/mxnet/models/vgg16'
import time
date = time.strftime("%Y-%m-%d")

if not os.path.exists(date):
    os.makedirs(date)
prefix = date + '/rcnn-' + network
print(prefix)
lr = 0.001
lr_step = '5'

sym = eval('get_' + network + '_train')(num_classes=config.NUM_CLASSES, num_anchors=config.NUM_ANCHORS)
feat_sym = sym.get_internals()['rpn_cls_score_output']

batch_size = len(ctx)
input_batch_size = config.TRAIN.BATCH_IMAGES * batch_size

logger.info(pprint.pformat(config))

image_sets = [mapillary(classes), mapillary(classes, 'validation')]

roidbs = [image_set.gt_roidb() for image_set in image_sets]
roidb = merge_roidb(roidbs)
roidb = filter_roidb(roidb)

train_data = AnchorLoader(feat_sym, roidb, batch_size=input_batch_size, shuffle=True,
                          ctx=ctx, work_load_list=None,
                          feat_stride=config.RPN_FEAT_STRIDE, anchor_scales=config.ANCHOR_SCALES,
                          anchor_ratios=config.ANCHOR_RATIOS, aspect_grouping=config.TRAIN.ASPECT_GROUPING)

max_data_shape = [('data', (input_batch_size, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]
max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
max_data_shape.append(('gt_boxes', (input_batch_size, 100, 5)))
logger.info('providing maximum shape %s %s' % (max_data_shape, max_label_shape))

data_shape_dict = dict(train_data.provide_data + train_data.provide_label)
arg_shape, out_shape, aux_shape = sym.infer_shape(**data_shape_dict)
arg_shape_dict = dict(zip(sym.list_arguments(), arg_shape))
out_shape_dict = dict(zip(sym.list_outputs(), out_shape))
aux_shape_dict = dict(zip(sym.list_auxiliary_states(), aux_shape))
logger.info('output shape %s' % pprint.pformat(out_shape_dict))

begin_epoch = 0
end_epoch = default.e2e_epoch
arg_params, aux_params = load_param(default.pretrained, default.pretrained_epoch, convert=True)
arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_conv_3x3_weight'])
arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_conv_3x3_bias'])
arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_cls_score_weight'])
arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_cls_score_bias'])
arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['rpn_bbox_pred_weight'])
arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['rpn_bbox_pred_bias'])
arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=arg_shape_dict['cls_score_weight'])
arg_params['cls_score_bias'] = mx.nd.zeros(shape=arg_shape_dict['cls_score_bias'])
arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.001, shape=arg_shape_dict['bbox_pred_weight'])
arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=arg_shape_dict['bbox_pred_bias'])

for k in sym.list_arguments():
    if k in data_shape_dict:
        continue
    assert k in arg_params, k + ' not initialized'
    assert arg_params[k].shape == arg_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(arg_shape_dict[k]) + ' provided ' + str(arg_params[k].shape)
for k in sym.list_auxiliary_states():
    assert k in aux_params, k + ' not initialized'
    assert aux_params[k].shape == aux_shape_dict[k], \
        'shape inconsistent for ' + k + ' inferred ' + str(aux_shape_dict[k]) + ' provided ' + str(aux_params[k].shape)

fixed_param_prefix = config.FIXED_PARAMS
data_names = [k[0] for k in train_data.provide_data]
label_names = [k[0] for k in train_data.provide_label]
mod = MutableModule(sym, data_names=data_names, label_names=label_names,
                    logger=logger, context=ctx, work_load_list=None,
                    max_data_shapes=max_data_shape, max_label_shapes=max_label_shape,
                    fixed_param_prefix=fixed_param_prefix)

rpn_eval_metric = metric.RPNAccMetric()
rpn_cls_metric = metric.RPNLogLossMetric()
rpn_bbox_metric = metric.RPNL1LossMetric()
eval_metric = metric.RCNNAccMetric()
cls_metric = metric.RCNNLogLossMetric()
bbox_metric = metric.RCNNL1LossMetric()
eval_metrics = mx.metric.CompositeEvalMetric()
for child_metric in [rpn_eval_metric, rpn_cls_metric, rpn_bbox_metric, eval_metric, cls_metric, bbox_metric]:
    eval_metrics.add(child_metric)

batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=default.frequent)
means = np.tile(np.array(config.TRAIN.BBOX_MEANS), config.NUM_CLASSES)
stds = np.tile(np.array(config.TRAIN.BBOX_STDS), config.NUM_CLASSES)
epoch_end_callback = callback.do_checkpoint(prefix, means, stds)

base_lr = lr
lr_factor = 0.1
lr_epoch = [int(epoch) for epoch in lr_step.split(',')]
lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
logger.info('lr %f lr_epoch_diff %s lr_iters %s' % (lr, lr_epoch_diff, lr_iters))
lr_scheduler = mx.lr_scheduler.MultiFactorScheduler(lr_iters, lr_factor)

# optimizer

optimizer_params = {'momentum': 0.9,
                    'wd': 0.0005,
                    'learning_rate': lr,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': (1.0 / batch_size),
                    'clip_gradient': 5}

# train

mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
        batch_end_callback=batch_end_callback, kvstore=default.kvstore,
        optimizer='sgd', optimizer_params=optimizer_params,
        arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
```
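
As a side note, the bucketing warnings in the log come from the `('gt_boxes', (input_batch_size, 100, 5))` cap in the script above: any image with more than 100 ground-truth boxes (here 123) forces a re-allocation. One way to size that bucket from the dataset itself is sketched below; `max_gt_boxes_shape` is a hypothetical helper, not part of the rcnn example code:

```python
def max_gt_boxes_shape(roidb, batch_images, box_dim=5):
    """Size the gt_boxes bucket from the largest annotation in the roidb.

    Assumes each roidb record carries a 'boxes' array, as built in
    load_pascal_annotation above.
    """
    max_boxes = max(len(rec['boxes']) for rec in roidb)
    return ('gt_boxes', (batch_images, max_boxes, box_dim))

# Toy records standing in for real annotations (3 and 123 boxes):
toy_roidb = [{'boxes': [[0, 0, 1, 1]] * 3},
             {'boxes': [[0, 0, 1, 1]] * 123}]
print(max_gt_boxes_shape(toy_roidb, 1))  # ('gt_boxes', (1, 123, 5))
```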

What have you tried to solve it?

Rebuild from newest git pull.
Rebuild from 8713d25 (0.10 release)
Changed mshadow::cuda::kMaxThreadsPerBlock to 256.

  • Then throws an error in MapRedKeepLowestKernel, because it still tries to launch a kernel with 1024 threads, which fails in CheckLaunchParam.
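
For context on the workaround: cudaErrorLaunchOutOfResources (error 7) means a block's total resource demand, typically registers per thread times block size, exceeds what one SM can supply, so shrinking the thread cap trades occupancy for a launchable configuration. A rough sketch of the arithmetic (function names and the 80-register figure are illustrative assumptions, not mshadow's actual code; 65536 is the per-block register limit on Pascal):

```python
def launch_config(n_elems, max_threads_per_block):
    """1-D grid sizing in the style of mshadow's MapPlanKernel launch."""
    threads = min(max_threads_per_block, n_elems)
    blocks = (n_elems + threads - 1) // threads  # ceil division
    return blocks, threads

def fits(threads, regs_per_thread, regs_per_block=65536):
    """A block launches only if its total register demand fits the budget."""
    return threads * regs_per_thread <= regs_per_block

# A register-hungry kernel (say 80 regs/thread) cannot launch 1024-thread
# blocks (80 * 1024 = 81920 > 65536) but can launch 256-thread ones:
print(launch_config(5000, 1024), fits(1024, 80))  # (5, 1024) False
print(launch_config(5000, 256), fits(256, 80))    # (20, 256) True
```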
