How to solve the problem of "summary" errors during training? #99

@sunruina2

epoch 0, total_step 90180, total loss is 15.06 , inference loss is 8.29, weight deacy loss is 6.77, training accuracy is 0.312500, time 124.800 samples/sec
epoch 0, total_step 90200, total loss is 15.34 , inference loss is 8.57, weight deacy loss is 6.77, training accuracy is 0.343750, time 132.006 samples/sec
epoch 0, total_step 90220, total loss is 14.04 , inference loss is 7.27, weight deacy loss is 6.77, training accuracy is 0.328125, time 123.523 samples/sec
epoch 0, total_step 90240, total loss is 17.67 , inference loss is 10.90, weight deacy loss is 6.77, training accuracy is 0.281250, time 130.974 samples/sec
epoch 0, total_step 90260, total loss is nan , inference loss is nan, weight deacy loss is nan, training accuracy is 0.000000, time 128.621 samples/sec
epoch 0, total_step 90280, total loss is nan , inference loss is nan, weight deacy loss is nan, training accuracy is 0.000000, time 133.669 samples/sec
Traceback (most recent call last):
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1
[[{{node resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1}} = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1/tag, resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma/read/_1409)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_nets.py", line 210, in <module>
summary_op_val = sess.run(summary_op, feed_dict=feed_dict)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1
[[node resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1 (defined at train_nets.py:161) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1/tag, resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma/read/_1409)]]

Caused by op 'resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1', defined at:
File "train_nets.py", line 161, in <module>
summaries.append(tf.summary.histogram(var.op.name, var))
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
tag=tag, values=values, name=scope)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/data/sunruina/anaconda2/envs/py36ten12/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1
[[node resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1 (defined at train_nets.py:161) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma_1/tag, resnet_v1_50/block3/unit_13/bottleneck_v1/conv2_bn/BatchNorm/gamma/read/_1409)]]
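
Reading the log above: the losses are already nan at step 90260, and the histogram summary only fails later because the variable it reads (a BatchNorm gamma) contains NaN by then. So the summary op looks like the place where the problem is reported, not where it starts. A minimal sketch of a guard that checks the fetched loss values before the summary is run might look like the following. The helper names `should_write_summary` and `first_bad_step` are hypothetical and are not part of train_nets.py or TensorFlow; the numbers are copied from the log.

```python
import math


def should_write_summary(loss_values):
    """Return True only if every fetched loss value is a finite number."""
    return all(math.isfinite(v) for v in loss_values)


def first_bad_step(history):
    """Return the first step whose losses contain NaN or inf, else None.

    `history` is a list of (step, total_loss, inference_loss, wd_loss)
    tuples, in the order they were logged.
    """
    for step, *losses in history:
        if not should_write_summary(losses):
            return step
    return None


# Values copied from the training log above (hypothetical container).
history = [
    (90240, 17.67, 10.90, 6.77),
    (90260, float("nan"), float("nan"), float("nan")),
    (90280, float("nan"), float("nan"), float("nan")),
]

bad_step = first_bad_step(history)
print("first non-finite step:", bad_step)
```

In the training loop, a check like this could be applied to the loss values returned by the training `sess.run` call, so that the `sess.run(summary_op, feed_dict=feed_dict)` call is skipped and training is stopped or restored from an earlier checkpoint once a loss is no longer finite. This only avoids the crash; the divergence itself (for example a learning rate that is too high, or bad input data) would still need to be tracked down.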
