
Issue in train time #9

@ji-min-song


When I train the model, the loss becomes NaN after the first logged step:

python train.py --model=lmo

step: 0 total_loss: 9.5576973 obj_cls: 2.77258897 frag_cls: 4.15888262 frag_loc: 2.37503433
step: 100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.1272
step: 200 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22972
step: 300 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22798
step: 400 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.52651024
INFO:tensorflow:global_step/sec: 2.22882
step: 500 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23132
step: 600 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.38968158
INFO:tensorflow:global_step/sec: 2.22965
step: 700 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23278
step: 800 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22892
step: 900 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.19273663
INFO:tensorflow:global_step/sec: 2.22798
step: 1000 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22868
step: 1100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22729

I think this is what triggers the following error:

Caused by op 'logits/pred_frag_conf/weights_1', defined at:
File "train.py", line 559, in <module>
tf.app.run()
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 485, in main
freeze_regex_list=FLAGS.freeze_regex_list)
File "train.py", line 355, in _train_epos_model
reuse_variable=(i != 0))
File "train.py", line 267, in _tower_loss
outputs_to_num_channels)
File "train.py", line 239, in _build_epos_model
tf.summary.histogram(model_var.op.name, model_var)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
tag=tag, values=values, name=scope)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: logits/pred_frag_conf/weights_1
[[node logits/pred_frag_conf/weights_1 (defined at train.py:239) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](logits/pred_frag_conf/weights_1/tag, logits/pred_frag_conf/weights/read/_9035)]]
[[{{node xception_65/middle_flow/block1/unit_3/xception_module/separable_conv2_depthwise/BatchNorm/moving_mean/read/_9950}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1856_..._mean/read", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

What can I do to make training work?
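
To narrow down where the NaN first shows up, the training log can be scanned per loss term. The following is only a minimal sketch that assumes the log lines keep the `step: N name: value ...` format printed above; the function names (`parse_loss_log`, `first_nan_steps`, `finite_while_total_nan`) are hypothetical helpers and not part of `train.py`.

```python
import math
import re

# Matches lines such as:
#   step: 100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
STEP_RE = re.compile(r"^step:\s*(\d+)\s+(.*)$")
TERM_RE = re.compile(r"(\w+):\s*(\S+)")


def parse_loss_log(lines):
    """Yield (step, {term: value}) for every 'step: ...' line in the log."""
    for line in lines:
        match = STEP_RE.match(line.strip())
        if match is None:
            continue  # skips e.g. 'INFO:tensorflow:global_step/sec' lines
        terms = {name: float(value)
                 for name, value in TERM_RE.findall(match.group(2))}
        yield int(match.group(1)), terms


def first_nan_steps(lines):
    """Return {term: first step at which that term was logged as NaN}."""
    first = {}
    for step, terms in parse_loss_log(lines):
        for name, value in terms.items():
            if math.isnan(value) and name not in first:
                first[name] = step
    return first


def finite_while_total_nan(lines, total="total_loss"):
    """Return {term: [steps]} where a term was finite although the total was NaN."""
    result = {}
    for step, terms in parse_loss_log(lines):
        if not math.isnan(terms.get(total, 0.0)):
            continue
        for name, value in terms.items():
            if name != total and not math.isnan(value):
                result.setdefault(name, []).append(step)
    return result
```

Calling `first_nan_steps(open("train.log"))` on a saved copy of the output (the file name is an assumption) reports the first NaN step for each term, and `finite_while_total_nan` lists the steps where a single term, such as `frag_loc` at step 400 above, was still finite.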
