
Caffe hangs when creating data layer #3965

Description

@royitaqi

I created the simplest possible net to learn the division ("/") function: the input is A and B, and the label is A/B. However, when I try to run the trainer, it hangs forever. If I do killall caffe, I see that it's waiting on a BlockingQueue. I searched around and found a mention (I didn't note down the source) that this can be caused by the training and testing phases sharing the same lmdb, so I copied the same data into separate training and testing folders, but the problem persists.

I'm wondering why it hangs, and how I should debug this problem.
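
In case it's relevant, here is the sanity check I can run against the LMDBs (a minimal sketch using the standard py-lmdb and protobuf APIs; adjust the database names to match whatever directories the net actually opens):

import lmdb
import caffe

# Count the records in each LMDB and parse the first one back into a
# Datum, to rule out an empty or corrupt database.
for name in ('training', 'training_label'):
    env = lmdb.open(name, readonly=True)
    with env.begin() as txn:
        print('%s: %d entries' % (name, txn.stat()['entries']))
        cur = txn.cursor()
        if cur.first():  # positions the cursor on the first record, if any
            key, value = cur.item()
            datum = caffe.proto.caffe_pb2.Datum()
            datum.ParseFromString(value)
            print('  first datum: %d x %d x %d' % (datum.channels, datum.height, datum.width))
    env.close()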

Here is the console output:

[tw-mbp-rshi playgit]$ caffe train --solver solver.prototxt
I0408 21:57:11.489527 1949106944 caffe.cpp:178] Use CPU.
I0408 21:57:11.493430 1949106944 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 2
base_lr: 0.01
display: 1
max_iter: 100
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5
snapshot_prefix: "snapshot"
solver_mode: CPU
net: "net.prototxt"
I0408 21:57:11.494869 1949106944 solver.cpp:91] Creating training net from net file: net.prototxt
I0408 21:57:11.495998 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing
I0408 21:57:11.496026 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing_label
I0408 21:57:11.496052 1949106944 net.cpp:49] Initializing net from parameters:
state {
  phase: TRAIN
}
layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}
layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}
layer {
  name: "full"
  type: "InnerProduct"
  bottom: "data"
  top: "full"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}
I0408 21:57:11.496322 1949106944 layer_factory.hpp:77] Creating layer training
I0408 21:57:11.503118 1949106944 net.cpp:91] Creating Layer training
I0408 21:57:11.503237 1949106944 net.cpp:399] training -> data
I0408 21:57:11.504497 186691584 db_lmdb.cpp:38] Opened lmdb training
*** Aborted at 1460178183 (unix time) try "date -d @1460178183" if you are using GNU date ***
PC: @     0x7fff8f110136 __psynch_cvwait
*** SIGTERM (@0x7fff8f110136) received by PID 6373 (TID 0x7fff742d0300) stack trace: ***
    @     0x7fff89d17f1a _sigtramp
    @     0x7fff5850c620 (unknown)
    @        0x10784869b boost::condition_variable::wait()
    @        0x107849687 caffe::BlockingQueue<>::peek()
    @        0x1077b6f46 caffe::DataLayer<>::DataLayerSetUp()
    @        0x1077a640e caffe::BasePrefetchingDataLayer<>::LayerSetUp()
    @        0x1078148e7 caffe::Net<>::Init()
    @        0x107813385 caffe::Net<>::Net()
    @        0x10782f090 caffe::Solver<>::InitTrainNet()
    @        0x10782e3e7 caffe::Solver<>::Init()
    @        0x10782e0de caffe::Solver<>::Solver()
    @        0x10783e8a8 caffe::SGDSolver<>::SGDSolver()
    @        0x107844182 caffe::Creator_SGDSolver<>()
    @        0x1076f3137 train()
    @        0x1076f5721 main
    @     0x7fff90c165c9 start
Terminated: 15
[tw-mbp-rshi playgit]$

Here is my solver.prototxt:

# The train/test net protocol buffer definition
net: "net.prototxt"

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 1

# Carry out testing every 2 training iterations.
test_interval: 2

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005

# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75

# Display every iteration
display: 1

# The maximum number of iterations
max_iter: 100

# snapshot intermediate results
snapshot: 5
snapshot_prefix: "snapshot"

# solver mode: CPU or GPU
solver_mode: CPU
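
To spell out the arithmetic from the MNIST comment above: each test phase covers test_iter x test batch_size samples, so with my settings a test pass only ever sees one example (a quick worked check, assuming the batch_size of 1 from my label layers applies):

# samples per test phase = test_iter * test batch_size
print(100 * 100)  # MNIST example: 10000, the full test set
print(1 * 1)      # my settings: 1 sample per test pass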

Here is my net.prototxt:

layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}

layer {
  name: "testing"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "testing"
    backend: LMDB
  }
}

layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "testing_label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "testing_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "full"
  type: "InnerProduct"
  # learning rate and decay multipliers for the weights
  param { lr_mult: 1 decay_mult: 1 }
  # learning rate and decay multipliers for the biases
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  bottom: "data"
  top: "full"
}

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}
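
For reference, the whole model is just a single linear unit under a Euclidean loss. Roughly, in NumPy terms (a sketch of what I understand the net to compute, with made-up values, not Caffe's actual code):

import numpy as np

# One sample: data = (A, B), label = A / B.
x = np.array([9.0, 7.0])       # the "data" top, shape (2,)
label = np.array([9.0 / 7.0])  # the "label" top, shape (1,)

# "full": InnerProduct with num_output: 1, i.e. pred = W.x + b.
W = np.random.randn(1, 2) * 0.01  # gaussian weight_filler, std 0.01
b = np.zeros(1)                   # constant bias_filler, value 0
pred = W.dot(x) + b

# "loss": EuclideanLoss, which I believe is 1/(2N) * sum((pred - label)**2)
# with N the batch size (here N = 1).
loss = 0.5 * np.sum((pred - label) ** 2)
print(loss)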

Here is how I generated the training and label data:

import numpy as np
import lmdb
import caffe
import random

N = 100

# Let's pretend this is interesting data
X = np.zeros((N, 2, 1, 1), dtype=np.float64)
y = np.zeros(N, dtype=np.float64)

random.seed(0)

for i in range(0, N):
    X[i,0,0,0] = random.uniform(8, 10)
    X[i,1,0,0] = random.uniform(6, 8)
    y[i] = X[i,0,0,0] / X[i,1,0,0]


with lmdb.open('training', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tobytes()  # raw float64 bytes
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())


with lmdb.open('label', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = 1
            datum.height = 1
            datum.width = 1
            datum.data = y[i].tobytes()  # raw float64 bytes
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())
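
And, continuing from the script above, a round-trip check on the labels (a sketch: it reads the first record back and decodes the float64 bytes with np.frombuffer):

with lmdb.open('label', readonly=True) as db:
    with db.begin() as txn:
        datum = caffe.proto.caffe_pb2.Datum()
        datum.ParseFromString(txn.get(b'00000000'))  # key '{:08}'.format(0)
        value = np.frombuffer(datum.data, dtype=np.float64)[0]
        print('first label:', value)  # should equal X[0,0,0,0] / X[0,1,0,0]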
