
Caffe hangs when creating data layer #3965

Description

@royitaqi

I created the simplest possible net to learn the division ("/") function: the input is A and B, and the label is A/B. However, when I try to run the trainer, it hangs forever. If I do killall caffe, I see that it's waiting on a BlockingQueue. I searched around and found a mention (I didn't note down the source) that this can be caused by the training and testing phases sharing the same lmdb, so I copied the same data into separate training and testing folders, but the problem persists.

I'm wondering why it hangs, and how I should debug this problem.
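
In case it's relevant, here is the sanity check I can run against the LMDBs (a minimal sketch using the standard py-lmdb and protobuf APIs; adjust the database names to match whatever directories the net actually opens):

import lmdb
import caffe

# Count the records in each LMDB and parse the first one back into a
# Datum, to rule out an empty or corrupt database.
for name in ('training', 'training_label'):
    env = lmdb.open(name, readonly=True)
    with env.begin() as txn:
        print('%s: %d entries' % (name, txn.stat()['entries']))
        cur = txn.cursor()
        if cur.first():  # positions the cursor on the first record, if any
            key, value = cur.item()
            datum = caffe.proto.caffe_pb2.Datum()
            datum.ParseFromString(value)
            print('  first datum: %d x %d x %d' % (datum.channels, datum.height, datum.width))
    env.close()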

Here is the console output:

[tw-mbp-rshi playgit]$ caffe train --solver solver.prototxt
I0408 21:57:11.489527 1949106944 caffe.cpp:178] Use CPU.
I0408 21:57:11.493430 1949106944 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 2
base_lr: 0.01
display: 1
max_iter: 100
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5
snapshot_prefix: "snapshot"
solver_mode: CPU
net: "net.prototxt"
I0408 21:57:11.494869 1949106944 solver.cpp:91] Creating training net from net file: net.prototxt
I0408 21:57:11.495998 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing
I0408 21:57:11.496026 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing_label
I0408 21:57:11.496052 1949106944 net.cpp:49] Initializing net from parameters:
state {
  phase: TRAIN
}
layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}
layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}
layer {
  name: "full"
  type: "InnerProduct"
  bottom: "data"
  top: "full"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}
I0408 21:57:11.496322 1949106944 layer_factory.hpp:77] Creating layer training
I0408 21:57:11.503118 1949106944 net.cpp:91] Creating Layer training
I0408 21:57:11.503237 1949106944 net.cpp:399] training -> data
I0408 21:57:11.504497 186691584 db_lmdb.cpp:38] Opened lmdb training
*** Aborted at 1460178183 (unix time) try "date -d @1460178183" if you are using GNU date ***
PC: @     0x7fff8f110136 __psynch_cvwait
*** SIGTERM (@0x7fff8f110136) received by PID 6373 (TID 0x7fff742d0300) stack trace: ***
    @     0x7fff89d17f1a _sigtramp
    @     0x7fff5850c620 (unknown)
    @        0x10784869b boost::condition_variable::wait()
    @        0x107849687 caffe::BlockingQueue<>::peek()
    @        0x1077b6f46 caffe::DataLayer<>::DataLayerSetUp()
    @        0x1077a640e caffe::BasePrefetchingDataLayer<>::LayerSetUp()
    @        0x1078148e7 caffe::Net<>::Init()
    @        0x107813385 caffe::Net<>::Net()
    @        0x10782f090 caffe::Solver<>::InitTrainNet()
    @        0x10782e3e7 caffe::Solver<>::Init()
    @        0x10782e0de caffe::Solver<>::Solver()
    @        0x10783e8a8 caffe::SGDSolver<>::SGDSolver()
    @        0x107844182 caffe::Creator_SGDSolver<>()
    @        0x1076f3137 train()
    @        0x1076f5721 main
    @     0x7fff90c165c9 start
Terminated: 15
[tw-mbp-rshi playgit]$

Here is my solver.prototxt:

# The train/test net protocol buffer definition
net: "net.prototxt"

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 1

# Carry out testing every 2 training iterations.
test_interval: 2

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005

# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75

# Display every iteration
display: 1

# The maximum number of iterations
max_iter: 100

# snapshot intermediate results
snapshot: 5
snapshot_prefix: "snapshot"

# solver mode: CPU or GPU
solver_mode: CPU
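
To spell out the arithmetic from the MNIST comment above: each test phase covers test_iter x test batch_size samples, so with my settings a test pass only ever sees one example (a quick worked check, assuming the batch_size of 1 from my label layers applies):

# samples per test phase = test_iter * test batch_size
print(100 * 100)  # MNIST example: 10000, the full test set
print(1 * 1)      # my settings: 1 sample per test pass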

Here is my net.prototxt:

layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}

layer {
  name: "testing"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "testing"
    backend: LMDB
  }
}

layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "testing_label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "testing_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "full"
  type: "InnerProduct"
  # learning rate and decay multipliers for the weights
  param { lr_mult: 1 decay_mult: 1 }
  # learning rate and decay multipliers for the biases
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  bottom: "data"
  top: "full"
}

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}
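
For reference, the whole model is just a single linear unit under a Euclidean loss. Roughly, in NumPy terms (a sketch of what I understand the net to compute, with made-up values, not Caffe's actual code):

import numpy as np

# One sample: data = (A, B), label = A / B.
x = np.array([9.0, 7.0])       # the "data" top, shape (2,)
label = np.array([9.0 / 7.0])  # the "label" top, shape (1,)

# "full": InnerProduct with num_output: 1, i.e. pred = W.x + b.
W = np.random.randn(1, 2) * 0.01  # gaussian weight_filler, std 0.01
b = np.zeros(1)                   # constant bias_filler, value 0
pred = W.dot(x) + b

# "loss": EuclideanLoss, which I believe is 1/(2N) * sum((pred - label)**2)
# with N the batch size (here N = 1).
loss = 0.5 * np.sum((pred - label) ** 2)
print(loss)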

Here is how I generated the training and label data:

import numpy as np
import lmdb
import caffe
import random

N = 100

# Let's pretend this is interesting data
X = np.zeros((N, 2, 1, 1), dtype=np.float64)
y = np.zeros(N, dtype=np.float64)

random.seed(0)

for i in range(0, N):
    X[i,0,0,0] = random.uniform(8, 10)
    X[i,1,0,0] = random.uniform(6, 8)
    y[i] = X[i,0,0,0] / X[i,1,0,0]


with lmdb.open('training', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tobytes()  # raw float64 bytes
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())


with lmdb.open('label', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = 1
            datum.height = 1
            datum.width = 1
            datum.data = y[i].tobytes()  # raw float64 bytes
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())
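
And, continuing from the script above, a round-trip check on the labels (a sketch: it reads the first record back and decodes the float64 bytes with np.frombuffer):

with lmdb.open('label', readonly=True) as db:
    with db.begin() as txn:
        datum = caffe.proto.caffe_pb2.Datum()
        datum.ParseFromString(txn.get(b'00000000'))  # key '{:08}'.format(0)
        value = np.frombuffer(datum.data, dtype=np.float64)[0]
        print('first label:', value)  # should equal X[0,0,0,0] / X[0,1,0,0]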
