
Torch generic inference #345 (Merged)

lukeyeager merged 3 commits into NVIDIA:master from gheinrich:dev/torch_generic on Oct 13, 2015

Conversation

@gheinrich (Contributor)

No description provided.

@lukeyeager (Member)

I tried pasting in the example network from digits/model/images/generic/test_views.py and got this error:

2015-10-06 10:35:07 [INFO ] subtractMean parameter is not considered as mean image path is unset
2015-10-06 10:35:07 [INFO ] Loading network definition from /raid/jobs/dev/20151006-103505-13e5/model
/home/lyeager/torch/install/bin/luajit: /home/lyeager/torch/install/share/lua/5.1/trepl/init.lua:363: error loading module 'model' from file '/raid/jobs/dev/20151006-103505-13e5/model.lua':
/raid/jobs/dev/20151006-103505-13e5/model.lua:1: unexpected symbol near '+'
stack traceback:
[C]: in function 'error'
/home/lyeager/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
/home/lyeager/digits/tools/torch/main.lua:182: in main chunk
[C]: in function 'dofile'
...ager/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

@gheinrich (Contributor, Author)

Just checking... did you paste the + sign from the diff by any chance?

@lukeyeager (Member)

Hmm, my model has a bunch of +'s in it.

+local net = nn.Sequential()
+net:add(nn.MulConstant(0.004))
+net:add(nn.View(-1):setNumInputDims(3))  -- 1*10*10 -> 100
+net:add(nn.Linear(100,2))
+return function(params)
+    return {
+        model = net,
+        loss = nn.MSECriterion(),
+    }   
+end

That's weird.

@lukeyeager (Member)

Oh I see what you mean. Haha, whoops!

@lukeyeager (Member)

Yep, that's all it was. My bad. New problem:

Traceback (most recent call last):
  File "/home/lyeager/digits/digits/scheduler.py", line 454, in run_task
    task.run(resources)
  File "/home/lyeager/digits/digits/task.py", line 188, in run
    args = self.task_arguments(resources, env )
  File "/home/lyeager/digits/digits/model/tasks/torch_train.py", line 165, in task_arguments
    args.append('--train_labels=%s' % train_labels_db.path(train_labels_db.database))
AttributeError: 'NoneType' object has no attribute 'path'

Labels are optional (this could be used for unsupervised learning).

@gheinrich (Contributor, Author)

Oh thanks! I didn't know DIGITS supported unsupervised learning. What kind of networks have you been using? Perhaps I can try an autoencoder...

@lukeyeager (Member)

I don't actually have an example - I'm not sure why I had that dataset lying around.

Perhaps I can try an autoencoder...

We still need some kind of a standard network for "generic" networks. Is there a standard autoencoder network out there that would do something interesting for arbitrary input data?

@gheinrich (Contributor, Author)

Apparently BVLC/caffe#330 added one to Caffe for MNIST-like datasets. The model is derived from an article Hinton published (the Caffe implementation seems much simpler as pre-training with RBMs appears to be entirely skipped).

@lukeyeager (Member)

Good spot!

I tried it and it seems to run, I guess.

[screenshot: autoencoder-graph]

Apparently cross-entropy loss can be [very] negative? I had to hack the graph to show loss values less than 0.

[screenshot: autoencoder-activations]

The activations seem to have some concept of spatial relationships, even though the data is flattened. Not sure what's going on there, but it's interesting!

@gheinrich (Contributor, Author)

Copying @j-wilson in case he's interested in reviewing. In theory this should allow DIGITS to use Torch to train any feed-forward network, but as @lukeyeager pointed out, this isn't working yet for unsupervised learning. I am working on this now.

@gheinrich (Contributor, Author)

@lukeyeager I was able to train a simple auto-encoder using Torch. The network compresses MNIST-like images into 100 neurons and then tries to reconstruct the original image. The input may look like:
[screenshot: autoencoder-input]
And the output may look like this (it kind of looks like the same number; colors differ due to the visualization):
[screenshot: autoencoder-output]
I will post a new patch with the code that supports this.
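For the record, here is a minimal sketch of what such an auto-encoder could look like in the model format used in this thread. This is an illustrative guess, not the actual patch: the layer sizes assume 1x28x28 grayscale input, and the Sigmoid activations and MulConstant scale factor are assumptions.

require 'nn'

-- Hypothetical auto-encoder sketch (not the actual code from this PR):
-- flatten a 1x28x28 image to 784 values, compress it to 100 neurons,
-- then try to reconstruct the original 784 pixels.
local autoencoder = nn.Sequential()
autoencoder:add(nn.MulConstant(0.00390625))       -- scale [0,255] -> [0,1]
autoencoder:add(nn.View(-1):setNumInputDims(3))   -- 1*28*28 -> 784
autoencoder:add(nn.Linear(784, 100))              -- encoder: 784 -> 100
autoencoder:add(nn.Sigmoid())
autoencoder:add(nn.Linear(100, 784))              -- decoder: 100 -> 784
autoencoder:add(nn.Sigmoid())
return function(params)
    return {
        model = autoencoder,
        loss = nn.MSECriterion(),                 -- reconstruction loss
    }
end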

@lukeyeager (Member)

That looks promising! I'd love to see this added as a "standard network" for generic inference if we can set the architecture to be somewhat independent of the image size (like AlexNet for Caffe).

@gheinrich force-pushed the dev/torch_generic branch 2 times, most recently from 9d50e9f to 7f3d5c0 on October 9, 2015 at 08:53
@gheinrich (Contributor, Author)

@lukeyeager do you think the autoencoder standard network needs to be added to merge this PR?

@lukeyeager (Member)

@lukeyeager do you think the autoencoder standard network needs to be added to merge this PR?

No. Let's do that later. It looks like you updated this to make labels optional - let me review that real quick ...

@lukeyeager (Member)

Argh. I still can't get the example to work. Sorry for all the screenshots, but I want to point out how many failures don't produce a helpful error message that DIGITS can display.

I tried running the example on 28x28 grayscale images.

[screenshot: torch-generic-no-nn-error]

Turns out that error means I forgot to include the require 'nn' line.

[screenshot: torch-generic-size-mismatch]

So I changed net:add(nn.Linear(100,2)) to net:add(nn.Linear(784,2)).

[screenshot: torch-generic-bad-argument]

I get that error even if I switch to 10x10 grayscale images and use net:add(nn.Linear(100,2)). I assume the problem is that I'm using 10x10x1 data instead of 1x10x10 data? But I don't know how to change the network to fix it. And neither Darshan nor Boris were able to figure it out when I asked them either.
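For context, nn.View(-1):setNumInputDims(3) flattens a CxHxW image into C*H*W values, so the nn.Linear that follows must use that product as its input size. A quick illustrative check (hypothetical snippet, assuming a working Torch install with the nn package):

require 'nn'

-- nn.View(-1):setNumInputDims(3) flattens a CxHxW image into C*H*W
-- values, so the following nn.Linear must use that product as its
-- input size: 1*28*28 = 784, 1*10*10 = 100.
local flatten = nn.View(-1):setNumInputDims(3)
print(flatten:forward(torch.rand(1, 28, 28)):size(1))  -- prints 784
print(flatten:forward(torch.rand(1, 10, 10)):size(1))  -- prints 100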

@lukeyeager (Member)

It would be helpful if we put an actual helptip for the "custom torch network" pane. Right now, the caffe helptip is still there, mocking me.

@gheinrich (Contributor, Author)

Admittedly, these error messages are somewhat cryptic, though I expect the occasional Torch practitioner to find them mostly straightforward. You figured out the reason for the first two errors. The third error is due to the fact that you're using an MSE loss, which requires the network output and the label to have the same dimensions. If you're using the example from models/images/generic/test_views.py then you need to provide vector labels. Perhaps you're using a classification dataset (the label is a scalar and it's being compared against a vector), or maybe you're using a dataset without labels?
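A tiny illustration of that constraint (a hypothetical snippet, not part of the PR):

require 'nn'

-- nn.MSECriterion compares input and target element-wise, so both
-- tensors must have the same shape.
local crit = nn.MSECriterion()
local output = torch.rand(2)                 -- the network emits a 2-vector
print(crit:forward(output, torch.rand(2)))   -- OK: shapes match
-- crit:forward(output, torch.rand(1))       -- would fail with a size mismatch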
These errors are reported by Torch APIs, and it is possible, though not trivial, to catch them and report a more helpful message. Either way, I agree we need better documentation. I have updated the Torch documentation with examples of various network architectures (classification, regression, unsupervised learning with an auto-encoder). Also, I am now doing a require('nn') in the Lua wrapper, so it is no longer necessary to do it from the model description.

Note: I have pushed a change to increase the timeouts again; otherwise I find it difficult to reliably pass the Travis tests. I will remove it before the merge.

@lukeyeager (Member)

The third error is due to the fact that you're using an MSE loss which requires the network output and the label to have the same dimension.

Oh right, that was dumb - that network is expecting labelled data. D'oh! I've got it working now.

  1. Use digits/dataset/images/generic/test_lmdb_creator.py to create the test data
  2. Use this for my custom Torch network:
local net = nn.Sequential()
net:add(nn.MulConstant(0.004))
net:add(nn.View(-1):setNumInputDims(3))  -- 1*10*10 -> 100
net:add(nn.Linear(100,2))
return function(params)
    return {
        model = net,
        loss = nn.MSECriterion(),
    }
end

[screenshot: torch-generic-working]

@lukeyeager (Member)

I have pushed a change to increase timeouts again otherwise I find it difficult to reliably pass Travis tests. I will remove it before the merge.

Yeah, let's keep extending the timeouts. Sometimes Travis just slows down, and we shouldn't let that kill our build.

Feel free to put that commit in another PR and merge immediately.

@lukeyeager (Member)

Ok, I think this is ready to merge unless you have any other concerns. I still don't like the error messages, but our Caffe errors aren't usually much more legible. We can try to address that in later commits.

@lukeyeager mentioned this pull request on Oct 12, 2015
@lukeyeager (Member)

Feel free to put that commit in another PR and merge immediately.

NVM, I did it in #357 already. @jmancewicz, hopefully that will help with some of your testing troubles in other PRs.

@gheinrich (Contributor, Author)

I think/hope it's good enough for merging.
I agree error messages should be improved. I will see how I can add checks and assert statements in the Lua wrappers to catch common errors.
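For instance, the wrapper could validate the model description up front, along these lines (a rough sketch under the conventions assumed in this thread, not actual PR code):

-- Hypothetical validation sketch for the Lua wrapper: fail early with a
-- readable message instead of a deep Torch stack trace. Assumes the model
-- file returns a function, as in the examples in this thread.
local ok, factory = pcall(require, 'model')
assert(ok, 'could not load the model description: ' .. tostring(factory))
assert(type(factory) == 'function', 'the model file must return a function')
local network = factory({})  -- the real wrapper would pass actual parameters
assert(type(network) == 'table' and network.model and network.loss,
       'the model function must return a table with "model" and "loss" fields')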

lukeyeager added a commit that referenced this pull request Oct 13, 2015
@lukeyeager merged commit bb2b76f into NVIDIA:master on Oct 13, 2015
@lukeyeager (Member)

Merged. I forgot to ask you to remove your "Increase timeouts" commit before merging. Oh well.
