
Torch generic inference #345 (Merged)

lukeyeager merged 3 commits into NVIDIA:master from gheinrich:dev/torch_generic on Oct 13, 2015

Conversation

@gheinrich (Contributor)

No description provided.

@lukeyeager (Member)

I tried pasting in the example network from digits/model/images/generic/test_views.py and got this error:

2015-10-06 10:35:07 [INFO ] subtractMean parameter is not considered as mean image path is unset
2015-10-06 10:35:07 [INFO ] Loading network definition from /raid/jobs/dev/20151006-103505-13e5/model
/home/lyeager/torch/install/bin/luajit: /home/lyeager/torch/install/share/lua/5.1/trepl/init.lua:363: error loading module 'model' from file '/raid/jobs/dev/20151006-103505-13e5/model.lua':
/raid/jobs/dev/20151006-103505-13e5/model.lua:1: unexpected symbol near '+'
stack traceback:
[C]: in function 'error'
/home/lyeager/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
/home/lyeager/digits/tools/torch/main.lua:182: in main chunk
[C]: in function 'dofile'
...ager/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670

@gheinrich (Contributor, Author)

Just checking... did you paste the + sign from the diff by any chance?

@lukeyeager (Member)

Hmm, my model has a bunch of +'s in it.

+local net = nn.Sequential()
+net:add(nn.MulConstant(0.004))
+net:add(nn.View(-1):setNumInputDims(3))  -- 1*10*10 -> 100
+net:add(nn.Linear(100,2))
+return function(params)
+    return {
+        model = net,
+        loss = nn.MSECriterion(),
+    }   
+end

That's weird.

@lukeyeager (Member)

Oh I see what you mean. Haha, whoops!

@lukeyeager (Member)

Yep, that's all it was. My bad. New problem:

Traceback (most recent call last):
  File "/home/lyeager/digits/digits/scheduler.py", line 454, in run_task
    task.run(resources)
  File "/home/lyeager/digits/digits/task.py", line 188, in run
    args = self.task_arguments(resources, env )
  File "/home/lyeager/digits/digits/model/tasks/torch_train.py", line 165, in task_arguments
    args.append('--train_labels=%s' % train_labels_db.path(train_labels_db.database))
AttributeError: 'NoneType' object has no attribute 'path'

Labels are optional (this could be used for unsupervised learning).

@gheinrich (Contributor, Author)

Oh thanks! I didn't know DIGITS supported unsupervised learning. What kind of networks have you been using? Perhaps I can try an autoencoder...

@lukeyeager (Member)

I don't actually have an example - I'm not sure why I had that dataset lying around.

Perhaps I can try an autoencoder...

We still need some kind of a standard network for "generic" networks. Is there a standard autoencoder network out there that would do something interesting for arbitrary input data?

@gheinrich (Contributor, Author)

Apparently BVLC/caffe#330 added one to Caffe for MNIST-like datasets. The model is derived from an article Hinton published (the Caffe implementation seems much simpler as pre-training with RBMs appears to be entirely skipped).

@lukeyeager (Member)

Good spot!

I tried it and it seems to run, I guess.

[screenshot: autoencoder-graph]

Apparently cross-entropy loss can be [very] negative? I had to hack the graph to show loss values less than 0.

[screenshot: autoencoder-activations]

The activations seem to have some concept of spatial relationships, even though the data is flattened. Not sure what's going on there, but it's interesting!

@gheinrich (Contributor, Author)

Copying @j-wilson in case he's interested in reviewing. In theory this should allow DIGITS to use Torch to train any feed-forward network, but as @lukeyeager pointed out, this isn't working yet for unsupervised learning. I am working on this now.

@gheinrich (Contributor, Author)

@lukeyeager I was able to train a simple auto-encoder using Torch. The network compresses MNIST-like images into 100 neurons and then tries to reconstruct the original image. The input may look like:
[screenshot: autoencoder-input]
And the output may look like this (it kind of looks like the same number; colors differ due to the visualization):
[screenshot: autoencoder-output]
I will post a new patch with the code that supports this.
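For the record, here is a minimal sketch of what such an auto-encoder could look like in the model format used in this thread. This is an illustrative guess, not the actual patch: the layer sizes assume 1x28x28 grayscale input, and the Sigmoid activations and MulConstant scale factor are assumptions.

require 'nn'

-- Hypothetical auto-encoder sketch (not the actual code from this PR):
-- flatten a 1x28x28 image to 784 values, compress it to 100 neurons,
-- then try to reconstruct the original 784 pixels.
local autoencoder = nn.Sequential()
autoencoder:add(nn.MulConstant(0.00390625))       -- scale [0,255] -> [0,1]
autoencoder:add(nn.View(-1):setNumInputDims(3))   -- 1*28*28 -> 784
autoencoder:add(nn.Linear(784, 100))              -- encoder: 784 -> 100
autoencoder:add(nn.Sigmoid())
autoencoder:add(nn.Linear(100, 784))              -- decoder: 100 -> 784
autoencoder:add(nn.Sigmoid())
return function(params)
    return {
        model = autoencoder,
        loss = nn.MSECriterion(),                 -- reconstruction loss
    }
end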

@lukeyeager (Member)

That looks promising! I'd love to see this added as a "standard network" for generic inference if we can set the architecture to be somewhat independent of the image size (like AlexNet for Caffe).

@gheinrich force-pushed the dev/torch_generic branch 2 times, most recently from 9d50e9f to 7f3d5c0 on October 9, 2015 at 08:53
@gheinrich (Contributor, Author)

@lukeyeager do you think the autoencoder standard network needs to be added to merge this PR?

@lukeyeager (Member)

@lukeyeager do you think the autoencoder standard network needs to be added to merge this PR?

No. Let's do that later. It looks like you updated this to make labels optional - let me review that real quick ...

@lukeyeager (Member)

Argh. I still can't get the example to work. Sorry for all the screenshots, but I want to point out how many failures don't produce a helpful error message that DIGITS can display.

I tried running the example on 28x28 grayscale images.

[screenshot: torch-generic-no-nn-error]

Turns out that error means I forgot to include the require 'nn' line.

[screenshot: torch-generic-size-mismatch]

So I changed net:add(nn.Linear(100,2)) to net:add(nn.Linear(784,2)).

[screenshot: torch-generic-bad-argument]

I get that error even if I switch to 10x10 grayscale images and use net:add(nn.Linear(100,2)). I assume the problem is that I'm using 10x10x1 data instead of 1x10x10 data? But I don't know how to change the network to fix it. And neither Darshan nor Boris were able to figure it out when I asked them either.
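For context, nn.View(-1):setNumInputDims(3) flattens a CxHxW image into C*H*W values, so the nn.Linear that follows must use that product as its input size. A quick illustrative check (hypothetical snippet, assuming a working Torch install with the nn package):

require 'nn'

-- nn.View(-1):setNumInputDims(3) flattens a CxHxW image into C*H*W
-- values, so the following nn.Linear must use that product as its
-- input size: 1*28*28 = 784, 1*10*10 = 100.
local flatten = nn.View(-1):setNumInputDims(3)
print(flatten:forward(torch.rand(1, 28, 28)):size(1))  -- prints 784
print(flatten:forward(torch.rand(1, 10, 10)):size(1))  -- prints 100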

@lukeyeager (Member)

It would be helpful if we put an actual helptip for the "custom torch network" pane. Right now, the caffe helptip is still there, mocking me.

@gheinrich (Contributor, Author)

Admittedly, these error messages are somewhat cryptic, though I expect the occasional Torch practitioner to find them mostly straightforward. You figured out the reason for the first two errors. The third error is due to the fact that you're using an MSE loss, which requires the network output and the label to have the same dimensions. If you're using the example from models/images/generic/test_views.py then you need to provide vector labels. Perhaps you're using a classification dataset (the label is a scalar and it's being compared against a vector), or maybe you're using a dataset without labels?
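A tiny illustration of that constraint (a hypothetical snippet, not part of the PR):

require 'nn'

-- nn.MSECriterion compares input and target element-wise, so both
-- tensors must have the same shape.
local crit = nn.MSECriterion()
local output = torch.rand(2)                 -- the network emits a 2-vector
print(crit:forward(output, torch.rand(2)))   -- OK: shapes match
-- crit:forward(output, torch.rand(1))       -- would fail with a size mismatch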
These errors are reported by Torch APIs, and it is possible, though not trivial, to catch them and report a more helpful message. Either way, I agree we need better documentation. I have updated the Torch documentation with examples of various network architectures (classification, regression, unsupervised learning with an auto-encoder). Also, I am now doing a require('nn') in the Lua wrapper, so it is no longer necessary to do it from the model description.

Note: I have pushed a change to increase the timeouts again; otherwise I find it difficult to reliably pass the Travis tests. I will remove it before the merge.

@lukeyeager (Member)

The third error is due to the fact that you're using an MSE loss which requires the network output and the label to have the same dimension.

Oh right, that was dumb - that network is expecting labelled data. D'oh! I've got it working now.

  1. Use digits/dataset/images/generic/test_lmdb_creator.py to create the test data
  2. Use this for my custom Torch network:
local net = nn.Sequential()
net:add(nn.MulConstant(0.004))
net:add(nn.View(-1):setNumInputDims(3))  -- 1*10*10 -> 100
net:add(nn.Linear(100,2))
return function(params)
    return {
        model = net,
        loss = nn.MSECriterion(),
    }
end

[screenshot: torch-generic-working]

@lukeyeager (Member)

I have pushed a change to increase timeouts again otherwise I find it difficult to reliably pass Travis tests. I will remove it before the merge.

Yeah, let's keep extending the timeouts. Sometimes Travis just slows down, and we shouldn't let that kill our build.

Feel free to put that commit in another PR and merge immediately.

@lukeyeager (Member)

Ok, I think this is ready to merge unless you have any other concerns. I still don't like the error messages, but our Caffe errors aren't usually much more legible. We can try to address that in later commits.

@lukeyeager mentioned this pull request on Oct 12, 2015
@lukeyeager (Member)

Feel free to put that commit in another PR and merge immediately.

NVM, I did it in #357 already. @jmancewicz, hopefully that will help with some of your testing troubles in other PRs.

@gheinrich (Contributor, Author)

I think/hope it's good enough for merging.
I agree error messages should be improved. I will see how I can add checks and assert statements in the Lua wrappers to catch common errors.
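For instance, the wrapper could validate the model description up front, along these lines (a rough sketch under the conventions assumed in this thread, not actual PR code):

-- Hypothetical validation sketch for the Lua wrapper: fail early with a
-- readable message instead of a deep Torch stack trace. Assumes the model
-- file returns a function, as in the examples in this thread.
local ok, factory = pcall(require, 'model')
assert(ok, 'could not load the model description: ' .. tostring(factory))
assert(type(factory) == 'function', 'the model file must return a function')
local network = factory({})  -- the real wrapper would pass actual parameters
assert(type(network) == 'table' and network.model and network.loss,
       'the model function must return a table with "model" and "loss" fields')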

lukeyeager added a commit that referenced this pull request Oct 13, 2015
@lukeyeager merged commit bb2b76f into NVIDIA:master on Oct 13, 2015
@lukeyeager (Member)

Merged. I forgot to ask you to remove your "Increase timeouts" commit before merging. Oh well.
