Merged
python/mxnet/metric.py (16 changes: 10 additions & 6 deletions)

@@ -380,23 +380,27 @@ def update(self, labels, preds):
         Parameters
         ----------
         labels : list of `NDArray`
-            The labels of the data.
+            The labels of the data with class indices as values, one per sample.

         preds : list of `NDArray`
-            Predicted values.
+            Prediction values for samples. Each prediction value can either be the class index,
+            or a vector of likelihoods for all classes.
         """
         check_label_shapes(labels, preds)

         for label, pred_label in zip(labels, preds):
             if pred_label.shape != label.shape:
                 pred_label = ndarray.argmax(pred_label, axis=self.axis)
-            pred_label = pred_label.asnumpy().astype('int32')
-            label = label.asnumpy().astype('int32')
+            pred_label = pred_label.astype('int32')
Member: `int64` is better?

Member Author (@szha, Jan 27, 2018): This requires larger space, which can show when the number of prediction classes is large (such as in NLP applications). Should I make it an option?

Member: `int32` should be good in most cases.

+            label = label.astype('int32')

             check_label_shapes(label, pred_label)

-            self.sum_metric += (pred_label.flat == label.flat).sum()
-            self.num_inst += len(pred_label.flat)
+            if pred_label.context != label.context:
+                pred_label = pred_label.as_in_context(label.context)
+
+            self.sum_metric += (pred_label.flatten() == label.flatten()).sum().asscalar()
Contributor: `asscalar` is equivalent to `asnumpy()[0]`.

This PR did not solve the problem of the metric being computed in the numpy world.

Member Author: The computation happens before `asnumpy()` is called, so nothing happens in the numpy world other than passing out a scalar value.

May I ask what your interest is in this PR? Do you have a use case that benefits from using ndarray for the metric?

Contributor: Maybe my understanding is flawed, but in order to bring this value back to the CPU into self.sum_metric, don't we need a wait_on_read on pred_label? If we have a loop like:

for batch in data:
    ...
    metric.update()

we will wait for pred_label to be computed before we load the next batch onto the GPU.

Whereas if we had

self.sum_metric = nd.zeros(1, ctx=ctx)
self.sum_metric += (pred_label.flatten() == label.flatten()).sum()

then we wouldn't block on this operation before loading the next batch of data; it would just enqueue more operations on the backend.

I have run several experiments where computing the loss with .asscalar() on every loop iteration slows training by 2-25%.

See:
ilkarman/DeepLearningFrameworks#55
zackchase/mxnet-the-straight-dope#455
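A minimal sketch of the non-blocking accumulation this comment proposes; the toy batches list is a hypothetical stand-in for a real model and data loader, not part of this PR:

import mxnet as mx
from mxnet import nd

ctx = mx.cpu()  # use mx.gpu(0) when available
sum_metric = nd.zeros(1, ctx=ctx)  # accumulator stays on-device
num_inst = 0

# Hypothetical stand-ins for per-batch predictions and labels.
batches = [(nd.random.uniform(shape=(4, 3), ctx=ctx),
            nd.array([0, 1, 2, 1], ctx=ctx)) for _ in range(10)]

for pred, label in batches:
    pred_label = nd.argmax(pred, axis=1)  # float32 class indices
    # This enqueues work on the backend without forcing a sync, so the
    # next batch can be prepared while the device is still busy.
    sum_metric += (pred_label == label).sum()
    num_inst += pred_label.size

# One blocking read at the end retrieves the accumulated value.
accuracy = sum_metric.asscalar() / num_inst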

Contributor: @szha, to be precise about what I mean: I don't think using numpy is slower in itself; it is that having a blocking operation in the loop limits the parallelization happening in the backend and leads to GPU starvation.

Member: However, completely non-blocking logic will cause other problems. To be more specific, the allocated NDArrays cannot be reused and will eventually cause an OOM. We should use asscalar() to avoid this.

Contributor (@ThomasDelteil, Mar 19, 2018): Thanks for the clarification @sxjscience, but why is that so? Once processed in the computation graph, aren't they garbage collected?

Member: I think it's because ops are pushed at a much faster rate than the real computation runs. The graph keeps expanding, and the allocation ops get executed to allocate new space even before the actual computation is performed. We have to call a blocking operator at some point to make sure the current calculation in the graph has completed. CC @piiswrong for this.

Contributor: Thanks, yes, that's my understanding. However, I think it should be left to the user to decide when to block, since it depends highly on their GPU and model size (e.g., every 100 batches or every epoch). Also, is there a reason the accuracy is stored on the CPU rather than on a specific context? My measurements showed great improvements when storing the accuracy on the GPU. Maybe, if you don't mind, we can continue the discussion there: #9571
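One way to reconcile the two concerns above is to block only periodically rather than on every batch. A sketch of that idea; sync_every and the random toy data are hypothetical, not part of this PR:

import mxnet as mx
from mxnet import nd

ctx = mx.cpu()  # use mx.gpu(0) when available
sum_metric = nd.zeros(1, ctx=ctx)
sync_every = 100  # hypothetical knob: how often to drain pending work

for i in range(1000):
    # Toy stand-ins for per-batch predictions and labels.
    pred = nd.random.uniform(shape=(4, 3), ctx=ctx)
    label = nd.array([0, 1, 2, 1], ctx=ctx)
    sum_metric += (nd.argmax(pred, axis=1) == label).sum()
    if (i + 1) % sync_every == 0:
        # A periodic blocking call bounds the number of queued ops (and
        # the temporaries they hold) without syncing on every batch.
        sum_metric.wait_to_read()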

+            self.num_inst += numpy.prod(pred_label.shape)


@register
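For reference, a minimal usage example of the metric this hunk modifies, following the updated docstring; the expected output is derived by hand, not a captured run:

import mxnet as mx
from mxnet import nd

acc = mx.metric.Accuracy()
labels = [nd.array([0, 1, 1])]
# Predictions may be per-class likelihood vectors; argmax gives [0, 1, 0].
preds = [nd.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])]
acc.update(labels, preds)
print(acc.get())  # ('accuracy', 0.666...): two of three samples match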