Add vocabulary and embedding #10074

szha merged 20 commits into apache:nlp_toolkit from astonzhang:nlp_toolkit
Conversation
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)
  * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)". This reverts commit 8930d96.
  * Some more LICENSE fixes
  * Adding some more packages to the LICENSE file
  * Adding dependencies of dependencies
  * update v1.1.0 change log to NEWS.md
  * sync README.md from v1.1.0 branch
  * revert to correct jenkins url in README
* Parallelization for ROIPooling OP
  * parallelization for roipooling
  * remove some useless computation
  * remove useless muls
  * add author and retriggering
  * retrigger again
* Bug fix and performance optimization for rtc
  1. The "super().__init__()" bug is fixed in Python 2.
  2. The kernel is initialized at operator init time.
  * Update custom_softmax_rtc.py: fix unnecessary format
```python
def test_token_embedding_from_file():
    embed_root = 'embedding'
```
Please use a tempfile instead
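The suggested change could look like the sketch below, assuming the test only needs a writable directory for its embedding files; `my_embed.vec` and the vector contents are hypothetical. `tempfile.mkdtemp` is used rather than `TemporaryDirectory` since the codebase still supported Python 2 at the time.

```python
import os
import shutil
import tempfile

def test_token_embedding_from_file():
    # Create the embedding files under a temporary directory instead of a
    # hard-coded 'embedding' path, so the test cleans up after itself.
    embed_root = tempfile.mkdtemp()
    try:
        # Hypothetical embedding file: one token followed by its vector.
        embed_path = os.path.join(embed_root, 'my_embed.vec')
        with open(embed_path, 'w') as f:
            f.write('hello 0.1 0.2 0.3\n')
        assert os.path.isfile(embed_path)
    finally:
        # Always remove the temporary directory, even if the test fails.
        shutil.rmtree(embed_root)
```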
```python
def test_vocab_set_embedding_with_one_custom_embedding():
    embed_root = 'embedding'
```
Please use a tempfile instead
```python
def test_vocabulary_with_two_custom_embeddings():
    embed_root = '.'
```
Please use a tempfile instead
```python
# coding: utf-8
# pylint: disable=consider-iterating-dictionary
# pylint: disable=super-init-not-called
# pylint: disable=arguments-differ
```
Are the last two pylint ignores really invalid?
This should be in gluon.
```python
Examples
--------
>>> @mxnet.contrib.text.embedding.register
```
```python
from . import _constants as C
from mxnet import ndarray as nd
from mxnet import nd
```
```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)
```
It doesn't seem necessary to create vocab just to access embedding vector.
The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Should explain why a counter is needed first.
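The motivation the reviewer asks for can be illustrated with a rough stand-in for `count_tokens_from_str` (this uses plain `collections.Counter` and is not the actual MXNet implementation): a vocabulary needs token frequencies so it can keep only the most frequent tokens and order its indices by frequency.

```python
from collections import Counter

# Illustrative equivalent of counting tokens from a string:
# split on whitespace and count occurrences of each token.
text_data = " hello world \n hello nice world \n hi world \n"
counter = Counter(text_data.split())

# The frequencies are what a vocabulary consumes to decide which tokens
# to index and in what order (most frequent first).
print(counter['world'])  # 3
print(counter['hello'])  # 2
```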
file is:

```
(token_1)(ed)(v_11)(ed)(v_12)(ed)...(ed)(v_1k)\\n
(token_2)(ed)(v_21)(ed)(v_22)(ed)...(ed)(v_2k)\\n...
```
Use an example for the file format inside a code block so that it's easier to understand the file format. Currently it looks confusing. http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-10074/10/api/python/gluon/text.html#mxnet.gluon.text.embedding.TokenEmbedding.from_file
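A concrete example along the lines the reviewer suggests, assuming a space as the element delimiter `(ed)` (as in fastText/GloVe `.vec` files); the tokens and vector values are made up for illustration:

```python
# Hypothetical instance of the token-embedding file format described above:
# each line holds a token followed by its k-dimensional vector, elements
# separated by a space, lines separated by '\n'.
sample = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

# Parse it into a dict mapping token -> vector.
embeddings = {}
for line in sample.strip().split('\n'):
    parts = line.split(' ')
    token, vec = parts[0], [float(x) for x in parts[1:]]
    embeddings[token] = vec

print(embeddings['hello'])  # [0.1, 0.2, 0.3]
```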
```python
Examples
--------
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')
>>> text_data = " hello world \n hello nice world \n hi world \n"
```
```python
def __len__(self):
    return len(self._idx_to_token)
```
```python
def set_embedding(self, embeddings):
```
Use `(self, *embeddings)` instead; `embeddings` should not be a list. After the change, it should be possible to do `vocab.set_embedding(fasttext_emb, glove_embed)`.
Remember to update the doc/example accordingly.
Resolved, with updated test cases.
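The variadic API shape the reviewer suggests could be sketched as follows. This is an illustration only, not the real Gluon class: each "embedding" is modeled as a plain dict mapping token to a list of floats, and `set_embedding` concatenates the per-token vectors from each embedding, padding missing tokens with zeros.

```python
class Vocab(object):
    """Minimal sketch of a vocabulary supporting set_embedding(*embeddings)."""

    def __init__(self, idx_to_token):
        self._idx_to_token = idx_to_token
        self.embedding = None

    def set_embedding(self, *embeddings):
        # Accept one or more embeddings as positional arguments (not a
        # list), per the review suggestion. Each embedding contributes a
        # fixed-width vector; tokens absent from an embedding get zeros.
        dims = [len(next(iter(e.values()))) for e in embeddings]
        self.embedding = {
            tok: sum((e.get(tok, [0.0] * d) for e, d in zip(embeddings, dims)), [])
            for tok in self._idx_to_token
        }

# Hypothetical pre-trained embeddings, modeled as dicts:
fasttext_emb = {'hello': [0.1, 0.2], 'world': [0.3, 0.4]}
glove_emb = {'hello': [1.0], 'hi': [2.0]}

vocab = Vocab(['hello', 'world'])
vocab.set_embedding(fasttext_emb, glove_emb)
print(vocab.embedding['hello'])  # [0.1, 0.2, 1.0]
print(vocab.embedding['world'])  # [0.3, 0.4, 0.0]
```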
frequent words 'world' and 'hello' are also indexed.
### Assign token embedding to vocabulary
Assign doesn't seem like the right verb. Maybe attach?
```python
@property
def reserved_tokens(self):
    return self._reserved_tokens
```
Should the reserved_tokens property always include unknown_token, given that they are all indexed first?
Resolved with a more detailed API specification.
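The indexing convention under discussion can be made concrete with a small sketch (assumed layout, not the exact Gluon implementation): the unknown token gets index 0, reserved tokens come next, then data tokens in descending frequency order. Keeping `reserved_tokens` separate from `unknown_token` then matters, because the property returns only the explicitly reserved tokens.

```python
from collections import Counter

def build_index(counter, unknown_token='<unk>', reserved_tokens=()):
    # Index unknown_token first, then reserved tokens, then the remaining
    # tokens sorted by descending frequency.
    idx_to_token = [unknown_token] + list(reserved_tokens)
    for token, _ in counter.most_common():
        if token != unknown_token and token not in reserved_tokens:
            idx_to_token.append(token)
    return idx_to_token

counter = Counter({'world': 3, 'hello': 2, 'hi': 1})
idx = build_index(counter, reserved_tokens=('<pad>', '<bos>'))
print(idx)  # ['<unk>', '<pad>', '<bos>', 'world', 'hello', 'hi']
```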
* [MXNET-67] Sync master with v1.1.0 branch (apache#10031)
  * [REVIEW REQUIRED] Revert PR apache#9484 & add additional dependency licenses to LICENSE file (apache#9701)
    * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (apache#9484)". This reverts commit 8930d96.
    * Some more LICENSE fixes
    * Adding some more packages to the LICENSE file
    * Adding dependencies of dependencies
  * update v1.1.0 change log to NEWS.md
  * sync README.md from v1.1.0 branch
  * revert to correct jenkins url in README
  * Parallelization for ROIpooling OP (apache#9958)
    * parallelization for roipooling
    * remove some useless computation
    * remove useless muls
    * add author and retriggering
    * retrigger again
  * comments to copy and copyto are corrected (apache#10040)
  * Bug Fix and performance optimized for rtc (apache#10018)
    1. The "super().__init__()" bug is fixed in Python 2.
    2. The kernel is initialized at operator init time.
    * Update custom_softmax_rtc.py: fix unnecessary format
* set embedding
* Code and test revised
* api implementation done
* license and news
* readme and cpp
* pylint disable
* Add API doc
* less pylint disable
* remove contrib
* move to gluon, revise api doc
* fix import order
* re-test
* relative imports
* re-run test
* revise implementation, test case, and api doc
* re-test
Description

Add vocabulary and embedding

Checklist

Essentials

* Passed code style checking (make lint)

Changes

Comments