Add vocabulary and embedding #10074

szha merged 20 commits into apache:nlp_toolkit from astonzhang:nlp_toolkit
Conversation
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)
  * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)". This reverts commit 8930d96.
  * Some more LICENSE fixes
  * Adding some more packages to the LICENSE file
  * Adding dependencies of dependencies
  * update v1.1.0 change log to NEWS.md
  * sync README.md from v1.1.0 branch
  * revert to correct jenkins url in README
* Parallelization for ROIPooling OP
  * parallelization for roipooling
  * remove some useless computation
  * remove useless muls
  * add author and retriggering
  * retrigger again
* Bug fix and performance optimization for rtc
  1. The "super().__init__()" bug is fixed in Python 2.
  2. The kernel is initialized at operator init time.
  * Update custom_softmax_rtc.py: fix unnecessary format
```python
def test_token_embedding_from_file():
    embed_root = 'embedding'
```
Please use a tempfile instead
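The suggested change could look like the sketch below, assuming the test only needs a writable directory for its embedding files; `my_embed.vec` and the vector contents are hypothetical. `tempfile.mkdtemp` is used rather than `TemporaryDirectory` since the codebase still supported Python 2 at the time.

```python
import os
import shutil
import tempfile

def test_token_embedding_from_file():
    # Create the embedding files under a temporary directory instead of a
    # hard-coded 'embedding' path, so the test cleans up after itself.
    embed_root = tempfile.mkdtemp()
    try:
        # Hypothetical embedding file: one token followed by its vector.
        embed_path = os.path.join(embed_root, 'my_embed.vec')
        with open(embed_path, 'w') as f:
            f.write('hello 0.1 0.2 0.3\n')
        assert os.path.isfile(embed_path)
    finally:
        # Always remove the temporary directory, even if the test fails.
        shutil.rmtree(embed_root)
```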
```python
def test_vocab_set_embedding_with_one_custom_embedding():
    embed_root = 'embedding'
```
Please use a tempfile instead
```python
def test_vocabulary_with_two_custom_embeddings():
    embed_root = '.'
```
Please use a tempfile instead
```python
# coding: utf-8
# pylint: disable=consider-iterating-dictionary
# pylint: disable=super-init-not-called
# pylint: disable=arguments-differ
```
Are the last two pylint ignores really invalid?
This should be in gluon.
```python
Examples
--------
>>> @mxnet.contrib.text.embedding.register
```
```python
from . import _constants as C
from mxnet import ndarray as nd
from mxnet import nd
```
```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)
```
It doesn't seem necessary to create vocab just to access embedding vector.
The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Should explain why a counter is needed first.
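The motivation the reviewer asks for can be illustrated with a rough stand-in for `count_tokens_from_str` (this uses plain `collections.Counter` and is not the actual MXNet implementation): a vocabulary needs token frequencies so it can keep only the most frequent tokens and order its indices by frequency.

```python
from collections import Counter

# Illustrative equivalent of counting tokens from a string:
# split on whitespace and count occurrences of each token.
text_data = " hello world \n hello nice world \n hi world \n"
counter = Counter(text_data.split())

# The frequencies are what a vocabulary consumes to decide which tokens
# to index and in what order (most frequent first).
print(counter['world'])  # 3
print(counter['hello'])  # 2
```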
file is:

```
(token_1)(ed)(v_11)(ed)(v_12)(ed)...(ed)(v_1k)\\n
(token_2)(ed)(v_21)(ed)(v_22)(ed)...(ed)(v_2k)\\n...
```
Use an example for the file format inside a code block so that it's easier to understand the file format. Currently it looks confusing. http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-10074/10/api/python/gluon/text.html#mxnet.gluon.text.embedding.TokenEmbedding.from_file
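A concrete example along the lines the reviewer suggests, assuming a space as the element delimiter `(ed)` (as in fastText/GloVe `.vec` files); the tokens and vector values are made up for illustration:

```python
# Hypothetical instance of the token-embedding file format described above:
# each line holds a token followed by its k-dimensional vector, elements
# separated by a space, lines separated by '\n'.
sample = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

# Parse it into a dict mapping token -> vector.
embeddings = {}
for line in sample.strip().split('\n'):
    parts = line.split(' ')
    token, vec = parts[0], [float(x) for x in parts[1:]]
    embeddings[token] = vec

print(embeddings['hello'])  # [0.1, 0.2, 0.3]
```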
```python
Examples
--------
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')
>>> text_data = " hello world \n hello nice world \n hi world \n"
```
```python
def __len__(self):
    return len(self._idx_to_token)
```
```python
def set_embedding(self, embeddings):
```
Use `(self, *embeddings)` instead; `embeddings` should not be a list. After the change, it should be possible to do `vocab.set_embedding(fasttext_emb, glove_embed)`.
Remember to update the doc/example accordingly.
Resolved, with updated test cases.
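The variadic API shape the reviewer suggests could be sketched as follows. This is an illustration only, not the real Gluon class: each "embedding" is modeled as a plain dict mapping token to a list of floats, and `set_embedding` concatenates the per-token vectors from each embedding, padding missing tokens with zeros.

```python
class Vocab(object):
    """Minimal sketch of a vocabulary supporting set_embedding(*embeddings)."""

    def __init__(self, idx_to_token):
        self._idx_to_token = idx_to_token
        self.embedding = None

    def set_embedding(self, *embeddings):
        # Accept one or more embeddings as positional arguments (not a
        # list), per the review suggestion. Each embedding contributes a
        # fixed-width vector; tokens absent from an embedding get zeros.
        dims = [len(next(iter(e.values()))) for e in embeddings]
        self.embedding = {
            tok: sum((e.get(tok, [0.0] * d) for e, d in zip(embeddings, dims)), [])
            for tok in self._idx_to_token
        }

# Hypothetical pre-trained embeddings, modeled as dicts:
fasttext_emb = {'hello': [0.1, 0.2], 'world': [0.3, 0.4]}
glove_emb = {'hello': [1.0], 'hi': [2.0]}

vocab = Vocab(['hello', 'world'])
vocab.set_embedding(fasttext_emb, glove_emb)
print(vocab.embedding['hello'])  # [0.1, 0.2, 1.0]
print(vocab.embedding['world'])  # [0.3, 0.4, 0.0]
```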
frequent words 'world' and 'hello' are also indexed.
### Assign token embedding to vocabulary
Assign doesn't seem like the right verb. Maybe attach?
```python
@property
def reserved_tokens(self):
    return self._reserved_tokens
```
Should the reserved_tokens property always include unknown_token, given that they are all indexed first?
Resolved with a more detailed API specification.
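The indexing convention under discussion can be made concrete with a small sketch (assumed layout, not the exact Gluon implementation): the unknown token gets index 0, reserved tokens come next, then data tokens in descending frequency order. Keeping `reserved_tokens` separate from `unknown_token` then matters, because the property returns only the explicitly reserved tokens.

```python
from collections import Counter

def build_index(counter, unknown_token='<unk>', reserved_tokens=()):
    # Index unknown_token first, then reserved tokens, then the remaining
    # tokens sorted by descending frequency.
    idx_to_token = [unknown_token] + list(reserved_tokens)
    for token, _ in counter.most_common():
        if token != unknown_token and token not in reserved_tokens:
            idx_to_token.append(token)
    return idx_to_token

counter = Counter({'world': 3, 'hello': 2, 'hi': 1})
idx = build_index(counter, reserved_tokens=('<pad>', '<bos>'))
print(idx)  # ['<unk>', '<pad>', '<bos>', 'world', 'hello', 'hi']
```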
* [MXNET-67] Sync master with v1.1.0 branch (apache#10031)
  * [REVIEW REQUIRED] Revert PR apache#9484 & add additional dependency licenses to LICENSE file (apache#9701)
    * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (apache#9484)". This reverts commit 8930d96.
    * Some more LICENSE fixes
    * Adding some more packages to the LICENSE file
    * Adding dependencies of dependencies
  * update v1.1.0 change log to NEWS.md
  * sync README.md from v1.1.0 branch
  * revert to correct jenkins url in README
  * Parallelization for ROIpooling OP (apache#9958)
    * parallelization for roipooling
    * remove some useless computation
    * remove useless muls
    * add author and retriggering
    * retrigger again
  * comments to copy and copyto are corrected (apache#10040)
  * Bug Fix and performance optimized for rtc (apache#10018)
    1. The "super().__init__()" bug is fixed in Python 2.
    2. The kernel is initialized at operator init time.
    * Update custom_softmax_rtc.py: fix unnecessary format
* set embedding
* Code and test revised
* api implementation done
* license and news
* readme and cpp
* pylint disable
* Add API doc
* less pylint disable
* remove contrib
* move to gluon, revise api doc
* fix import order
* re-test
* relative imports
* re-run test
* revise implementation, test case, and api doc
* re-test
Description

Add vocabulary and embedding

Checklist

Essentials

* Passed code style checking (make lint)

Changes

Comments