Fix tokenization issue of CJK languages for evaluation#20
Fix tokenization issue of CJK languages for evaluation#20mgaido91 merged 16 commits intohlt-mt:mainfrom
Conversation
mgaido91
left a comment
There was a problem hiding this comment.
thank you @owaski for you contribution. I know we are not doing great on UTs at the moment, but if you could add one it would be great. If I manage to, I will try and provide a suggestion for it, but since you know better what is the expected output for character-level segmentation.
| if self.args.latency_unit == "char": | ||
| segmenter = CJSegmenter() | ||
| else: | ||
| segmenter = None |
There was a problem hiding this comment.
nit: shall we move it in the init and have it as self.segmenter? This way, if we have multiple score calls, we have only one instance created.
There was a problem hiding this comment.
That's a good idea. I'll do it.
|
The CI has been fixed in #22 . Please pull from the main branch next time you push so that the CI gets fixed here as well, thanks. |
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
|
Regading the UTs, I think it would be great to add a file like this in the UTs: Where we put in the TODO a simple example taken e.g. from past experiments on which we know the expected output latency. And ideally we should add a test case to |
|
Sure I can add a unit test for Chinese. |
mgaido91
left a comment
There was a problem hiding this comment.
Sure I can add a unit test for Chinese.
That would be awesome, thanks! Once we have those UTs, this LGTM, thanks!
unit test added. |
|
wait, I found an issue with quality scorer. the |
mgaido91
left a comment
There was a problem hiding this comment.
Can we add a UT to test the problem you mentioned with the quality segmenter? So that we are sure that future changes will not (re-)introduce it. something similar to the latency scorer UT but with the sacrebleu scorer. It would be gold to check that before your last fix the UT fails and after fixing the issue it passes. Then we can merge this, thanks!
Just added the unit test for both quality and latency tokenize function |
mgaido91
left a comment
There was a problem hiding this comment.
just few minor style comments, LGTM otherwise, thanks. Once these three small things are fixed I will merge it. Thanks.
style fixed! |
For CJK languages, we need to tokenize them with
CJSegmenterbefore sending them tomweralign.align_texts.This PR makes the following modifications:
CJSegmenterbefore callingmweralign.align_texts. This is done for both latency scorer and quality scorer.latency_unitargument to the quality scorer and use this argument to triggerCJSegmenterin the quality scorer.