Fix tokenization issue of CJK languages for evaluation #20

Merged
mgaido91 merged 16 commits into hlt-mt:main from owaski:main
Feb 27, 2026

@owaski owaski commented Feb 24, 2026

For CJK languages, we need to tokenize them with CJSegmenter before sending them to mweralign.align_texts.

This PR makes the following modifications:

  1. Apply CJSegmenter before calling mweralign.align_texts. This is done for both the latency scorer and the quality scorer.
  2. Add a latency_unit argument to the quality scorer and use it to trigger CJSegmenter there.

Contributor

@mgaido91 mgaido91 left a comment

Thank you @owaski for your contribution. I know we are not doing great on UTs at the moment, but if you could add one it would be great. If I manage to, I will try to provide a suggestion for it, but you know better what the expected output is for character-level segmentation.

Comment on lines 98 to 101
if self.args.latency_unit == "char":
    segmenter = CJSegmenter()
else:
    segmenter = None
Contributor

nit: shall we move it into __init__ and store it as self.segmenter? This way, only one instance is created even if score is called multiple times.

Contributor Author

That's a good idea. I'll do it.
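
The suggestion above can be sketched as follows. `SketchScorer` and `FakeSegmenter` are hypothetical names invented for this illustration (the real scorer classes and CJSegmenter are not reproduced here); the instance counter exists only to show that the segmenter is built once, no matter how often `score` is called.

```python
from argparse import Namespace
from typing import List, Optional


class FakeSegmenter:
    """Hypothetical stand-in for a segmenter that is costly to construct."""
    instances = 0

    def __init__(self) -> None:
        FakeSegmenter.instances += 1

    def __call__(self, text: str) -> str:
        # Toy behaviour: one token per non-space character.
        return " ".join(ch for ch in text if not ch.isspace())


class SketchScorer:
    """Hypothetical scorer that owns its segmenter instead of rebuilding it."""

    def __init__(self, args: Namespace) -> None:
        self.args = args
        # Created once here, rather than inside every score() call.
        self.segmenter: Optional[FakeSegmenter] = (
            FakeSegmenter() if args.latency_unit == "char" else None
        )

    def _prepare(self, text: str) -> str:
        return self.segmenter(text) if self.segmenter is not None else text

    def score(self, texts: List[str]) -> List[str]:
        return [self._prepare(text) for text in texts]
```

With `latency_unit="word"` the attribute is simply `None` and the text passes through unchanged, which mirrors the `else` branch of the snippet under review.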

@mgaido91
Contributor

The CI has been fixed in #22. Please pull from the main branch next time you push so that the CI gets fixed here as well, thanks.

owaski and others added 4 commits February 25, 2026 10:05
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
@mgaido91
Contributor

Regarding the UTs, I think it would be great to add a file like this in the UTs:

# Copyright 2026 FBK

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import unittest
from argparse import Namespace

from simulstream.metrics.readers import OutputWithDelays, ReferenceSentenceDefinition
from simulstream.metrics.scorers.latency import LatencyScoringSample
from simulstream.metrics.scorers.latency.stream_laal import StreamLaal


class StreamLaalTestCase(unittest.TestCase):
    def test_basic(self):
        reference = [
            ReferenceSentenceDefinition(
                "A New York, sono a capo di un'associazione no profit, chiamata Robin Hood.",
                12.61,
                4.07,
            ),
            ReferenceSentenceDefinition(
                "Quando non combatto la povertà, combatto gli incendi come assistente capitano di "
                "una brigata di pompieri volontari.",
                16.9,
                5.14,
            )
        ]
        hypothesis = OutputWithDelays(
            "Tornando a New York, sono il capo dello sviluppo per un non-profit chiamato Robin "
            "Hood. Quando non sto combattendo la povertà, sto combattendo i fuochi.",
            [14.0, 14.0, 14.0, 14.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 18.0,
             18.0, 18.0, 18.0, 18.0, 18.0, 18.0, 18.0, 20.0, 20.0, 20.0, 20.0],
            [18.22, 18.22, 18.22, 18.22, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93,
             19.93, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 27.30, 27.30, 27.30,
             27.30,]
        )
        scorer = StreamLaal(Namespace(latency_unit="word"))
        score = scorer.score([LatencyScoringSample("a", hypothesis, reference)])
        self.assertAlmostEqual(score.ideal_latency, 0.868587, 4)
        self.assertAlmostEqual(score.computational_aware_latency, 5.86, 4)
    
    def test_with_characters(self):
        ... # TODO


if __name__ == '__main__':
    unittest.main()

In place of the TODO, we can put a simple example, taken e.g. from past experiments for which we know the expected output latency. Ideally, we should also add a test case to uts/metrics/log_reader.py to enforce that it works as expected. I can also help with these things if you can give me a meaningful example, but it is probably faster if you just add it, since I have no CJK example at hand. Thanks.

@owaski
Contributor Author

owaski commented Feb 25, 2026

Sure I can add a unit test for Chinese.

Contributor

@mgaido91 mgaido91 left a comment

> Sure I can add a unit test for Chinese.

That would be awesome, thanks! Once we have those UTs, this LGTM, thanks!

@owaski
Contributor Author

owaski commented Feb 25, 2026

> > Sure I can add a unit test for Chinese.
>
> That would be awesome, thanks! Once we have those UTs, this LGTM, thanks!

unit test added.

@owaski
Contributor Author

owaski commented Feb 25, 2026

Wait, I found an issue with the quality scorer: the self._tokenize function modifies its input in place, which changes sample.reference. Let me work on a quick fix.
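
The kind of problem described here can be illustrated with a small sketch. `Sample`, `split_chars`, `tokenize_in_place`, and `tokenize_copy` are all hypothetical names made up for this example; the actual self._tokenize implementation and its fix are not shown in this conversation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical stand-in for a scoring sample; only `reference` matters here."""
    reference: List[str]


def split_chars(text: str) -> str:
    """Toy segmenter: drop whitespace, then put a space between all characters."""
    return " ".join("".join(text.split()))


def tokenize_in_place(sample: Sample, segment: Callable[[str], str]) -> List[str]:
    """Problematic pattern: overwrites the caller's reference lines."""
    for i, line in enumerate(sample.reference):
        sample.reference[i] = segment(line)
    return sample.reference


def tokenize_copy(sample: Sample, segment: Callable[[str], str]) -> List[str]:
    """Safer pattern: builds a new list and leaves the sample untouched."""
    return [segment(line) for line in sample.reference]


sample = Sample(reference=["你好世界"])
print(tokenize_copy(sample, split_chars))      # ['你 好 世 界']
print(sample.reference)                        # ['你好世界']
print(tokenize_in_place(sample, split_chars))  # ['你 好 世 界']
print(sample.reference)                        # ['你 好 世 界']
```

After the in-place variant runs, any later consumer of `sample.reference` sees the segmented text instead of the original, which is the side effect reported above.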

Contributor

@mgaido91 mgaido91 left a comment

Can we add a UT to test the problem you mentioned with the quality segmenter? So that we are sure that future changes will not (re-)introduce it. Something similar to the latency scorer UT but with the sacrebleu scorer. It would be great to check that before your last fix the UT fails and after fixing the issue it passes. Then we can merge this, thanks!
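
A regression test of the kind requested here could follow the pattern below: snapshot the input, call the function, and assert that the input still equals the snapshot. The `tokenize` function is a hypothetical stand-in defined only so the sketch is self-contained; a real UT would exercise the sacrebleu-based quality scorer instead.

```python
import copy
import unittest
from typing import List


def tokenize(lines: List[str]) -> List[str]:
    """Hypothetical function under test: character-level split returning a new list."""
    return [" ".join("".join(line.split())) for line in lines]


class TokenizeDoesNotMutateTestCase(unittest.TestCase):
    def test_reference_is_unchanged(self):
        reference = ["你好世界", "早上好"]
        snapshot = copy.deepcopy(reference)  # state before the call
        tokenized = tokenize(reference)
        self.assertEqual(tokenized, ["你 好 世 界", "早 上 好"])
        # Fails if tokenize ever starts writing back into its argument.
        self.assertEqual(reference, snapshot)
```

Running this with `python -m unittest` against an in-place implementation should fail on the last assertion, which is the "fails before the fix, passes after" check asked for above.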

@owaski
Contributor Author

owaski commented Feb 26, 2026

> Can we add a UT to test the problem you mentioned with the quality segmenter? So that we are sure that future changes will not (re-)introduce it. Something similar to the latency scorer UT but with the sacrebleu scorer. It would be great to check that before your last fix the UT fails and after fixing the issue it passes. Then we can merge this, thanks!

Just added unit tests for the tokenize functions of both the quality and latency scorers.

Contributor

@mgaido91 mgaido91 left a comment

Just a few minor style comments, LGTM otherwise, thanks. Once these three small things are fixed I will merge it. Thanks.

@owaski
Contributor Author

owaski commented Feb 26, 2026

> Just a few minor style comments, LGTM otherwise, thanks. Once these three small things are fixed I will merge it. Thanks.

style fixed!

@mgaido91 mgaido91 merged commit 93c51b4 into hlt-mt:main Feb 27, 2026
1 check passed