Separate progress for fine-tuning and inferencing #191

pmachapman · 2025-06-04T03:40:23Z

This change is

codecov-commenter · 2025-06-04T03:43:28Z

Codecov Report

Attention: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.94%. Comparing base (6a9fe26) to head (047b01d).

Files with missing lines	Patch %	Lines
machine/jobs/nmt_engine_build_job.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #191   +/-   ##
=======================================
  Coverage   88.93%   88.94%           
=======================================
  Files         282      282           
  Lines       17056    17062    +6     
=======================================
+ Hits        15169    15175    +6     
  Misses       1887     1887

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Enkidu93

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/utils/phased_progress_reporter.py line 15 at r1 (raw file):

    percentage: float = 0
    report_steps: bool = True
    stage: Optional[Literal["train", "inference"]] = None

Is this worth an Enum?

pmachapman

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @Enkidu93)

machine/utils/phased_progress_reporter.py line 15 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is this worth an Enum?

I've made it an enum in the Serval PR as well. My primary motivation was for consistency and to help strongly type the C# API.

ddaspit

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

machine/utils/phased_progress_reporter.py line 15 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

I've made it an enum in the Serval PR as well. My primary motivation was for consistency and to help strongly type the C# API.

This class is intended to be general-purpose, so I would just make this a string.

machine/jobs/build_nmt_engine.py line 47 at r1 (raw file):

                if status.phase_stage is not None:
                    if status.phase_step is not None:
                        task.get_logger().report_single_value(

It looks like there is already code in place to pass multiple progress-related properties through the task runtime properties. It is currently used for SMT and word alignment jobs. The update_runtime_properties function makes a call to the /tasks.edit endpoint on the ClearML REST API to update the runtime property on the task. I believe that this is how the Task.set_progress method works. We should use a consistent approach for all job types. Apart from the fact that ClearML passes progress information using the runtime properties, I feel like there was a reason why we couldn't use the logger to pass progress information, but I can't remember why.

Enkidu93

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @pmachapman)

machine/utils/phased_progress_reporter.py line 15 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class is intended to be general-purpose, so I would just make this a string.

I'm sorry - I meant defining an Enum class instead of just the optional literals. But sounds like Damien wants it to be generic 👍

pmachapman

Reviewable status: 6 of 7 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @Enkidu93)

machine/jobs/build_nmt_engine.py line 47 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It looks like there is already code in place to pass multiple progress-related properties through the task runtime properties. It is currently used for SMT and word alignment jobs. The update_runtime_properties function makes a call to the /tasks.edit endpoint on the ClearML REST API to update the runtime property on the task. I believe that this is how the Task.set_progress method works. We should use a consistent approach for all job types. Apart from the fact that ClearML passes progress information using the runtime properties, I feel like there was a reason why we couldn't use the logger to pass progress information, but I can't remember why.

From what I can tell update_runtime_properties only sets the overall progress, not progress for subsystems?

I don't think writing to the runtime properties is the correct place, as apart from the overall progress value, the only other values set their are related to the machine/task runtime and versions (https://github.com/search?q=repo%3Aclearml%2Fclearml%20_set_runtime_properties&type=code)

From what I can tell, logging scalars is the correct place for this information (https://clear.ml/docs/latest/docs/webapp/webapp_exp_track_visual/#scalars) and the report_single_value function I use backs onto the report_scalar function (https://github.com/clearml/clearml/blob/52e053818206c4df9a653ac0411c9bcfd504006a/clearml/logger.py#L196).

I notice that the logger's report_scalar code fires an event rather than an async task, so perhaps it will be slight less time-accurate (or even non-guaranteed delivery?) than writing it to the runtime (which I think is a locking operation on the task's runtime - https://github.com/clearml/clearml/blob/52e053818206c4df9a653ac0411c9bcfd504006a/clearml/backend_interface/task/task.py#L340), but for our purposes, this is OK?

machine/utils/phased_progress_reporter.py line 15 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm sorry - I meant defining an Enum class instead of just the optional literals. But sounds like Damien wants it to be generic 👍

Done. It is now a string.

Enkidu93

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

ddaspit

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)

machine/jobs/build_nmt_engine.py line 47 at r1 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

From what I can tell update_runtime_properties only sets the overall progress, not progress for subsystems?

I don't think writing to the runtime properties is the correct place, as apart from the overall progress value, the only other values set their are related to the machine/task runtime and versions (https://github.com/search?q=repo%3Aclearml%2Fclearml%20_set_runtime_properties&type=code)

From what I can tell, logging scalars is the correct place for this information (https://clear.ml/docs/latest/docs/webapp/webapp_exp_track_visual/#scalars) and the report_single_value function I use backs onto the report_scalar function (https://github.com/clearml/clearml/blob/52e053818206c4df9a653ac0411c9bcfd504006a/clearml/logger.py#L196).

I notice that the logger's report_scalar code fires an event rather than an async task, so perhaps it will be slight less time-accurate (or even non-guaranteed delivery?) than writing it to the runtime (which I think is a locking operation on the task's runtime - https://github.com/clearml/clearml/blob/52e053818206c4df9a653ac0411c9bcfd504006a/clearml/backend_interface/task/task.py#L340), but for our purposes, this is OK?

ClearML uses the runtime properties for progress. SMT and word alignment use it for extra progress fields (messages). The report_single_value function is used to report a single value, such as the overall BLEU score. It isn't really intended to be used for values that change throughout the lifetime of a job, such as progress, which is why ClearML uses the runtime properties for it. I also don't want to have multiple ways of passing progress information.

pmachapman

Reviewable status: 5 of 9 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @Enkidu93)

machine/jobs/build_nmt_engine.py line 47 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

ClearML uses the runtime properties for progress. SMT and word alignment use it for extra progress fields (messages). The report_single_value function is used to report a single value, such as the overall BLEU score. It isn't really intended to be used for values that change throughout the lifetime of a job, such as progress, which is why ClearML uses the runtime properties for it. I also don't want to have multiple ways of passing progress information.

Done. Thank you for your patience!

Enkidu93

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

ddaspit

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)

pmachapman

Reviewed all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)

pmachapman force-pushed the separate_progress branch from c6f5743 to cc8e549 Compare June 4, 2025 03:40

pmachapman requested a review from ddaspit June 4, 2025 03:41

Enkidu93 requested changes Jun 4, 2025

View reviewed changes

pmachapman commented Jun 4, 2025

View reviewed changes

ddaspit requested changes Jun 5, 2025

View reviewed changes

Enkidu93 reviewed Jun 6, 2025

View reviewed changes

pmachapman force-pushed the separate_progress branch 2 times, most recently from b8bcca6 to dface12 Compare June 9, 2025 01:38

pmachapman commented Jun 9, 2025

View reviewed changes

Enkidu93 approved these changes Jun 9, 2025

View reviewed changes

ddaspit reviewed Jun 9, 2025

View reviewed changes

pmachapman force-pushed the separate_progress branch 3 times, most recently from bbd9714 to a33d110 Compare June 11, 2025 02:42

pmachapman commented Jun 11, 2025

View reviewed changes

Enkidu93 reviewed Jun 11, 2025

View reviewed changes

ddaspit approved these changes Jun 19, 2025

View reviewed changes

ddaspit force-pushed the separate_progress branch from a33d110 to 835f894 Compare June 20, 2025 16:54

pmachapman added 4 commits June 20, 2025 20:13

Add progress reporting for individual phases

248ea44

Add progress reporting for SMT and word alignment builds

73195cb

Change phase stage to a general purpose string

a33ee2b

Report the phase in the runtime

047b01d

ddaspit force-pushed the separate_progress branch from 835f894 to 047b01d Compare June 21, 2025 02:13

pmachapman commented Jun 21, 2025

View reviewed changes

ddaspit merged commit cd21706 into main Jun 23, 2025
14 checks passed

ddaspit deleted the separate_progress branch June 23, 2025 18:29

Uh oh!

Separate progress for fine-tuning and inferencing #191

Separate progress for fine-tuning and inferencing #191

Uh oh!

Conversation

pmachapman commented Jun 4, 2025 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pmachapman commented Jun 4, 2025 •

edited by ddaspit

Loading

codecov-commenter commented Jun 4, 2025 •

edited

Loading