@TangChao729

Fix: Evaluation Coverage Inconsistency Across Batch Sizes

Problem

The current evaluate() function produces a validation loss that depends on the batch_size configuration, making model comparisons unfair. Runs with different batch sizes evaluate different amounts of validation data but use the same normalization denominator.

Root Cause

  • iter_full_split() creates non-overlapping windows of size batch_size × block_size + 1
  • Number of evaluation windows therefore varies with batch_size: floor((len(val_ids) - span) / span) + 1, where span = batch_size × block_size + 1
  • Loss calculation: sum(token_losses) / len(val_text) (characters)
  • Same character denominator, different token numerators → batch-size-dependent metrics

Example:

  • when len(val_ids) / (batch_size × block_size + 1) < 2: only 1 window is evaluated → artificially low loss
  • when len(val_ids) / (batch_size × block_size + 1) ≥ 2: 2 or more windows are evaluated → higher loss for the identical model (see the worked numbers below)
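
To make that concrete, here is a back-of-the-envelope calculation using the window-count formula from the Root Cause section, with purely hypothetical sizes (a 10,000-token validation split and block_size = 256; the real values depend on the training config):

```python
# Hypothetical sizes, purely for illustration.
val_len = 10_000        # tokens in the validation split
block_size = 256

for batch_size in (4, 16):
    span = batch_size * block_size + 1           # window size used by iter_full_split()
    num_windows = (val_len - span) // span + 1   # non-overlapping windows
    tokens_scored = num_windows * (span - 1)     # targets that contribute to the loss sum
    print(f"batch_size={batch_size}: windows={num_windows}, tokens scored={tokens_scored}")

# batch_size=4  -> 9 windows, 9216 tokens scored
# batch_size=16 -> 2 windows, 8192 tokens scored
# Both sums are divided by the same len(val_text), so the metric shifts with batch_size.
```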

Solution

Added a create_evaluation_functions() factory that provides:

  1. Original function (evaluate_char_normalized) - unchanged for compatibility
  2. Fixed function (evaluate_token_average) - consistent per-token normalization

Key fix: sum(token_losses) / total_tokens_evaluated instead of character count
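
A minimal sketch of what such a factory could look like, assuming a PyTorch model whose forward pass returns (logits, loss) and an iter_full_split(val_ids, batch_size, block_size) generator yielding (xb, yb) windows; apart from the identifiers quoted in this PR, the names and signatures here are illustrative, not the actual implementation:

```python
import torch
import torch.nn.functional as F


def create_evaluation_functions(model, iter_full_split, device="cpu"):
    """Return (evaluate_char_normalized, evaluate_token_average)."""

    @torch.no_grad()
    def evaluate_char_normalized(val_ids, val_text, batch_size, block_size):
        # Legacy behaviour: summed NLL divided by the character count of the split.
        model.eval()
        sum_nll = 0.0
        for xb, yb in iter_full_split(val_ids, batch_size, block_size):
            xb, yb = xb.to(device), yb.to(device)
            logits, _ = model(xb, yb)
            sum_nll += F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), yb.reshape(-1), reduction="sum"
            ).item()
        model.train()
        return sum_nll / max(1, len(val_text))

    @torch.no_grad()
    def evaluate_token_average(val_ids, batch_size, block_size):
        # Fixed behaviour: summed NLL divided by the number of tokens actually scored.
        model.eval()
        sum_nll, total_tokens = 0.0, 0
        for xb, yb in iter_full_split(val_ids, batch_size, block_size):
            xb, yb = xb.to(device), yb.to(device)
            logits, _ = model(xb, yb)
            sum_nll += F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), yb.reshape(-1), reduction="sum"
            ).item()
            total_tokens += yb.numel()  # tracks the actual evaluation coverage
        model.train()
        return sum_nll / max(1, total_tokens)

    return evaluate_char_normalized, evaluate_token_average
```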

Implementation

  • Tracks actual tokens evaluated with total_tokens += yb.numel()
  • Normalizes by token count: sum_nll / max(1, total_tokens)
  • Dual logging for side-by-side comparison (see the call-site sketch after this list)
  • Zero breaking changes - original behavior preserved
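
A possible call site for that dual logging, building on the factory sketch above; model, iter_full_split, val_ids, val_text, and the config values are assumed to come from the surrounding training script:

```python
# Hypothetical wiring inside the training loop (names other than those quoted
# in this PR are assumptions).
evaluate_char_normalized, evaluate_token_average = create_evaluation_functions(
    model, iter_full_split, device
)

char_loss = evaluate_char_normalized(val_ids, val_text, batch_size, block_size)
token_loss = evaluate_token_average(val_ids, batch_size, block_size)

# Dual logging: keep the legacy metric visible while migrating to the fair one.
print(f"val/char_normalized={char_loss:.4f}  val/token_average={token_loss:.4f}")
```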

Result

  • Fair model comparison regardless of batch_size
  • Consistent evaluation metrics across configurations
  • Easy migration path for maintainers
  • Backward compatibility maintained

Testing: Verified identical models produce consistent scores across different batch sizes with the fixed evaluation function.
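
An illustrative version of that check, built on the sketches above (the batch sizes and tolerance are arbitrary, not the repo's actual test):

```python
# Same model, same validation split, different batch sizes: the token-averaged
# losses should agree up to small differences in how much of the split the
# non-overlapping windows happen to cover.
loss_bs4 = evaluate_token_average(val_ids, batch_size=4, block_size=block_size)
loss_bs16 = evaluate_token_average(val_ids, batch_size=16, block_size=block_size)
assert abs(loss_bs4 - loss_bs16) < 1e-2, (loss_bs4, loss_bs16)
```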
