Fix/evaluation coverage consistency #2
Fix: Evaluation Coverage Inconsistency Across Batch Sizes
Problem
The current `evaluate()` function produces a validation loss that depends on the `batch_size` configuration, making model comparisons unfair. Models with different batch sizes evaluate different amounts of validation data but use the same normalization denominator.
Root Cause
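- `iter_full_split()` creates non-overlapping windows of size `span = batch_size × block_size + 1`.
- The number of windows evaluated is `floor((len(val_ids) - span) / span) + 1`.
- The loss is normalized as `sum(token_losses) / len(val_text)` (characters), regardless of how many windows were evaluated.
Example: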
- When `len(val_ids) / (batch_size × block_size + 1) < 2`: 1 window → artificially low loss
- When `len(val_ids) / (batch_size × block_size + 1) ≥ 2`: two or more windows → higher loss for an identical model
Solution
Added a `create_evaluation_functions()` factory that provides:
- `evaluate_char_normalized`: unchanged, kept for compatibility
- `evaluate_token_average`: consistent per-token normalization
Key fix:
`sum(token_losses) / total_tokens_evaluated` instead of the character count
Implementation
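The per-token normalization can be sketched in a few lines. This is a minimal pure-Python illustration, not the code from this PR: `token_average_loss` and `batches` are hypothetical names, each element of `batches` stands in for the per-token losses of one evaluation window, and `len(token_losses)` plays the role of `yb.numel()`.

```python
from typing import Iterable, Sequence


def token_average_loss(batches: Iterable[Sequence[float]]) -> float:
    """Return the average negative log-likelihood per evaluated token.

    Hypothetical stand-in: each item in `batches` is the list of
    per-token losses produced for one evaluation window.
    """
    sum_nll = 0.0
    total_tokens = 0
    for token_losses in batches:
        sum_nll += sum(token_losses)
        # Count the tokens actually evaluated (the role of yb.numel()).
        total_tokens += len(token_losses)
    # max(1, ...) avoids division by zero when no window was evaluated.
    return sum_nll / max(1, total_tokens)


# The same per-token loss gives the same score for one or three windows.
one_window = token_average_loss([[2.0] * 8])
three_windows = token_average_loss([[2.0] * 8] * 3)
empty_split = token_average_loss([])
```

Because the denominator grows with the number of tokens evaluated, the score no longer depends on how many windows the split happened to yield.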
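- `total_tokens += yb.numel()` accumulates the number of target tokens actually evaluated.
- `sum_nll / max(1, total_tokens)` is returned as the loss; `max(1, ...)` guards against division by zero.
Result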
Testing: Verified identical models produce consistent scores across different batch sizes with the fixed evaluation function.
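The batch-size dependence described under Root Cause can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the actual test: `num_windows` and `simulate` are hypothetical helpers, every token is given the same loss, and each window of `batch_size × block_size + 1` tokens is assumed to yield `batch_size × block_size` targets (inputs shifted by one).

```python
def num_windows(n_tokens: int, batch_size: int, block_size: int) -> int:
    """Number of non-overlapping windows, per the formula in Root Cause."""
    span = batch_size * block_size + 1
    if n_tokens < span:
        return 0
    return (n_tokens - span) // span + 1


def simulate(n_tokens, n_chars, batch_size, block_size, loss_per_token=2.0):
    """Return (windows, char_normalized, token_average) for a constant-loss model."""
    windows = num_windows(n_tokens, batch_size, block_size)
    # Assumption: each window contributes batch_size * block_size targets.
    tokens_evaluated = windows * batch_size * block_size
    sum_nll = loss_per_token * tokens_evaluated
    char_normalized = sum_nll / n_chars                 # legacy normalization
    token_average = sum_nll / max(1, tokens_evaluated)  # per-token normalization
    return windows, char_normalized, token_average


# Hypothetical split: 1000 tokens, 4000 characters, block_size = 8.
small = simulate(1000, 4000, batch_size=4, block_size=8)
large = simulate(1000, 4000, batch_size=64, block_size=8)
print(small)
print(large)
```

With these illustrative numbers, the larger batch size fits only one window into the split, so the character-normalized loss drops even though the per-token loss is identical, while the token-averaged loss stays the same for both configurations.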