Description
Problem
PR #34 (refactor: generalize dataset indexing from language-based to dataset_id-based) introduces API-breaking changes to the task interface and results structure, but the documentation in README.md and CONTRIBUTING.md still references the old API. Once #34 is merged, new contributors following the docs will write code against a stale interface (`load_monolingual_data`, `lang_datasets`, `language_results`, etc.) that no longer exists.
Proposal
Type:

- New Ontology (data source for multiple tasks)
- New Task(s)
- New Model(s)
- New Metric(s)
- Other
Area(s) of code: `README.md`, `CONTRIBUTING.md`, `examples/custom_task_example.py`
Update all documentation and examples to reflect the new dataset_id-based API introduced in #34. Specifically:
README.md
- Checkpointing section (line ~115): change "saves result checkpoints after each task completion in a specific language" to reflect that checkpointing is now per-dataset (`dataset_id`), not per-language.
- Metrics & Aggregation section (lines ~174–181):
  - Step 1 currently says "Macro-average languages per task"; update it to reflect the new dataset-based aggregation.
  - Document the new `aggregation_mode` parameter and the three supported modes: `monolingual_only` (default), `crosslingual_group_input_languages`, `crosslingual_group_output_languages`.
  - Note that `mean_per_language` behavior now depends on the chosen aggregation mode.
- Results structure (line ~164): the `checkpoint.json` description should mention `datasetid_results` instead of implying language-keyed results.
CONTRIBUTING.md
- "Adding a New Task", Step 2 code example (lines ~138–206):
  - Rename `load_monolingual_data(self, split, language)` → `load_dataset(self, dataset_id, split)`.
  - Update the `RankingDataset` construction accordingly.
  - Add guidance on the new optional override methods: `languages_to_dataset_ids()` and `get_dataset_language()` (with the `input_language`/`output_language` distinction).
  - Briefly explain when a task author would need to override these (multiple datasets per language, cross-lingual, or multilingual tasks).
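To make the Step 2 rename concrete, here is a minimal sketch of the dataset_id-based task surface. The class name, method bodies, and type choices are illustrative assumptions, not the repo's actual base class; only the method names and signatures come from #34's rename list.

```python
class MyRankingTask:
    """Illustrative sketch (not the repo's real base class) of the #34 API."""

    def __init__(self):
        # Datasets are now keyed by dataset_id (str), not by a Language enum.
        self.datasets: dict[str, list] = {}

    def load_dataset(self, dataset_id: str, split: str) -> list:
        # Replaces load_monolingual_data(self, split, language).
        # For a simple monolingual task, dataset_id is just the language code.
        return [f"{dataset_id}-{split}-example"]  # placeholder data

    def languages_to_dataset_ids(self) -> dict[str, list[str]]:
        # Optional override: map each language to the dataset_ids it appears in.
        # Trivial 1:1 mapping shown here; multi-dataset tasks would return more.
        return {"en": ["en"], "de": ["de"]}

    def get_dataset_language(self, dataset_id: str) -> tuple[str, str]:
        # Optional override: returns (input_language, output_language).
        # Identical for monolingual datasets, different for cross-lingual ones.
        return (dataset_id, dataset_id)
```

A monolingual task gets all of this for free via defaults; only multi-dataset, cross-lingual, or multilingual tasks need the two overrides.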
- "Adding a New Task", Step 4 test example (line ~234):
  - Change `task.lang_datasets[Language.EN]` → `task.datasets["en"]` (or `task.datasets[Language.EN.value]` with a named variable for clarity, consistent with the review feedback on #34).
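The Step 4 access change can be sketched as follows. The `Language` enum here is a stand-in for the repo's own enum, and `task_datasets` stands in for a real task's `datasets` attribute; only the before/after access pattern is from #34.

```python
from enum import Enum


class Language(Enum):  # stand-in for the repo's Language enum
    EN = "en"


# Suppose a task exposes its datasets keyed by dataset_id strings after #34:
task_datasets = {"en": ["doc1", "doc2"]}

# Before: task.lang_datasets[Language.EN]
# After:  task.datasets["en"], or with a named variable for clarity:
dataset_id = Language.EN.value
en_dataset = task_datasets[dataset_id]
```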
- "Adding a New Task", general: add a note or subsection explaining the difference between monolingual, cross-lingual, and multilingual dataset scenarios and how the new `dataset_id` system handles them.
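One way the proposed note could illustrate the scenarios: encode both languages in the `dataset_id` for cross-lingual datasets and a single code for monolingual ones. The id format below is purely illustrative; the repo may pick a different convention. The `(input_language, output_language)` return shape is from #34.

```python
# Monolingual: one dataset per language; dataset_id == language code.
monolingual_ids = ["en", "de"]

# Cross-lingual: input and output languages differ; an id might encode both.
crosslingual_ids = ["en-de", "de-en"]  # illustrative naming only


def get_dataset_language(dataset_id: str) -> tuple[str, str]:
    # (input_language, output_language) per #34's interface.
    # A monolingual id ("en") yields the same language twice.
    parts = dataset_id.split("-")
    return (parts[0], parts[-1])
```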
examples/custom_task_example.py
- `load_monolingual_data` method (line 81): rename to `load_dataset` with the new `(self, dataset_id, split)` signature. This file is referenced by both README and CONTRIBUTING as the canonical example.
Additional Context
- PR #34: refactor: generalize dataset indexing from language-based to dataset_id-based
- Issue #33: [FEATURE] Generalize Dataset Indexing Within Tasks (motivating feature request)
- Key renames introduced by #34:
| Before (current docs) | After (#34) |
| --- | --- |
| `lang_datasets: dict[Language, Dataset]` | `datasets: dict[str, Dataset]` |
| `load_monolingual_data(language, split)` | `load_dataset(dataset_id, split)` |
| `language_results` | `datasetid_results` |
| (no aggregation mode) | `aggregation_mode` enum: `monolingual_only`, `crosslingual_group_input_languages`, `crosslingual_group_output_languages` |
| (single language per dataset) | `get_dataset_language(dataset_id)` returns `(input_language, output_language)` |
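For the docs, it may help to show roughly how the three `aggregation_mode` values group datasets. The grouping logic below is an assumption inferred from the mode names, not taken from the PR; the enum value strings are from #34.

```python
from enum import Enum


class AggregationMode(Enum):  # value strings from #34; this class is illustrative
    MONOLINGUAL_ONLY = "monolingual_only"
    CROSSLINGUAL_GROUP_INPUT_LANGUAGES = "crosslingual_group_input_languages"
    CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES = "crosslingual_group_output_languages"


def group_key(langs: tuple[str, str], mode: AggregationMode):
    """Return the per-language bucket a dataset's score falls into, or None."""
    input_lang, output_lang = langs
    if mode is AggregationMode.MONOLINGUAL_ONLY:
        # Only monolingual datasets contribute; cross-lingual ones are skipped.
        return input_lang if input_lang == output_lang else None
    if mode is AggregationMode.CROSSLINGUAL_GROUP_INPUT_LANGUAGES:
        return input_lang
    return output_lang
```

Under this reading, `mean_per_language` averages within whichever buckets the chosen mode produces, which is why its meaning now depends on `aggregation_mode`.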
Implementation
- [x] I plan to implement this in a PR
- [ ] I am proposing the idea and would like someone else to pick it up