Now that juror data is coming in (!!) we need to figure out how to split it into datasets for training, testing, and final validation.
The goal is to split the jury data into three equal-sized datasets: training, testing (public leaderboard), and validation (private leaderboard).
In the final week of the contest, the testing dataset and public leaderboard will be removed, with half the test dataset being made available as additional training data, and the other half rolled into the validation dataset. The goal here is to make greater use of limited data, making more available for both training and final validation.
High-level approach
Some considerations:
- Do we treat the individual judgments as iid and thus divide them at the individual judgment level?
- Or, do we treat projects as a whole as the unit for modeling, and thus stratify the datasets at the project level?
After some discussion with @devansh76, @daviddao, and others, we are leaning towards per-project stratification. The argument is that if a project appears in both the training and test data, it creates "mutual information" between the two, correlating the datasets and limiting our ability to accurately evaluate the models. As such, we should ensure that the datasets are stratified at the project level.
This raises a technical consideration: if a judgment pair includes one project in the "training" dataset and one project in the "test" dataset, to which dataset is the judgment assigned? The answer is to establish a hierarchy of datasets (public test > private test > training) and assign each judgment to the highest-ranking dataset containing either of its repos, as in the following example (and the sketch after it):
- repo a - training
- repo b - public test
- repo c - private test
- repo d - training

- repo a vs repo d = goes in training
- repo a vs repo b = goes in public test, until a week before the deadline, when it is transferred to training
- repo a vs repo c = goes in private test
- repo b vs repo c = goes in public test, until a week before the deadline, when it is transferred to private test
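To make the rule concrete, here is a minimal Python sketch of the hierarchy and the final-week reassignment. The names (`Dataset`, `assign_pair`, `final_week`) are hypothetical, not taken from any existing code:

```python
from enum import IntEnum

class Dataset(IntEnum):
    # Ordered by priority: when a pair spans two datasets, the
    # higher value wins (public test > private test > training).
    TRAINING = 0
    PRIVATE_TEST = 1
    PUBLIC_TEST = 2

def assign_pair(a: Dataset, b: Dataset, final_week: bool = False) -> Dataset:
    """Assign a judgment pair to a dataset, given its two repos' datasets."""
    dataset = max(a, b)
    if final_week and dataset is Dataset.PUBLIC_TEST:
        # The public leaderboard is retired: a pair touching a
        # private-test repo moves to private test; the rest become
        # additional training data.
        if Dataset.PRIVATE_TEST in (a, b):
            return Dataset.PRIVATE_TEST
        return Dataset.TRAINING
    return dataset

# Reproduces the example above (a, d = training; b = public; c = private):
assert assign_pair(Dataset.TRAINING, Dataset.TRAINING) is Dataset.TRAINING
assert assign_pair(Dataset.TRAINING, Dataset.PUBLIC_TEST) is Dataset.PUBLIC_TEST
assert assign_pair(Dataset.TRAINING, Dataset.PRIVATE_TEST) is Dataset.PRIVATE_TEST
assert assign_pair(Dataset.PUBLIC_TEST, Dataset.PRIVATE_TEST) is Dataset.PUBLIC_TEST
# ...and the final-week transfers:
assert assign_pair(Dataset.TRAINING, Dataset.PUBLIC_TEST, final_week=True) is Dataset.TRAINING
assert assign_pair(Dataset.PUBLIC_TEST, Dataset.PRIVATE_TEST, final_week=True) is Dataset.PRIVATE_TEST
```

Note that public test outranks private test during the contest: once a repo's judgments have appeared on the public leaderboard, none of its pairs should leak into the private set.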
Specific implementation
Given a CSV of judgment data, we should stratify the data by project, using each project's numerical index: sort each project into one of the three dataset categories with a simple ix % 3 operation. Then we iterate over each judgment, assign it to a dataset per the hierarchy above, and export the results into three new .csv files. All of this can be done straightforwardly with a Python script.
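A rough sketch of that script, assuming a hypothetical input file judgments.csv with columns project_a_index and project_b_index (the real file and column names will differ):

```python
import csv

DATASETS = ["training", "public_test", "private_test"]
# Higher number = higher priority when a judgment's two projects
# fall into different datasets (public test > private test > training).
PRIORITY = {"training": 0, "private_test": 1, "public_test": 2}

def project_dataset(project_index: int) -> str:
    # Project-level stratification: the project's numerical index
    # alone determines its dataset, via ix % 3.
    return DATASETS[project_index % 3]

with open("judgments.csv", newline="") as infile:
    judgments = list(csv.DictReader(infile))

# One output CSV per dataset, with the same columns as the input.
outfiles = {name: open(f"{name}.csv", "w", newline="") for name in DATASETS}
writers = {}
for name, f in outfiles.items():
    writers[name] = csv.DictWriter(f, fieldnames=judgments[0].keys())
    writers[name].writeheader()

for row in judgments:
    a = project_dataset(int(row["project_a_index"]))
    b = project_dataset(int(row["project_b_index"]))
    # Assign the judgment to the higher-priority of its projects' datasets.
    writers[max((a, b), key=PRIORITY.get)].writerow(row)

for f in outfiles.values():
    f.close()
```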
An open question is whether the sorting of the projects into categories should be made public, or somehow obscured. Making it public would be simpler, as the entire script could be committed to the repository.
Would welcome discussion / questions about this proposal.