Now that juror data is coming in (!!) we need to figure out how to split it into datasets for training, testing, and final validation.
The goal is to split the jury data into three equal-sized datasets: training, testing (public leaderboard), and validation (private leaderboard).
In the final week of the contest, the testing dataset and public leaderboard will be removed, with half the test dataset being made available as additional training data, and the other half rolled into the validation dataset. The goal here is to make greater use of limited data, making more available for both training and final validation.
High-level approach
Some considerations:
- Do we treat the individual judgments as iid and thus divide them at the individual judgment level?
- Or, do we treat projects as a whole as the unit for modeling, and thus stratify the datasets at the project level?
After some discussion with @devansh76, @daviddao, and others, we are leaning towards per-project stratification. The argument is that if a project appears in both the training and test data, it creates "mutual information" between the two, correlating the datasets and limiting our ability to accurately evaluate the models. As such, we should ensure that the datasets are stratified at the project level.
This raises a technical consideration: if a judgment pair includes one project in the "training" dataset and one project in the "test" dataset, to which dataset is the judgment assigned? The answer is to establish a hierarchy of datasets (public test > private test > training) and assign each judgment to the highest-ranking dataset containing either of its repos, as in the following example (and the sketch after it):
- repo a - training
- repo b - public test
- repo c - private test
- repo d - training

- repo a vs repo d = goes in training
- repo a vs repo b = goes in public test, until a week before the deadline, when it is transferred to training
- repo a vs repo c = goes in private test
- repo b vs repo c = goes in public test, until a week before the deadline, when it is transferred to private test
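To make the rule concrete, here is a minimal Python sketch of the hierarchy and the final-week reassignment. The names (`Dataset`, `assign_pair`, `final_week`) are hypothetical, not taken from any existing code:

```python
from enum import IntEnum

class Dataset(IntEnum):
    # Ordered by priority: when a pair spans two datasets, the
    # higher value wins (public test > private test > training).
    TRAINING = 0
    PRIVATE_TEST = 1
    PUBLIC_TEST = 2

def assign_pair(a: Dataset, b: Dataset, final_week: bool = False) -> Dataset:
    """Assign a judgment pair to a dataset, given its two repos' datasets."""
    dataset = max(a, b)
    if final_week and dataset is Dataset.PUBLIC_TEST:
        # The public leaderboard is retired: a pair touching a
        # private-test repo moves to private test; the rest become
        # additional training data.
        if Dataset.PRIVATE_TEST in (a, b):
            return Dataset.PRIVATE_TEST
        return Dataset.TRAINING
    return dataset

# Reproduces the example above (a, d = training; b = public; c = private):
assert assign_pair(Dataset.TRAINING, Dataset.TRAINING) is Dataset.TRAINING
assert assign_pair(Dataset.TRAINING, Dataset.PUBLIC_TEST) is Dataset.PUBLIC_TEST
assert assign_pair(Dataset.TRAINING, Dataset.PRIVATE_TEST) is Dataset.PRIVATE_TEST
assert assign_pair(Dataset.PUBLIC_TEST, Dataset.PRIVATE_TEST) is Dataset.PUBLIC_TEST
# ...and the final-week transfers:
assert assign_pair(Dataset.TRAINING, Dataset.PUBLIC_TEST, final_week=True) is Dataset.TRAINING
assert assign_pair(Dataset.PUBLIC_TEST, Dataset.PRIVATE_TEST, final_week=True) is Dataset.PRIVATE_TEST
```

Note that public test outranks private test during the contest: once a repo's judgments have appeared on the public leaderboard, none of its pairs should leak into the private set.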
Specific implementation
Given a CSV of judgment data, we should stratify the data by project, using each project's numerical index: sort each project into one of the three dataset categories with a simple ix % 3 operation. Then we iterate over each judgment, assign it to a dataset per the hierarchy above, and export the results into three new .csv files. All of this can be done straightforwardly with a Python script.
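A rough sketch of that script, assuming a hypothetical input file judgments.csv with columns project_a_index and project_b_index (the real file and column names will differ):

```python
import csv

DATASETS = ["training", "public_test", "private_test"]
# Higher number = higher priority when a judgment's two projects
# fall into different datasets (public test > private test > training).
PRIORITY = {"training": 0, "private_test": 1, "public_test": 2}

def project_dataset(project_index: int) -> str:
    # Project-level stratification: the project's numerical index
    # alone determines its dataset, via ix % 3.
    return DATASETS[project_index % 3]

with open("judgments.csv", newline="") as infile:
    judgments = list(csv.DictReader(infile))

# One output CSV per dataset, with the same columns as the input.
outfiles = {name: open(f"{name}.csv", "w", newline="") for name in DATASETS}
writers = {}
for name, f in outfiles.items():
    writers[name] = csv.DictWriter(f, fieldnames=judgments[0].keys())
    writers[name].writeheader()

for row in judgments:
    a = project_dataset(int(row["project_a_index"]))
    b = project_dataset(int(row["project_b_index"]))
    # Assign the judgment to the higher-priority of its projects' datasets.
    writers[max((a, b), key=PRIORITY.get)].writerow(row)

for f in outfiles.values():
    f.close()
```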
An open question is whether the sorting of the projects into categories should be made public, or somehow obscured. Making it public would be simpler, as the entire script could be committed to the repository.
Would welcome discussion / questions about this proposal.