Creating a training set for ML competitions

# OSO Funding-Based Training Dataset

This dataset contains pairwise comparisons between open source projects, where the weights are derived from their relative funding amounts. Here's how it was created:

## Data Sources
- Project funding data from OSO's BigQuery tables (`oss_funding_v0`)
- Repository information from OSO's BigQuery tables (`repositories_v0`)
- Dependency graph containing repository URLs (`unweighted_graph.json`)

## Process
1. **Graph Loading**:
   - Loaded dependency graph from JSON
   - Extracted all repository URLs for analysis

2. **Funding Data Collection**: For each project, we collected:
   - Quarterly funding amounts
   - Funder and grant pool information
   - Project names and IDs
   - Associated GitHub repository URLs

3. **Comparison Generation**: For each funding round (defined by funder + quarter):
   - Found all projects that received funding (minimum 2 projects per round)
   - Generated all possible pairs using `itertools.combinations`
   - Calculated relative weights:
   ```python
   weight_a = amount_a / (amount_a + amount_b)
   weight_b = 1 - weight_a  # Ensures weights sum to 1.0
   ```

4. **Deduplication**:
   - Project pairs are stored consistently (alphabetically ordered URLs)
   - When the same pair appears in multiple rounds, weights are averaged
   - Averaged weights still sum to 1.0, because each per-round pair of weights does
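
The comparison and deduplication steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `rounds` mapping, the `build_comparisons` function name, and the sample URLs and funder names are hypothetical, and it assumes funding has already been summed per repository within each round.

```python
from collections import defaultdict
from itertools import combinations


def build_comparisons(rounds):
    """Return {(url_a, url_b): (weight_a, weight_b)}, averaged across rounds.

    `rounds` maps a (funder, quarter) key to {repo_url: amount_usd}.
    """
    per_pair = defaultdict(list)
    for funded in rounds.values():
        if len(funded) < 2:
            continue  # a round needs at least 2 projects to yield a pair
        # Sorting the URLs first means every pair comes out alphabetically ordered.
        for url_a, url_b in combinations(sorted(funded), 2):
            total = funded[url_a] + funded[url_b]
            if total <= 0:
                continue  # assumption: skip pairs with no funding to compare
            per_pair[(url_a, url_b)].append(funded[url_a] / total)

    averaged = {}
    for pair, weights in per_pair.items():
        weight_a = sum(weights) / len(weights)
        averaged[pair] = (weight_a, 1 - weight_a)  # the two weights sum to 1.0
    return averaged


rounds = {
    ("funder-x", "2023-Q1"): {
        "https://github.com/org/a": 75.0,
        "https://github.com/org/b": 25.0,
    },
    ("funder-y", "2023-Q2"): {
        "https://github.com/org/b": 50.0,
        "https://github.com/org/a": 50.0,
    },
    ("funder-z", "2023-Q3"): {"https://github.com/org/a": 10.0},  # single project
}
comparisons = build_comparisons(rounds)
```

Here the pair appears in two rounds with `weight_a` values of 0.75 and 0.5, so the averaged record carries 0.625 and 0.375. The third round contributes nothing because it funds only one project.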

## Output Files
The process generates several CSV files:
1. `funding-data.csv`: Raw funding data
2. `training-data-preagg.csv`: All pairwise comparisons before deduplication
3. `training-data.csv`: Final deduplicated pairwise comparisons (the file used for the competition)
4. `training-data-by-dependent-node.csv`: Filtered comparisons for projects sharing dependencies

## Data Format
The final deduplicated CSV contains:
- `project_a`: GitHub repository URL
- `project_b`: GitHub repository URL
- `weight_a`: Average relative funding weight for project_a
- `weight_b`: Average relative funding weight for project_b
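
A quick sanity check on a file in this format might look like the sketch below. The `find_invalid_rows` helper and the inline sample rows are hypothetical, and the check that `project_a` sorts before `project_b` assumes the alphabetical ordering from the deduplication step is a plain string comparison.

```python
import csv
import io
import math

EXPECTED_COLUMNS = {"project_a", "project_b", "weight_a", "weight_b"}


def find_invalid_rows(csv_file):
    """Return the 1-based data-row numbers that violate the documented format."""
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    bad = []
    for number, row in enumerate(reader, start=1):
        weight_a = float(row["weight_a"])
        weight_b = float(row["weight_b"])
        ok = (
            0.0 <= weight_a <= 1.0
            and math.isclose(weight_a + weight_b, 1.0, abs_tol=1e-9)
            and row["project_a"] < row["project_b"]  # assumed ordering rule
        )
        if not ok:
            bad.append(number)
    return bad


# Inline sample standing in for training-data.csv; the second row is deliberately wrong.
sample = io.StringIO(
    "project_a,project_b,weight_a,weight_b\n"
    "https://github.com/org/a,https://github.com/org/b,0.75,0.25\n"
    "https://github.com/org/c,https://github.com/org/d,0.6,0.6\n"
)
invalid = find_invalid_rows(sample)
```

To run it against the real file, pass an open file handle instead of the `io.StringIO` sample.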

## Key Assumptions
- Projects are identified primarily by their GitHub repository URLs
- Only rounds with 2+ projects generate comparisons
- All funding amounts are in USD
- No time-based weighting within quarters
- For projects with multiple repositories, we use the one with the most stars
- Weights are relative within each funding round before averaging
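
The most-stars rule could be applied roughly as in the sketch below. The record fields `project_id`, `url`, and `star_count` are assumed names for illustration and may not match the actual `repositories_v0` schema. The source does not say how ties are broken, so this sketch simply keeps the first repository encountered.

```python
def pick_primary_repos(repos):
    """Map each project_id to the URL of its most-starred repository."""
    best = {}
    for repo in repos:
        current = best.get(repo["project_id"])
        # Strict '>' keeps the first repository seen when star counts tie.
        if current is None or repo["star_count"] > current["star_count"]:
            best[repo["project_id"]] = repo
    return {project_id: repo["url"] for project_id, repo in best.items()}


# Hypothetical records standing in for rows from the repositories table.
repos = [
    {"project_id": "p1", "url": "https://github.com/org/a-docs", "star_count": 12},
    {"project_id": "p1", "url": "https://github.com/org/a", "star_count": 340},
    {"project_id": "p2", "url": "https://github.com/org/b", "star_count": 5},
]
primary = pick_primary_repos(repos)
```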

## Example
If Project A received $75 and Project B received $25 in a funding round:
```python
{
    "project_a": "https://github.com/projectA",
    "project_b": "https://github.com/projectB",
    "weight_a": 0.75,  # (75/100)
    "weight_b": 0.25   # (25/100)
}
```

Note: The notebook also includes functionality to filter comparisons based on shared dependencies in the graph, available in the `training-data-by-dependent-node.csv` output.
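
One plausible reading of this filter is to keep only pairs where both repositories are dependencies of at least one common node. The sketch below works under that assumption. It also assumes `unweighted_graph.json` uses a node-link layout with a `links` list whose `source` depends on `target`. Neither the layout nor the function names are confirmed by the source.

```python
import json
from collections import defaultdict


def dependents_by_repo(graph):
    """Map each repository URL to the set of nodes that depend on it."""
    dependents = defaultdict(set)
    for link in graph["links"]:
        dependents[link["target"]].add(link["source"])
    return dependents


def filter_by_shared_dependent(pairs, graph):
    """Keep pairs whose two repositories share at least one dependent node."""
    dependents = dependents_by_repo(graph)
    return [
        (a, b)
        for a, b in pairs
        if dependents.get(a, set()) & dependents.get(b, set())
    ]


# Hypothetical graph: app-1 depends on lib-1 and lib-2, app-2 depends on lib-3.
graph = json.loads("""
{
  "links": [
    {"source": "https://github.com/org/app-1", "target": "https://github.com/org/lib-1"},
    {"source": "https://github.com/org/app-1", "target": "https://github.com/org/lib-2"},
    {"source": "https://github.com/org/app-2", "target": "https://github.com/org/lib-3"}
  ]
}
""")
pairs = [
    ("https://github.com/org/lib-1", "https://github.com/org/lib-2"),
    ("https://github.com/org/lib-1", "https://github.com/org/lib-3"),
]
kept = filter_by_shared_dependent(pairs, graph)
```

Only the first pair survives, because `lib-1` and `lib-3` have no dependent in common.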
