Creating a training set for ML competitions #13

@ccerv1

OSO Funding-Based Training Dataset

This dataset contains pairwise comparisons between open source projects. Each pair carries two weights derived from the projects' relative funding amounts within a funding round. Here's how it was created:

Data Sources

  • Project funding data from OSO's BigQuery tables (oss_funding_v0)
  • Repository information from OSO's BigQuery tables (repositories_v0)
  • Dependency graph containing repository URLs (unweighted_graph.json)

Process

  1. Graph Loading:

    • Loaded dependency graph from JSON
    • Extracted all repository URLs for analysis
  2. Funding Data Collection: For each project, we collected:

    • Quarterly funding amounts
    • Funder and grant pool information
    • Project names and IDs
    • Associated GitHub repository URLs
  3. Comparison Generation: For each funding round (defined by funder + quarter):

    • Found all projects that received funding (minimum 2 projects per round)
    • Generated all possible pairs using itertools.combinations
    • Calculated relative weights:
    weight_a = amount_a / (amount_a + amount_b)
    weight_b = 1 - weight_a  # Ensures weights sum to 1.0
  4. Deduplication:

    • Project pairs are stored consistently (alphabetically ordered URLs)
    • When the same pair appears in multiple rounds, weights are averaged
    • Averaged weights still sum to 1.0, because the two weights in every round sum to 1.0
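
The comparison and deduplication steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the record layout, the sample URLs and the function name `pairwise_weights` are hypothetical, and summing multiple grants to the same project within a round is an assumption.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical funding records: (funder, quarter, repo_url, amount_usd).
funding = [
    ("funder-x", "2024-Q1", "https://github.com/org/b", 25.0),
    ("funder-x", "2024-Q1", "https://github.com/org/a", 75.0),
    ("funder-y", "2024-Q2", "https://github.com/org/a", 50.0),
    ("funder-y", "2024-Q2", "https://github.com/org/b", 50.0),
    ("funder-z", "2024-Q2", "https://github.com/org/c", 10.0),  # lone project: no pairs
]


def pairwise_weights(records):
    """Return {(url_a, url_b): (weight_a, weight_b)} averaged across rounds."""
    # Group amounts by funding round, defined as (funder, quarter).
    # Assumption: several grants to one project in a round are summed.
    rounds = defaultdict(dict)
    for funder, quarter, url, amount in records:
        per_round = rounds[(funder, quarter)]
        per_round[url] = per_round.get(url, 0.0) + amount

    # Collect weight_a for every pair, keyed by alphabetically ordered URLs.
    samples = defaultdict(list)
    for amounts in rounds.values():
        if len(amounts) < 2:
            continue  # a round needs at least two projects
        for url_a, url_b in combinations(sorted(amounts), 2):
            total = amounts[url_a] + amounts[url_b]
            if total <= 0:
                continue  # ratio is undefined, skip the pair
            samples[(url_a, url_b)].append(amounts[url_a] / total)

    # Average weight_a per pair; weight_b follows so the pair sums to 1.0.
    result = {}
    for pair, values in samples.items():
        weight_a = sum(values) / len(values)
        result[pair] = (weight_a, 1.0 - weight_a)
    return result


training = pairwise_weights(funding)
```

Because `combinations` runs over the sorted URLs, every pair comes out in alphabetical order, so the same two projects always share one key across rounds.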

Output Files

The process generates several CSV files:

  1. funding-data.csv: Raw funding data
  2. training-data-preagg.csv: All pairwise comparisons before deduplication
  3. training-data.csv: Final deduplicated pairwise comparisons (this is what is used for the competition)
  4. training-data-by-dependent-node.csv: Filtered comparisons for projects sharing dependencies

Data Format

The final deduplicated CSV (training-data.csv) contains these columns:

  • project_a: GitHub repository URL
  • project_b: GitHub repository URL
  • weight_a: Average relative funding weight for project_a
  • weight_b: Average relative funding weight for project_b
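
A quick sanity check on this format might look like the sketch below. It assumes the four column names listed above appear as the CSV header; the sample row and the helper name `check_rows` are hypothetical.

```python
import csv
import io
import math

# Hypothetical sample in the documented column layout.
sample = io.StringIO(
    "project_a,project_b,weight_a,weight_b\n"
    "https://github.com/org/a,https://github.com/org/b,0.625,0.375\n"
)


def check_rows(handle):
    """Return a list of (line_number, problem) for rows that break the format."""
    problems = []
    reader = csv.DictReader(handle)
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        weight_a = float(row["weight_a"])
        weight_b = float(row["weight_b"])
        if not math.isclose(weight_a + weight_b, 1.0, abs_tol=1e-9):
            problems.append((line_number, "weights do not sum to 1.0"))
        if row["project_a"] >= row["project_b"]:
            problems.append((line_number, "URLs not in alphabetical order"))
    return problems


problems = check_rows(sample)
```

To run it against the real file, pass a handle opened with `open("training-data.csv", newline="")` instead of the in-memory sample.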

Key Assumptions

  • Projects are identified primarily by their GitHub repository URLs
  • Only rounds with 2+ projects generate comparisons
  • All funding amounts are in USD
  • No time-based weighting within quarters
  • For projects with multiple repositories, we use the one with the most stars
  • Weights are relative within each funding round before averaging
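
The most-stars rule can be illustrated with a short sketch. The row layout and the name `primary_repo_by_project` are hypothetical, and the tie-breaking behavior (keep the first repository seen) is an assumption, since the description does not specify one.

```python
# Hypothetical repository rows: (project_id, repo_url, star_count).
repos = [
    ("proj-1", "https://github.com/org/a-docs", 12),
    ("proj-1", "https://github.com/org/a", 340),
    ("proj-2", "https://github.com/org/b", 5),
]


def primary_repo_by_project(rows):
    """Map each project ID to the URL of its most-starred repository."""
    best = {}
    for project_id, url, stars in rows:
        # Assumption: on a tie, the first repository seen is kept.
        if project_id not in best or stars > best[project_id][1]:
            best[project_id] = (url, stars)
    return {project_id: url for project_id, (url, _) in best.items()}


primary = primary_repo_by_project(repos)
```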

Example

If Project A received $75 and Project B received $25 in a funding round, the resulting comparison is:

{
    "project_a": "https://github.com/projectA",
    "project_b": "https://github.com/projectB",
    "weight_a": 0.75,  # (75/100)
    "weight_b": 0.25   # (25/100)
}

Note: The notebook also includes functionality to filter comparisons based on shared dependencies in the graph, available in the training-data-by-dependent-node.csv output.
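
One possible reading of that filter is sketched below. It rests on two assumptions that the description does not confirm: that the graph can be read as an edge list of (dependent, dependency) URLs, and that a pair qualifies when both projects are dependencies of the same dependent node. The sample data and the name `comparisons_by_dependent` are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list: (dependent_url, dependency_url).
edges = [
    ("https://github.com/org/app", "https://github.com/org/a"),
    ("https://github.com/org/app", "https://github.com/org/b"),
    ("https://github.com/org/tool", "https://github.com/org/c"),
]

# Hypothetical deduplicated comparisons keyed by ordered URL pair.
comparisons = {
    ("https://github.com/org/a", "https://github.com/org/b"): (0.625, 0.375),
    ("https://github.com/org/a", "https://github.com/org/c"): (0.9, 0.1),
}


def comparisons_by_dependent(edges, comparisons):
    """Return rows (dependent, project_a, project_b, weight_a, weight_b)."""
    # Collect the set of dependencies for each dependent node.
    dependencies = defaultdict(set)
    for dependent, dependency in edges:
        dependencies[dependent].add(dependency)

    # Keep a comparison only when both projects sit under the same dependent.
    rows = []
    for dependent in sorted(dependencies):
        deps = dependencies[dependent]
        for (url_a, url_b), (weight_a, weight_b) in sorted(comparisons.items()):
            if url_a in deps and url_b in deps:
                rows.append((dependent, url_a, url_b, weight_a, weight_b))
    return rows


filtered = comparisons_by_dependent(edges, comparisons)
```

Under these assumptions a pair can appear once per dependent node it shares, which would explain an output organized by dependent node.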
