OSO Funding-Based Training Dataset
This dataset contains pairwise comparisons between open source projects, where the weights are derived from their relative funding amounts. Here's how it was created:
Data Sources
- Project funding data from OSO's BigQuery tables (oss_funding_v0)
- Repository information from OSO's BigQuery tables (repositories_v0)
- Dependency graph containing repository URLs (unweighted_graph.json)
Process
- Graph Loading:
  - Loaded dependency graph from JSON
  - Extracted all repository URLs for analysis
- Funding Data Collection: For each project, we collected:
  - Quarterly funding amounts
  - Funder and grant pool information
  - Project names and IDs
  - Associated GitHub repository URLs
- Comparison Generation: For each funding round (defined by funder + quarter):
  - Found all projects that received funding (minimum 2 projects per round)
  - Generated all possible pairs using itertools.combinations
  - Calculated relative weights:
    weight_a = amount_a / (amount_a + amount_b)
    weight_b = 1 - weight_a  # Ensures weights sum to 1.0
- Deduplication:
  - Project pairs are stored in a consistent order (URLs sorted alphabetically)
  - When the same pair appears in multiple rounds, its weights are averaged
  - The averaged weights still sum to 1.0, because each round's pair of weights does
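The comparison and deduplication steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `pairwise_weights`, the shape of the `rounds` input, the funder names, and the rule for skipping pairs with no positive funding are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_weights(rounds):
    """Return {(url_a, url_b): (weight_a, weight_b)} averaged across rounds.

    `rounds` (hypothetical shape) maps a round key such as (funder, quarter)
    to a dict of {repo_url: amount_usd}.
    """
    per_pair = defaultdict(list)
    for projects in rounds.values():
        if len(projects) < 2:
            continue  # a round needs at least 2 projects to produce a pair
        # Sorting the URLs first means every pair comes out in alphabetical order,
        # so the same two projects always map to the same key.
        for url_a, url_b in combinations(sorted(projects), 2):
            total = projects[url_a] + projects[url_b]
            if total <= 0:
                continue  # assumption: skip pairs with no positive funding
            per_pair[(url_a, url_b)].append(projects[url_a] / total)

    averaged = {}
    for pair, weights in per_pair.items():
        weight_a = sum(weights) / len(weights)
        # Each round has weight_b = 1 - weight_a, so the averages also sum to 1.
        averaged[pair] = (weight_a, 1 - weight_a)
    return averaged


rounds = {
    ("funder-x", "2024-Q1"): {
        "https://github.com/projectA": 75,
        "https://github.com/projectB": 25,
    },
    ("funder-y", "2024-Q2"): {
        "https://github.com/projectB": 50,
        "https://github.com/projectA": 50,
    },
    ("funder-z", "2024-Q2"): {"https://github.com/projectA": 10},  # no pairs
}
result = pairwise_weights(rounds)
print(result)
# {('https://github.com/projectA', 'https://github.com/projectB'): (0.625, 0.375)}
```

Here the pair appears in two rounds with weights 0.75 and 0.5 for projectA, so the averaged weight is 0.625.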
Output Files
The process generates several CSV files:
- funding-data.csv: Raw funding data
- training-data-preagg.csv: All pairwise comparisons before deduplication
- training-data.csv: Final deduplicated pairwise comparisons (this is what is used for the competition)
- training-data-by-dependent-node.csv: Filtered comparisons for projects sharing dependencies
Data Format
The final deduplicated CSV contains:
- project_a: GitHub repository URL
- project_b: GitHub repository URL
- weight_a: Average relative funding weight for project_a
- weight_b: Average relative funding weight for project_b
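As a rough sketch of how a consumer might read this format, the snippet below parses rows with the four columns listed above and checks that each pair of weights sums to 1.0. It assumes the CSV header uses exactly these column names; the `load_comparisons` helper, the inline sample row, and the tolerance of 1e-6 are illustrative choices, not part of the dataset.

```python
import csv
import io
import math

# Inline sample standing in for training-data.csv (values are illustrative).
SAMPLE = """project_a,project_b,weight_a,weight_b
https://github.com/projectA,https://github.com/projectB,0.75,0.25
"""


def load_comparisons(handle, tolerance=1e-6):
    """Parse comparison rows and verify that weight_a + weight_b is 1.0."""
    rows = []
    for row in csv.DictReader(handle):
        weight_a = float(row["weight_a"])
        weight_b = float(row["weight_b"])
        if not math.isclose(weight_a + weight_b, 1.0, abs_tol=tolerance):
            raise ValueError(
                f"weights do not sum to 1.0 for {row['project_a']} "
                f"vs {row['project_b']}"
            )
        rows.append((row["project_a"], row["project_b"], weight_a, weight_b))
    return rows


rows = load_comparisons(io.StringIO(SAMPLE))
print(rows)
```

To read the real file, the `io.StringIO(SAMPLE)` handle could be replaced with `open("training-data.csv", newline="")`.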
Key Assumptions
- Projects are identified primarily by their GitHub repository URLs
- Only rounds with 2+ projects generate comparisons
- All funding amounts are in USD
- No time-based weighting within quarters
- For projects with multiple repositories, we use the one with the most stars
- Weights are relative within each funding round before averaging
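The "most stars" assumption could be applied along these lines. This is only a sketch: the `primary_repos` helper, the `(project_id, url, stars)` row shape, the sample data, and the tie-breaking rule (first row seen wins) are assumptions, not details taken from the notebook.

```python
def primary_repos(repo_rows):
    """Pick one repository URL per project: the one with the most stars.

    `repo_rows` is an iterable of (project_id, repo_url, star_count) tuples.
    On a tie, the first row encountered is kept (an arbitrary choice).
    """
    best = {}
    for project_id, url, stars in repo_rows:
        if project_id not in best or stars > best[project_id][1]:
            best[project_id] = (url, stars)
    return {project_id: url for project_id, (url, _) in best.items()}


# Hypothetical rows for illustration only.
repo_rows = [
    ("project-a", "https://github.com/projectA/core", 120),
    ("project-a", "https://github.com/projectA/docs", 15),
    ("project-b", "https://github.com/projectB", 40),
]
chosen = primary_repos(repo_rows)
print(chosen)
```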
Example
If Project A received $75 and Project B received $25 in a funding round:
{
  "project_a": "https://github.com/projectA",
  "project_b": "https://github.com/projectB",
  "weight_a": 0.75,  # (75/100)
  "weight_b": 0.25   # (25/100)
}

Note: The notebook also includes functionality to filter comparisons based on shared dependencies in the graph, available in the training-data-by-dependent-node.csv output.
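The issue does not spell out how the dependency-based filter works, so the following is one possible reading, offered only as a sketch. It assumes the graph can be reduced to `(dependent, dependency)` edges and keeps a pair when at least one node depends on both projects. The `filter_by_shared_dependent` name, the edge format, and the sample URLs are all hypothetical.

```python
from collections import defaultdict


def filter_by_shared_dependent(pairs, edges):
    """Keep only pairs whose two projects share at least one dependent node.

    `pairs` is a list of (url_a, url_b) tuples.
    `edges` is an iterable of (dependent_url, dependency_url) tuples
    (an assumed format, not necessarily that of unweighted_graph.json).
    """
    dependents = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    kept = []
    for url_a, url_b in pairs:
        # A non-empty intersection means some node depends on both projects.
        if dependents.get(url_a, set()) & dependents.get(url_b, set()):
            kept.append((url_a, url_b))
    return kept


edges = [
    ("https://github.com/app1", "https://github.com/projectA"),
    ("https://github.com/app1", "https://github.com/projectB"),
    ("https://github.com/app2", "https://github.com/projectC"),
]
pairs = [
    ("https://github.com/projectA", "https://github.com/projectB"),
    ("https://github.com/projectA", "https://github.com/projectC"),
]
filtered = filter_by_shared_dependent(pairs, edges)
print(filtered)
# [('https://github.com/projectA', 'https://github.com/projectB')]
```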