Parallel Data API Reference

Core Functions

run_enrichment(config_file: str | Path) -> None

Run data enrichment using a YAML configuration file.

Parameters:

  • config_file: Path to YAML configuration file

Raises:

  • FileNotFoundError: If config file doesn't exist
  • ValueError: If config is invalid
  • NotImplementedError: If source type is not supported

Example:

from parallel_web_tools import run_enrichment

run_enrichment("configs/my_enrichment.yaml")

run_enrichment_from_dict(config: dict) -> None

Run data enrichment using a configuration dictionary.

Parameters:

  • config: Configuration dictionary matching YAML schema

Raises:

  • ValueError: If config is invalid
  • NotImplementedError: If source type is not supported

Example:

from parallel_web_tools import run_enrichment_from_dict

config = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [
        {"name": "company", "description": "Company name"}
    ],
    "enriched_columns": [
        {"name": "revenue", "description": "Annual revenue"}
    ]
}

run_enrichment_from_dict(config)

Schema Classes

SourceType(Enum)

Enumeration of supported data source types.

Values:

  • SourceType.CSV: CSV file source
  • SourceType.DUCKDB: DuckDB database source
  • SourceType.BIGQUERY: Google BigQuery source

Example:

from parallel_web_tools import SourceType

source_type = SourceType.CSV

Column(dataclass)

Represents a column with a name and a description.

Attributes:

  • name (str): Column name
  • description (str): Column description

Example:

from parallel_web_tools import Column

col = Column("revenue", "Annual revenue in USD")

InputSchema(dataclass)

Schema for input data configuration.

Attributes:

  • source (str): Source location (file path or table name)
  • target (str): Target location (output file path or table name)
  • source_type (SourceType): Type of data source
  • source_columns (list[Column]): Input columns
  • enriched_columns (list[Column]): Columns to enrich

Example:

from parallel_web_tools import InputSchema, Column, SourceType

schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)

Utility Functions

load_schema(filename: str) -> dict

Load a schema dictionary from a YAML file.

Parameters:

  • filename: Path to YAML file

Returns:

  • Dictionary containing schema configuration

Example:

from parallel_web_tools import load_schema

schema_dict = load_schema("config.yaml")

parse_schema(schema: dict) -> InputSchema

Parse a schema dictionary into an InputSchema object.

Parameters:

  • schema: Schema dictionary

Returns:

  • InputSchema object

Raises:

  • ParseError: If schema is invalid

Example:

from parallel_web_tools import parse_schema

schema_dict = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [{"name": "company", "description": "Company name"}],
    "enriched_columns": [{"name": "revenue", "description": "Annual revenue"}]
}

schema = parse_schema(schema_dict)

Processor Functions (Advanced)

For direct access to processors (advanced usage):

process_csv(schema: InputSchema) -> None

Process CSV file and enrich data.

process_duckdb(schema: InputSchema) -> None

Process DuckDB table and enrich data.

process_bigquery(schema: InputSchema) -> None

Process BigQuery table and enrich data.

Example:

from parallel_web_tools import InputSchema, Column, SourceType
from parallel_web_tools.processors import process_csv

schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)

process_csv(schema)

Configuration Schema

YAML Configuration Format

source: path/to/source  # File path or table name
target: path/to/target  # Output location
source_type: csv  # One of: csv, duckdb, bigquery

source_columns:
  - name: column_name
    description: Column description

enriched_columns:
  - name: new_column_name
    description: Description of enriched column

Dictionary Configuration Format

{
    "source": "path/to/source",
    "target": "path/to/target",
    "source_type": "csv",  # or "duckdb", "bigquery"
    "source_columns": [
        {"name": "column_name", "description": "Column description"}
    ],
    "enriched_columns": [
        {"name": "new_column", "description": "Description"}
    ]
}

Environment Variables

Environment variables:

  • PARALLEL_API_KEY: Your Parallel API key (required)
  • DUCKDB_FILE: Path to DuckDB file (optional, default: data/file.db)
  • BIGQUERY_PROJECT: Google Cloud Project ID for BigQuery (optional)

Load from .env.local:

from dotenv import load_dotenv
load_dotenv(".env.local")
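Before running an enrichment it can be useful to fail fast when a required variable is unset. A minimal sketch using only the standard library; `check_env` is a hypothetical helper, not part of parallel_web_tools, and the variable names and defaults come from the list above:

```python
import os

# Required per the list above; optional variables have documented defaults.
REQUIRED = ["PARALLEL_API_KEY"]

def check_env() -> str:
    """Raise if a required variable is unset; return the DuckDB file path in use."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    # DUCKDB_FILE falls back to its documented default.
    return os.environ.get("DUCKDB_FILE", "data/file.db")
```

Calling `check_env()` once at startup surfaces a clear error before any enrichment work begins.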

Error Handling

All functions may raise the following exceptions (ParseError is exported by parallel_web_tools; the others are Python built-ins):

  • FileNotFoundError: Config or data file not found
  • ValueError: Invalid configuration
  • NotImplementedError: Unsupported source type
  • ParseError: Schema parsing failed

Example with error handling:

from parallel_web_tools import run_enrichment, ParseError

try:
    run_enrichment("config.yaml")
except FileNotFoundError:
    print("Config file not found")
except ParseError as e:
    print(f"Invalid configuration: {e}")
except Exception as e:
    print(f"Enrichment failed: {e}")