Run data enrichment using a YAML configuration file.
Parameters:
config_file: Path to YAML configuration file
Raises:
FileNotFoundError: If config file doesn't exist
ValueError: If config is invalid
NotImplementedError: If source type is not supported
Example:
from parallel_web_tools import run_enrichment
run_enrichment("configs/my_enrichment.yaml")

Run data enrichment using a configuration dictionary.
Parameters:
config: Configuration dictionary matching YAML schema
Raises:
ValueError: If config is invalid
NotImplementedError: If source type is not supported
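The failure modes listed above can be caught before any enrichment work starts. The sketch below is a hypothetical pre-flight check, not part of parallel_web_tools: `check_config`, `REQUIRED_KEYS`, and `SUPPORTED_SOURCE_TYPES` are illustrative names, and it assumes all five top-level keys are mandatory. The library's own validation may differ.

```python
# Hypothetical pre-flight check; these names are illustrative, not library API.
REQUIRED_KEYS = ("source", "target", "source_type", "source_columns", "enriched_columns")
SUPPORTED_SOURCE_TYPES = ("csv", "duckdb", "bigquery")


def check_config(config: dict) -> None:
    """Raise the exception types the docs list for an unusable config."""
    # Assumption: every top-level key is required.
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing config keys: {', '.join(missing)}")

    if config["source_type"] not in SUPPORTED_SOURCE_TYPES:
        raise NotImplementedError(f"Unsupported source type: {config['source_type']!r}")

    # Each column entry is expected to carry both a name and a description.
    for section in ("source_columns", "enriched_columns"):
        for column in config[section]:
            if "name" not in column or "description" not in column:
                raise ValueError(f"Each entry in {section} needs 'name' and 'description'")
```

Calling `check_config(config)` immediately before `run_enrichment_from_dict(config)` would surface a malformed dictionary early, under the assumptions stated above.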
Example:
from parallel_web_tools import run_enrichment_from_dict
config = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [
        {"name": "company", "description": "Company name"}
    ],
    "enriched_columns": [
        {"name": "revenue", "description": "Annual revenue"}
    ]
}
run_enrichment_from_dict(config)

Enumeration of supported data source types.
Values:
SourceType.CSV: CSV file source
SourceType.DUCKDB: DuckDB database source
SourceType.BIGQUERY: Google BigQuery source
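To show how the `source_type` strings used in configuration files might relate to these members, here is a minimal stand-in enum. `SourceTypeSketch` and `source_type_from_config` are hypothetical names, and the member values are assumed from the config strings (csv, duckdb, bigquery). The real `SourceType` may be defined differently.

```python
from enum import Enum


class SourceTypeSketch(Enum):
    """Stand-in for SourceType; the string values are an assumption."""

    CSV = "csv"
    DUCKDB = "duckdb"
    BIGQUERY = "bigquery"


def source_type_from_config(value: str) -> SourceTypeSketch:
    """Map a config string to an enum member, ignoring case and padding."""
    # Enum lookup by value raises ValueError for unknown strings.
    return SourceTypeSketch(value.strip().lower())
```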
Example:
from parallel_web_tools import SourceType
source_type = SourceType.CSV

Represents a column with name and description.
Attributes:
name (str): Column name
description (str): Column description
Example:
from parallel_web_tools import Column
col = Column("revenue", "Annual revenue in USD")

Schema for input data configuration.
Attributes:
source (str): Source location (file path, table name)
target (str): Target location
source_type (SourceType): Type of data source
source_columns (list[Column]): Input columns
enriched_columns (list[Column]): Columns to enrich
Example:
from parallel_web_tools import InputSchema, Column, SourceType
schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)

Load schema from YAML file.
Parameters:
filename: Path to YAML file
Returns:
Dictionary containing schema configuration
Example:
from parallel_web_tools import load_schema
schema_dict = load_schema("config.yaml")

Parse schema dictionary into InputSchema object.
Parameters:
schema: Schema dictionary
Returns:
InputSchema object
Raises:
ParseError: If schema is invalid
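As a rough picture of what turning a schema dictionary into typed objects involves, the sketch below uses stand-in dataclasses. `ColumnSketch`, `InputSchemaSketch`, `SchemaError`, and `parse_schema_sketch` are hypothetical names that mirror the documented attributes; they are not the library's classes, and `source_type` is kept as a plain string for brevity.

```python
from dataclasses import dataclass


class SchemaError(Exception):
    """Stand-in for the library's ParseError."""


@dataclass
class ColumnSketch:
    name: str
    description: str


@dataclass
class InputSchemaSketch:
    source: str
    target: str
    source_type: str
    source_columns: list[ColumnSketch]
    enriched_columns: list[ColumnSketch]


def parse_schema_sketch(schema: dict) -> InputSchemaSketch:
    """Build a typed schema, wrapping structural problems in SchemaError."""
    try:
        return InputSchemaSketch(
            source=schema["source"],
            target=schema["target"],
            source_type=schema["source_type"],
            # ColumnSketch(**c) raises TypeError on missing or unexpected keys.
            source_columns=[ColumnSketch(**c) for c in schema["source_columns"]],
            enriched_columns=[ColumnSketch(**c) for c in schema["enriched_columns"]],
        )
    except (KeyError, TypeError) as exc:
        raise SchemaError(f"Invalid schema: {exc}") from exc
```

Wrapping `KeyError` and `TypeError` in a single exception type gives callers one thing to catch, which is presumably the role `ParseError` plays in the real API.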
Example:
from parallel_web_tools import parse_schema
schema_dict = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [{"name": "company", "description": "Company name"}],
    "enriched_columns": [{"name": "revenue", "description": "Annual revenue"}]
}
schema = parse_schema(schema_dict)

For direct access to processors (advanced usage):
Process CSV file and enrich data.
Process DuckDB table and enrich data.
Process BigQuery table and enrich data.
Example:
from parallel_web_tools import InputSchema, Column, SourceType
from parallel_web_tools.processors import process_csv
schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)
process_csv(schema)

source: path/to/source  # File path or table name
target: path/to/target  # Output location
source_type: csv  # One of: csv, duckdb, bigquery
source_columns:
  - name: column_name
    description: Column description
enriched_columns:
  - name: new_column_name
    description: Description of enriched column

{
    "source": "path/to/source",
    "target": "path/to/target",
    "source_type": "csv",  # or "duckdb", "bigquery"
    "source_columns": [
        {"name": "column_name", "description": "Column description"}
    ],
    "enriched_columns": [
        {"name": "new_column", "description": "Description"}
    ]
}

Environment variables:
PARALLEL_API_KEY: Your Parallel API key (required)
DUCKDB_FILE: Path to DuckDB file (optional, default: data/file.db)
BIGQUERY_PROJECT: Google Cloud Project ID for BigQuery (optional)
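A script can check these variables up front and fail with a clear message when the API key is missing. The helper below is a sketch only: `read_settings` is a hypothetical name, and the library may read the environment in its own way. The default for `DUCKDB_FILE` is taken from the list above.

```python
import os
from typing import Mapping, Optional


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the documented variables, applying the documented default."""
    # Fall back to the real process environment when no mapping is passed.
    env = os.environ if env is None else env

    api_key = env.get("PARALLEL_API_KEY")
    if not api_key:
        raise RuntimeError("PARALLEL_API_KEY is not set")

    return {
        "api_key": api_key,
        "duckdb_file": env.get("DUCKDB_FILE", "data/file.db"),
        "bigquery_project": env.get("BIGQUERY_PROJECT"),  # None when unset
    }
```

Accepting an optional mapping keeps the helper easy to test without touching the real environment.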
Load from .env.local:
from dotenv import load_dotenv
load_dotenv(".env.local")

Functions may raise the following exceptions. ParseError is defined by parallel_web_tools; the others are Python built-ins:
FileNotFoundError: Config or data file not found
ValueError: Invalid configuration
NotImplementedError: Unsupported source type
ParseError: Schema parsing failed
Example with error handling:
from parallel_web_tools import run_enrichment, ParseError
try:
    run_enrichment("config.yaml")
except FileNotFoundError:
    print("Config file not found")
except ParseError as e:
    print(f"Invalid configuration: {e}")
except Exception as e:
    print(f"Enrichment failed: {e}")
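In a command-line script, the same pattern can be turned into process exit codes. The wrapper below is a sketch under assumptions: `run_cli` and the specific exit codes are hypothetical, and the enrichment function is passed in as a callable so the example runs without the library installed. `ParseError` is not handled by name here; unless it subclasses one of the built-ins listed, it falls through to the generic branch.

```python
import sys
from typing import Callable


def run_cli(run: Callable[[str], None], config_path: str) -> int:
    """Call run(config_path) and translate failures into an exit code."""
    try:
        run(config_path)
    except FileNotFoundError:
        print(f"Config or data file not found: {config_path}", file=sys.stderr)
        return 2
    except NotImplementedError as exc:
        print(f"Unsupported source type: {exc}", file=sys.stderr)
        return 3
    except ValueError as exc:
        print(f"Invalid configuration: {exc}", file=sys.stderr)
        return 4
    except Exception as exc:  # anything else, including library-specific errors
        print(f"Enrichment failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

With the library available, a script could end with `sys.exit(run_cli(run_enrichment, "config.yaml"))`.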