A cozy, comfortable data movement framework that makes data feel at home.
hygge (pronounced "hoo-ga") is a Danish word representing comfort, coziness, and well-being. This framework brings those qualities to data movement:
- Comfort: Moving tables should feel relaxing, not stressful
- Simplicity: Clean, intuitive APIs that feel natural
- Reliability: Robust, predictable behavior without surprises
- Flow: Smooth, efficient movement without friction
hygge is built on Polars with PyArrow backend for optimal data movement performance. This combination provides:
- Efficient columnar memory format for large datasets
- Automatic batching and streaming capabilities
- Broad database compatibility via SQLAlchemy
- Zero-copy operations where possible
- Clean, intuitive API that feels natural
We chose Polars because it provides the best balance of performance, developer experience, and compatibility for extract-and-load workflows.
Get started with hygge in three simple steps. We'll walk you through each one.
Create a new hygge project with example configuration:
```bash
hygge init my-project
cd my-project
```

This creates a cozy, comfortable project structure:

```text
my-project/
├── hygge.yml                 # Project-level configuration
└── flows/                    # Flow definitions
    └── example_flow/
        ├── flow.yml          # Flow configuration (home, store, defaults)
        └── entities/         # Entity definitions
            └── users.yml
```
What you get:
- `hygge.yml` - Project configuration file with example settings
- `flows/example_flow/flow.yml` - A complete example flow configuration
- `flows/example_flow/entities/users.yml` - An example entity definition
Edit `flows/example_flow/flow.yml` to point to your data sources and destinations:

```yaml
name: "example_flow"

home:
  type: "parquet"
  path: "data/source"         # Where your source data lives

store:
  type: "parquet"
  path: "data/destination"    # Where data should be written

defaults:
  key_column: "id"            # Primary key for watermark tracking
  batch_size: 10000           # Rows per batch
```

Understanding the flow:

- `home` - Your data source (where data starts its journey)
- `store` - Your data destination (where data settles)
- `defaults` - Settings that apply to all entities in this flow
Entities are the specific tables or datasets you want to move. Edit `flows/example_flow/entities/users.yml`:

```yaml
name: "users"
columns:
  - id
  - name
  - email
  - created_at
```

For multiple entities, create additional files in `flows/example_flow/entities/`:

- `users.yml`
- `orders.yml`
- `products.yml`
Each entity will run in parallel when you execute the flow.
See what flows would do before executing:
```bash
# Quick preview - one line per flow
hygge go --dry-run

# Detailed preview with full configuration
hygge go --dry-run --verbose
```

What this shows:
- Which flows/entities would run
- Source → destination mapping
- Incremental vs full load mode
- Configuration warnings
Example output:
```text
🏡 hygge dry-run preview
Would run 2 flow(s)
✓ example_flow_users parquet → parquet (incremental)
✓ example_flow_orders parquet → parquet (full load)
📊 Summary:
✓ 2 flow(s) configured
💡 Next steps:
• Test connections: hygge debug
• Run flows: hygge go
```
Test that hygge can connect to your sources and destinations:
```bash
hygge debug
```

What this checks:
- Configuration file validity
- Database connections (if configured)
- File paths exist and are accessible
- Provides actionable guidance for any issues
Execute all flows in your project:

```bash
hygge go    # Run all flows
```

Common options:

- `--flow NAME` - Run specific flow(s)
- `--incremental` - Append data instead of truncating
- `--verbose` - Detailed progress information
See the CLI Commands section for complete usage details.
What happens:
- All flows run in parallel
- Each entity processes independently
- Progress is shown for each flow
- Results are logged with success/failure status
Expected output:
```text
Starting all flows...
[1 of 2] FINISHED flow example_flow completed in 2.3s (1,234 rows)
[2 of 2] FINISHED flow another_flow completed in 4.5s (5,678 rows)
Finished running 2 flows in 6.80 seconds.
Completed successfully
Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
Total rows processed: 6,912
Overall rate: 1,016 rows/s
```
All flows completed successfully!
Once your first flow is working:
- Add more entities - Create additional entity files in `flows/example_flow/entities/`
- Connect to databases - Configure SQL Server or other database connections in `hygge.yml`
- Use cloud storage - Configure Azure Data Lake or Microsoft Fabric destinations
- Enable incremental processing - Set up watermarks for efficient incremental loads
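Watermark-based incremental loading is simple in principle: remember the highest key (or timestamp) already loaded, and only fetch rows beyond it on the next run. A stdlib-only sketch of the idea, with a hypothetical `watermark.json` file standing in for hygge's journal:

```python
import json
from pathlib import Path

WATERMARK_FILE = Path("watermark.json")

def load_watermark(entity: str) -> int:
    """Return the last key value loaded for this entity (0 if none)."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text()).get(entity, 0)
    return 0

def save_watermark(entity: str, value: int) -> None:
    marks = json.loads(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else {}
    marks[entity] = value
    WATERMARK_FILE.write_text(json.dumps(marks))

# Source rows, keyed by the configured key_column ("id").
rows = [{"id": 1}, {"id": 2}, {"id": 3}]

save_watermark("users", 1)  # pretend id=1 was loaded in a previous run
new_rows = [r for r in rows if r["id"] > load_watermark("users")]
save_watermark("users", max(r["id"] for r in new_rows))

print(len(new_rows), load_watermark("users"))  # 2 3
```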
See the Common Workflows section below for more detailed examples.
Create a cozy new hygge project:
```bash
hygge init my-project                         # Create new project
hygge init my-project --flows-dir pipelines   # Custom flows directory
hygge init my-project --force                 # Overwrite existing project
```

Sets up a comfortable project structure with example configuration files to get you started.
Let your data flow comfortably through all your flows:
```bash
# Run all flows
hygge go

# Run specific flows (comma-separated)
hygge go --flow users_to_lake,orders_to_lake

# Or use multiple flags
hygge go --flow users_to_lake --flow orders_to_lake

# Run specific entities within a flow (comma-separated)
hygge go --entity users_to_lake.users,users_to_lake.orders

# Or use multiple flags
hygge go --entity users_to_lake.users --entity users_to_lake.orders

# Override run type (incremental or full-drop)
hygge go --incremental   # Append data instead of truncating
hygge go --full-drop     # Truncate destination before loading

# Override flow configuration
hygge go --var flow.mssql_to_mirrored_db.full_drop=true

# Control concurrency
hygge go --concurrency 4

# Enable verbose logging
hygge go --verbose
```

Flow filtering:

- `--flow` accepts base flow names (e.g., `users_to_lake`) or entity flow names (e.g., `users_to_lake_users`)
- `--entity` uses the format `flow.entity` (e.g., `users_to_lake.users`)
- Both support comma-separated values (`--flow flow1,flow2,flow3`) or multiple flags (`--flow flow1 --flow flow2`)
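The two accepted `--flow` forms amount to a simple match rule: a selector hits either the base flow name or the `<flow>_<entity>` name. An illustrative sketch, not the actual parser:

```python
def flow_matches(selector: str, base: str, entity: str) -> bool:
    """--flow accepts the base flow name or the entity flow name."""
    return selector in (base, f"{base}_{entity}")

print(flow_matches("users_to_lake", "users_to_lake", "users"))        # True
print(flow_matches("users_to_lake_users", "users_to_lake", "users"))  # True
print(flow_matches("orders_to_lake", "users_to_lake", "users"))       # False
```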
Run type overrides:

- `--incremental`: Append data to destination (uses journal + watermarks)
- `--full-drop`: Truncate destination before loading
- The two flags cannot be specified together

Flow overrides (`--var`):

- Format: `flow.<flow_name>.field=value`
- Example: `flow.users_to_lake.store.compression=snappy`
- Supports nested field paths for advanced overrides
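A `--var` override like `flow.users_to_lake.store.compression=snappy` is just a dotted path into nested configuration (with the leading `flow.` prefix stripped before applying). One way such an override could be applied, as an illustration rather than hygge's code:

```python
def set_by_path(config: dict, dotted: str, value: str) -> None:
    """Walk a nested dict along 'a.b.c' and set the final key."""
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

config = {"users_to_lake": {"store": {"type": "parquet"}}}
set_by_path(config, "users_to_lake.store.compression", "snappy")
print(config["users_to_lake"]["store"]["compression"])  # snappy
```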
Runs all flows in parallel, keeping you informed with cozy progress updates and results.
Take a cozy look at your project configuration:
```bash
hygge debug   # Shows project details and discovered flows
```

Use this to make sure everything feels right before running your flows.
1. Comfort Over Complexity
   - APIs should feel natural and intuitive
   - Configuration should be simple but flexible
   - Defaults should "just work"

2. Flow Over Force
   - Data should move smoothly between systems
   - Batching and buffering should happen naturally
   - Progress should be visible but unobtrusive

3. Reliability Over Speed
   - Prefer robust, predictable behavior
   - Handle errors gracefully
   - Make recovery simple

4. Clarity Over Cleverness
   - Simple, clear code over complex optimizations
   - Explicit configuration over implicit behavior
   - Clear logging and progress tracking
hygge organizes data movement around three simple concepts:
- Home: Where data feels at home (source - database, parquet files, etc.)
- Store: Where data settles comfortably (destination - parquet, Azure, Fabric, etc.)
- Flow: The cozy journey from home to store (configured in YAML, executed via CLI)
All configuration happens in YAML files. You define flows, and hygge handles the execution:
```bash
# Define your flows in YAML, then run them
hygge go
```

See the `samples/` directory for complete configuration examples.
Where your data feels at home:
Parquet Files:

```yaml
home:
  type: parquet
  path: data/source
```

MS SQL Server:
Configure connections in `hygge.yml`:

```yaml
# hygge.yml
connections:
  my_database:
    type: mssql
    server: myserver.database.windows.net
    database: mydatabase
    pool_size: 8
```

Then define flows in `flows/<flow_name>/flow.yml`:
```yaml
# flows/users_flow/flow.yml
name: users_flow

home:
  type: mssql
  connection: my_database
  table: dbo.users

store:
  type: parquet
  path: data/users
```

Features:
- Azure AD authentication (Managed Identity, Azure CLI, Service Principal)
- Connection pooling for efficient concurrent access
- Entity pattern for extracting 10-200+ tables
- Watermark-aware incremental reads
Prerequisites:

- ODBC Driver 18 for SQL Server (`brew install msodbcsql18` on macOS)
- Azure AD authentication configured
Where your data settles comfortably:
Parquet Files:

```yaml
store:
  type: parquet
  path: data/destination
```

Microsoft Fabric Open Mirroring:

```yaml
store:
  type: open_mirroring
  account_url: https://onelake.dfs.fabric.microsoft.com
  filesystem: my-workspace
  mirror_name: my-mirror
  key_columns: ["id"]
```

For full_drop runs, hygge deletes the LandingZone folder to trigger Open Mirroring to drop the table, then waits for Open Mirroring to process the deletion before writing new data. The wait time is configurable:
```yaml
store:
  type: open_mirroring
  folder_deletion_wait_seconds: 180   # Wait 3 minutes (default: 120s)
```

This can also be set per entity for tables that need more time:

```yaml
entities:
  - name: LargeTable
    store:
      folder_deletion_wait_seconds: 300   # 5 minutes for large table
```

Azure Data Lake Storage (ADLS Gen2):
```yaml
store:
  type: adls
  account_url: https://mystorage.dfs.core.windows.net
  filesystem: my-container
  credential: managed_identity
```

See the `samples/` directory for complete configuration examples.
A cozy workflow to get your data moving:
1. Create connection configuration in `hygge.yml`:

```yaml
connections:
  my_database:
    type: mssql
    server: myserver.database.windows.net
    database: mydatabase
    pool_size: 8
```

2. Define your flow in `flows/users_to_parquet/flow.yml`:

```yaml
name: users_to_parquet

home:
  type: mssql
  connection: my_database
  table: dbo.users

store:
  type: parquet
  path: data/users
```

3. Run it:

```bash
hygge go
```

Get your data comfortably settled in Fabric:
1. Configure Open Mirroring store in your flow:

```yaml
# flows/my_flow/flow.yml
name: my_flow

store:
  type: open_mirroring
  account_url: https://onelake.dfs.fabric.microsoft.com
  filesystem: my-workspace
  mirror_name: my-mirror
  key_columns: ["id"]
```

2. Run your flow:

```bash
hygge go
```

hygge automatically handles all the cozy details - metadata files, schema manifests, and atomic operations - so your data feels right at home in Fabric.
For full_drop runs, hygge safely extracts data to a staging area first, then deletes the LandingZone folder (triggering Open Mirroring to drop the table), waits for Open Mirroring to process (~2 minutes by default), and moves the new data into place. If extraction fails, the existing table is untouched.
```yaml
# flows/my_flow/flow.yml
run_type: full_drop

store:
  type: open_mirroring
  account_url: https://onelake.dfs.fabric.microsoft.com
  filesystem: my-workspace
  mirror_name: my-mirror
  key_columns: ["id"]
  folder_deletion_wait_seconds: 120   # Default: 2 minutes
```

Create your own cozy homes and stores by implementing the Home and Store interfaces. hygge automatically discovers and welcomes them, making them feel right at home in your YAML configurations.
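In spirit, a custom home or store is a class with read and write hooks that exchange batches. The class and method names below are hypothetical, for illustration only; consult the actual interface definitions in the source:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Home(ABC):
    """Hypothetical source interface: yields batches of rows."""
    @abstractmethod
    def read(self) -> Iterator[list]: ...

class Store(ABC):
    """Hypothetical destination interface: accepts batches of rows."""
    @abstractmethod
    def write(self, batch: list) -> None: ...

class ListHome(Home):
    """Toy home that serves an in-memory list in fixed-size batches."""
    def __init__(self, rows: list, batch_size: int = 2):
        self.rows, self.batch_size = rows, batch_size
    def read(self):
        for i in range(0, len(self.rows), self.batch_size):
            yield self.rows[i : i + self.batch_size]

class ListStore(Store):
    """Toy store that collects everything it receives."""
    def __init__(self):
        self.rows = []
    def write(self, batch):
        self.rows.extend(batch)

home, store = ListHome([{"id": i} for i in range(5)]), ListStore()
for batch in home.read():
    store.write(batch)
print(len(store.rows))  # 5
```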
hygge coordinates run strategy between the flow and the store. By default the store follows the flow's run_type (incremental vs full_drop). You can override that behaviour per store:
```yaml
store:
  type: open_mirroring
  account_url: https://onelake.dfs.fabric.microsoft.com
  filesystem: my-workspace
  mirror_name: my-mirror
  incremental: false   # force truncate-and-reload even if the flow is incremental
```

| Flow run_type | Store incremental | Behaviour |
|---|---|---|
| incremental | omitted / null | Append via journal + watermark |
| incremental | true | Append (explicit opt-in) |
| incremental | false | Force truncate-and-reload each run |
| full_drop | omitted / null | Truncate destination before reload |
| full_drop | true | Force append even on full-drop runs (use with care) |
| full_drop | false | Truncate (explicit opt-in) |
This alignment keeps the flow, store, and journal in sync and prevents accidental mixes of append/truncate semantics.
hygge runs multiple entity flows in parallel, controlled by the concurrency setting:
```yaml
# hygge.yml
options:
  concurrency: 8   # Up to 8 flows run simultaneously
```

```bash
# Or override from CLI
hygge go --concurrency 4
```

Smart concurrency for full_drop flows: When a full_drop flow finishes extracting data, it releases its concurrency slot immediately — even if it's still waiting for Open Mirroring to process a folder deletion. This means other entities can start extracting data during that wait time instead of sitting idle. For projects with many entities, this can save significant time per run.
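The early slot release can be pictured with a semaphore: a flow's slot is freed the moment extraction ends, while any post-extraction wait continues off to the side. A simplified stdlib sketch; hygge's actual scheduler may differ:

```python
import threading
import time

slots = threading.Semaphore(2)   # concurrency: 2
events: list = []
lock = threading.Lock()

def run_flow(name: str, wait_after: float) -> None:
    slots.acquire()
    with lock:
        events.append(f"{name}:extract")   # extraction holds a slot
    slots.release()                        # slot freed BEFORE the wait
    time.sleep(wait_after)                 # e.g. Open Mirroring deletion wait
    with lock:
        events.append(f"{name}:done")

threads = [threading.Thread(target=run_flow, args=(n, 0.05)) for n in ("a", "b", "c")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three flows extract promptly even though concurrency is 2,
# because flows in their waiting phase don't hold a slot.
print(sum(e.endswith(":extract") for e in events))  # 3
```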
- Keep it simple and cozy
- Make common tasks feel effortless
- Make complex tasks comfortable
- Prioritize a warm, welcoming user experience
- Write clear, maintainable code that feels good to read
- Test thoroughly but sensibly
hygge isn't just about moving data - it's about making data movement feel natural, comfortable, and reliable. Like a warm blanket for your data pipelines.