- **Company:** SafeStreets
- **Level:** Mid / Senior
- **Expected Time:** 2–4 hours
- **Submission:** GitHub repository or CodeSubmit
- **Required Tools:** dbt (Data Build Tool)
SafeStreets collects event data from devices and user actions. Events arrive as raw data and must be transformed into analytics-ready tables using dbt.
You are given a raw dataset (seeds/raw_events.csv) containing:
- Missing fields
- Duplicate records
- Inconsistent timestamps
- Evolving schema (some events have fields others don't)
Your task is to build a dbt project that cleans this data and creates a dimensional model. Please do not use AI to write code for you in this project; we want to see as much of YOU as possible. Please do not spend more than 4 hours on this project. Focus on senior-level work and complete as much as you can in that timeframe.
Goal: Handle messy, real-world data using dbt.
Implement the staging model models/staging/stg_events.sql to:
- Normalize timestamps to a consistent UTC format
- Deduplicate events (hint: check `event_id`)
- Handle missing or invalid fields
- Standardize data (e.g., normalize inconsistently cased `user_id` values)
- Filter or flag invalid records
Deliverable: Complete models/staging/stg_events.sql
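One common shape for the deduplication step is a window function over `event_id`. The sketch below is illustrative only, not a required approach or a reference solution; the `event_timestamp` column name is an assumption, and a real model would list columns explicitly rather than `select *`:

```sql
-- Illustrative sketch: keep one row per event_id.
-- `event_timestamp` is an assumed column name.
with ranked as (
    select
        *,
        row_number() over (
            partition by event_id
            order by event_timestamp
        ) as row_num
    from {{ ref('raw_events') }}
    where event_id is not null
)

select *
from ranked
where row_num = 1
```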
Goal: Design analytics-ready tables using dimensional modeling.
Implement the mart models in models/marts/:
- `fact_events.sql` — Central fact table for event analytics
- `dim_devices.sql` — Device dimension
- `dim_users.sql` — User dimension (optional but recommended)
Deliverable: Complete the models in models/marts/
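As a reminder of the dbt testing convention, a `_marts__models.yml` entry typically pairs each model with `unique`/`not_null`/`relationships` tests. The fragment below is a sketch only; the column names (`event_id`, `device_id`) are assumptions about your schema:

```yaml
# Illustrative fragment for _marts__models.yml; column names are assumed.
version: 2

models:
  - name: fact_events
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: device_id
        tests:
          - relationships:
              to: ref('dim_devices')
              field: device_id
```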
Goal: Write SQL to answer 2–3 of the following questions.
- Daily active devices
- Events per device per day
- Top event types in the last 7 days
- Average events per user
- Event count trends over time
Focus on:
- Correct joins between fact and dimension tables
- Readability (use CTEs where appropriate)
- Performance considerations
Deliverable: Queries in analyses/analytics_queries.sql
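To show the expected shape (not a required answer), a "daily active devices" query against the fact table might look roughly like this; it assumes `fact_events` exposes `event_timestamp` and `device_id` columns:

```sql
-- Illustrative sketch for one analytics question (daily active devices).
-- Column names are assumptions about your fact table.
select
    date_trunc('day', event_timestamp) as event_date,
    count(distinct device_id) as daily_active_devices
from {{ ref('fact_events') }}
group by 1
order by 1
```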
├── README.md # This file (add your answers below)
├── dbt_project.yml # dbt project configuration
├── profiles.yml.example # Example dbt profile (copy to ~/.dbt/)
├── seeds/
│ └── raw_events.csv # Raw input data (loaded via dbt seed)
├── models/
│ ├── staging/
│ │ ├── _staging__sources.yml # Source definitions
│ │ ├── _staging__models.yml # Model documentation & tests
│ │ └── stg_events.sql # YOUR WORK: Staging model
│ └── marts/
│ ├── _marts__models.yml # Model documentation & tests
│ ├── fact_events.sql # YOUR WORK: Fact table
│ ├── dim_devices.sql # YOUR WORK: Device dimension
│ └── dim_users.sql # YOUR WORK: User dimension
├── analyses/
│ └── analytics_queries.sql # YOUR WORK: Analytics queries
├── tests/ # Custom data tests (optional)
└── macros/ # Custom macros (optional)
- Python 3.8+
- dbt-core with a database adapter
Recommended setup (DuckDB - no external database required):

```shell
pip install dbt-core dbt-duckdb
```

Alternative adapters:

```shell
pip install dbt-postgres   # For PostgreSQL
pip install dbt-sqlite     # For SQLite
```

1. Clone this repository

2. Copy the example profile and configure:

   ```shell
   cp profiles.yml.example ~/.dbt/profiles.yml
   ```

   Or set `DBT_PROFILES_DIR` to this directory.

3. Verify dbt can connect:

   ```shell
   dbt debug
   ```

4. Load the seed data:

   ```shell
   dbt seed
   ```

5. Run your models:

   ```shell
   dbt run
   ```

6. Run tests:

   ```shell
   dbt test
   ```
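For orientation, a dbt-duckdb profile typically looks like the sketch below. Treat `profiles.yml.example` in this repository as the source of truth; the profile name (`safestreets`) and database path here are assumptions:

```yaml
# Sketch of a dbt-duckdb profile; names and paths are assumptions.
# The top-level key must match the `profile:` value in dbt_project.yml.
safestreets:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: safestreets.duckdb
      threads: 1
```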
```shell
dbt seed                  # Load raw_events.csv into database
dbt run                   # Build all models
dbt run --select staging  # Build only staging models
dbt run --select marts    # Build only mart models
dbt test                  # Run all tests
dbt compile               # Compile SQL without running (useful for analyses)
dbt docs generate         # Generate documentation
dbt docs serve            # View documentation in browser
```

The `seeds/raw_events.csv` contains intentional data quality issues for you to handle:
| Issue | Examples |
|---|---|
| Duplicate records | evt_001 and evt_016 appear twice |
| Missing fields | Some events missing device_id or user_id |
| Inconsistent timestamps | ISO 8601, Unix epoch, US format, European format |
| Invalid data | Empty event_id, null device_id, invalid timestamp |
| Case inconsistency | usr_100 vs USR_100 |
| Evolving schema | Some events have location or device_metadata |
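For the mixed timestamp formats, one DuckDB-flavored approach is to try each format in turn and take the first that parses. This is a sketch only: the raw column name and format strings are assumptions, and note that US and European formats are ambiguous when the day is 12 or less, so the order of attempts matters:

```sql
-- Illustrative DuckDB sketch; `raw_timestamp` and the format
-- strings are assumptions about the seed data.
select
    coalesce(
        try_strptime(raw_timestamp, '%Y-%m-%dT%H:%M:%SZ'),           -- ISO 8601
        to_timestamp(try_cast(raw_timestamp as bigint))::timestamp,  -- Unix epoch
        try_strptime(raw_timestamp, '%m/%d/%Y %H:%M'),               -- US format
        try_strptime(raw_timestamp, '%d.%m.%Y %H:%M')                -- European format
    ) as event_timestamp_utc
from {{ ref('raw_events') }}
```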
| Criteria | Weight | What We're Looking For |
|---|---|---|
| Correctness | 30% | Does the solution work? Are edge cases handled? |
| Data Modeling | 25% | Is the schema well-designed for analytics? |
| dbt Best Practices | 20% | Proper use of staging/marts, tests, documentation |
| Code Quality | 15% | Readable, maintainable SQL and project organization |
| Production Thinking | 10% | Monitoring, error handling, scalability considerations |
Instructions: Complete the sections below with your work.
```shell
# Your commands here
```

- Please do not use cloud services or external accounts
- Please do not use AI to write your solutions for this project
- PLEASE DO NOT SPEND MORE THAN 4 HOURS ON THIS PROJECT
- If you run out of time, document what you would have done next
- Questions? Email your recruiter or hiring manager