- **Company:** SafeStreets
- **Level:** Mid / Senior
- **Expected Time:** 2–4 hours
- **Submission:** GitHub repository or CodeSubmit
- **Required Tools:** dbt (Data Build Tool)
SafeStreets collects event data from devices and user actions. Events arrive as raw data and must be transformed into analytics-ready tables using dbt.
You are given a raw dataset (seeds/raw_events.csv) containing:
- Missing fields
- Duplicate records
- Inconsistent timestamps
- Evolving schema (some events have fields others don't)
Your task is to build a dbt project that cleans this data and creates a dimensional model. Please do not use AI to write code for you in this project; we want to see as much of YOU as possible. Please do not spend more than 4 hours on this project. Focus on senior-level work and complete as much as you can in that timeframe.
Goal: Handle messy, real-world data using dbt.
Implement the staging model models/staging/stg_events.sql to:
- Normalize timestamps to a consistent UTC format
- Deduplicate events (hint: check `event_id`)
- Handle missing or invalid fields
- Standardize data (e.g., normalize inconsistently cased `user_id` values)
- Filter or flag invalid records
Deliverable: Complete models/staging/stg_events.sql
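One common shape for the deduplication step is a window function over `event_id`. The sketch below is illustrative only, not a required approach or a reference solution; the `event_timestamp` column name is an assumption, and a real model would list columns explicitly rather than `select *`:

```sql
-- Illustrative sketch: keep one row per event_id.
-- `event_timestamp` is an assumed column name.
with ranked as (
    select
        *,
        row_number() over (
            partition by event_id
            order by event_timestamp
        ) as row_num
    from {{ ref('raw_events') }}
    where event_id is not null
)

select *
from ranked
where row_num = 1
```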
Goal: Design analytics-ready tables using dimensional modeling.
Implement the mart models in models/marts/:
- `fact_events.sql` — Central fact table for event analytics
- `dim_devices.sql` — Device dimension
- `dim_users.sql` — User dimension (optional but recommended)
Deliverable: Complete the models in models/marts/
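As a reminder of the dbt testing convention, a `_marts__models.yml` entry typically pairs each model with `unique`/`not_null`/`relationships` tests. The fragment below is a sketch only; the column names (`event_id`, `device_id`) are assumptions about your schema:

```yaml
# Illustrative fragment for _marts__models.yml; column names are assumed.
version: 2

models:
  - name: fact_events
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: device_id
        tests:
          - relationships:
              to: ref('dim_devices')
              field: device_id
```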
Goal: Write SQL to answer 2–3 of the following questions.
- Daily active devices
- Events per device per day
- Top event types in the last 7 days
- Average events per user
- Event count trends over time
Focus on:
- Correct joins between fact and dimension tables
- Readability (use CTEs where appropriate)
- Performance considerations
Deliverable: Queries in analyses/analytics_queries.sql
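To show the expected shape (not a required answer), a "daily active devices" query against the fact table might look roughly like this; it assumes `fact_events` exposes `event_timestamp` and `device_id` columns:

```sql
-- Illustrative sketch for one analytics question (daily active devices).
-- Column names are assumptions about your fact table.
select
    date_trunc('day', event_timestamp) as event_date,
    count(distinct device_id) as daily_active_devices
from {{ ref('fact_events') }}
group by 1
order by 1
```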
├── README.md # This file (add your answers below)
├── dbt_project.yml # dbt project configuration
├── profiles.yml.example # Example dbt profile (copy to ~/.dbt/)
├── seeds/
│ └── raw_events.csv # Raw input data (loaded via dbt seed)
├── models/
│ ├── staging/
│ │ ├── _staging__sources.yml # Source definitions
│ │ ├── _staging__models.yml # Model documentation & tests
│ │ └── stg_events.sql # YOUR WORK: Staging model
│ └── marts/
│ ├── _marts__models.yml # Model documentation & tests
│ ├── fact_events.sql # YOUR WORK: Fact table
│ ├── dim_devices.sql # YOUR WORK: Device dimension
│ └── dim_users.sql # YOUR WORK: User dimension
├── analyses/
│ └── analytics_queries.sql # YOUR WORK: Analytics queries
├── tests/ # Custom data tests (optional)
└── macros/ # Custom macros (optional)
- Python 3.8+
- dbt-core with a database adapter
Recommended setup (DuckDB - no external database required):

```shell
pip install dbt-core dbt-duckdb
```

Alternative adapters:

```shell
pip install dbt-postgres   # For PostgreSQL
pip install dbt-sqlite     # For SQLite
```

1. Clone this repository

2. Copy the example profile and configure:

   ```shell
   cp profiles.yml.example ~/.dbt/profiles.yml
   ```

   Or set `DBT_PROFILES_DIR` to this directory.

3. Verify dbt can connect:

   ```shell
   dbt debug
   ```

4. Load the seed data:

   ```shell
   dbt seed
   ```

5. Run your models:

   ```shell
   dbt run
   ```

6. Run tests:

   ```shell
   dbt test
   ```
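For orientation, a dbt-duckdb profile typically looks like the sketch below. Treat `profiles.yml.example` in this repository as the source of truth; the profile name (`safestreets`) and database path here are assumptions:

```yaml
# Sketch of a dbt-duckdb profile; names and paths are assumptions.
# The top-level key must match the `profile:` value in dbt_project.yml.
safestreets:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: safestreets.duckdb
      threads: 1
```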
```shell
dbt seed                  # Load raw_events.csv into database
dbt run                   # Build all models
dbt run --select staging  # Build only staging models
dbt run --select marts    # Build only mart models
dbt test                  # Run all tests
dbt compile               # Compile SQL without running (useful for analyses)
dbt docs generate         # Generate documentation
dbt docs serve            # View documentation in browser
```

The `seeds/raw_events.csv` contains intentional data quality issues for you to handle:
| Issue | Examples |
|---|---|
| Duplicate records | evt_001 and evt_016 appear twice |
| Missing fields | Some events missing device_id or user_id |
| Inconsistent timestamps | ISO 8601, Unix epoch, US format, European format |
| Invalid data | Empty event_id, null device_id, invalid timestamp |
| Case inconsistency | usr_100 vs USR_100 |
| Evolving schema | Some events have location or device_metadata |
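For the mixed timestamp formats, one DuckDB-flavored approach is to try each format in turn and take the first that parses. This is a sketch only: the raw column name and format strings are assumptions, and note that US and European formats are ambiguous when the day is 12 or less, so the order of attempts matters:

```sql
-- Illustrative DuckDB sketch; `raw_timestamp` and the format
-- strings are assumptions about the seed data.
select
    coalesce(
        try_strptime(raw_timestamp, '%Y-%m-%dT%H:%M:%SZ'),           -- ISO 8601
        to_timestamp(try_cast(raw_timestamp as bigint))::timestamp,  -- Unix epoch
        try_strptime(raw_timestamp, '%m/%d/%Y %H:%M'),               -- US format
        try_strptime(raw_timestamp, '%d.%m.%Y %H:%M')                -- European format
    ) as event_timestamp_utc
from {{ ref('raw_events') }}
```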
| Criteria | Weight | What We're Looking For |
|---|---|---|
| Correctness | 30% | Does the solution work? Are edge cases handled? |
| Data Modeling | 25% | Is the schema well-designed for analytics? |
| dbt Best Practices | 20% | Proper use of staging/marts, tests, documentation |
| Code Quality | 15% | Readable, maintainable SQL and project organization |
| Production Thinking | 10% | Monitoring, error handling, scalability considerations |
Instructions: Complete the sections below with your work.
```shell
# Your commands here
```

- Please do not use cloud services or external accounts
- Please do not use AI to write your solutions for this project
- PLEASE DO NOT SPEND MORE THAN 4 HOURS ON THIS PROJECT
- If you run out of time, document what you would have done next
- Questions? Email your recruiter or hiring manager