
Data Engineer Take-Home Assessment

Company: SafeStreets
Level: Mid / Senior
Expected Time: 2–4 hours
Submission: GitHub repository or CodeSubmit
Required Tools: dbt (Data Build Tool)


Scenario

SafeStreets collects event data from devices and user actions. Events arrive as raw data and must be transformed into analytics-ready tables using dbt.

You are given a raw dataset (seeds/raw_events.csv) containing:

  • Missing fields
  • Duplicate records
  • Inconsistent timestamps
  • Evolving schema (some events have fields others don't)

Your task is to build a dbt project that cleans this data and creates a dimensional model. Please do not use AI to write code for this project; we want to see as much of your own work as possible. Please do not spend more than 4 hours on this project. Focus on senior-level work and complete as much as you can in that timeframe.


Tasks

Task 1 — Data Ingestion & Cleaning (Required)

Goal: Handle messy, real-world data using dbt.

Implement the staging model models/staging/stg_events.sql to:

  1. Normalize timestamps to a consistent UTC format
  2. Deduplicate events (hint: check event_id)
  3. Handle missing or invalid fields
  4. Standardize data (e.g., normalize case-inconsistent user_ids)
  5. Filter or flag invalid records

Deliverable: Complete models/staging/stg_events.sql
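
For orientation only, a deduplication pass in the staging model might take the shape below. This is a minimal sketch in DuckDB-flavored dbt SQL, not a prescribed solution; column names such as event_timestamp are assumptions about the seed schema.

    -- Sketch only: assumes raw_events exposes event_id, user_id, event_timestamp.
    with source as (
        select * from {{ ref('raw_events') }}
    ),

    ranked as (
        select
            *,
            lower(user_id) as user_id_clean,   -- fold case-inconsistent ids
            row_number() over (
                partition by event_id
                order by event_timestamp desc  -- keep the most recent duplicate
            ) as row_num
        from source
        where event_id is not null
          and event_id != ''                   -- drop records with no usable key
    )

    select * from ranked
    where row_num = 1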


Task 2 — Data Modeling (Required)

Goal: Design analytics-ready tables using dimensional modeling.

Implement the mart models in models/marts/:

  • fact_events.sql — Central fact table for event analytics
  • dim_devices.sql — Device dimension
  • dim_users.sql — User dimension (optional but recommended)

Deliverable: Complete the models in models/marts/
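
As a shape reference, a dimension can start as little more than a grouped select over the staging model. The sketch below assumes stg_events exposes device_id and a normalized event_timestamp; adapt it to your actual staging columns.

    -- Sketch only: one row per device, with simple lifecycle attributes.
    with events as (
        select * from {{ ref('stg_events') }}
    )

    select
        device_id,
        min(event_timestamp) as first_seen_at,
        max(event_timestamp) as last_seen_at,
        count(*)             as lifetime_events
    from events
    where device_id is not null
    group by device_id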


Task 3 — Analytics Queries (Required)

Goal: Write SQL to answer 2–3 of the following questions.

  1. Daily active devices
  2. Events per device per day
  3. Top event types in the last 7 days
  4. Average events per user
  5. Event count trends over time

Focus on:

  • Correct joins between fact and dimension tables
  • Readability (use CTEs where appropriate)
  • Performance considerations

Deliverable: Queries in analyses/analytics_queries.sql
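
For example, "daily active devices" is typically a one-CTE query against the fact table. The sketch below assumes fact_events carries device_id and a UTC event_timestamp.

    -- Sketch only: distinct devices per day.
    with daily as (
        select
            cast(event_timestamp as date) as event_date,
            count(distinct device_id)     as active_devices
        from {{ ref('fact_events') }}
        group by 1
    )

    select * from daily
    order by event_date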


Project Structure

├── README.md                       # This file (add your answers below)
├── dbt_project.yml                 # dbt project configuration
├── profiles.yml.example            # Example dbt profile (copy to ~/.dbt/)
├── seeds/
│   └── raw_events.csv              # Raw input data (loaded via dbt seed)
├── models/
│   ├── staging/
│   │   ├── _staging__sources.yml   # Source definitions
│   │   ├── _staging__models.yml    # Model documentation & tests
│   │   └── stg_events.sql          # YOUR WORK: Staging model
│   └── marts/
│       ├── _marts__models.yml      # Model documentation & tests
│       ├── fact_events.sql         # YOUR WORK: Fact table
│       ├── dim_devices.sql         # YOUR WORK: Device dimension
│       └── dim_users.sql           # YOUR WORK: User dimension
├── analyses/
│   └── analytics_queries.sql       # YOUR WORK: Analytics queries
├── tests/                          # Custom data tests (optional)
└── macros/                         # Custom macros (optional)
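
If you add custom tests, recall that a dbt singular test is just a SQL file in tests/ that passes when it returns zero rows. A sketch, assuming fact_events carries a UTC event_timestamp (the file name is hypothetical):

    -- tests/assert_no_future_events.sql
    -- Fails if any event claims to occur in the future.
    select *
    from {{ ref('fact_events') }}
    where event_timestamp > current_timestamp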

Getting Started

Prerequisites

  • Python 3.8+
  • dbt-core with a database adapter

Recommended setup (DuckDB - no external database required):

    pip install dbt-core dbt-duckdb

Alternative adapters:

    pip install dbt-postgres   # For PostgreSQL
    pip install dbt-sqlite     # For SQLite

Setup

  1. Clone this repository

  2. Copy the example profile and configure:

    cp profiles.yml.example ~/.dbt/profiles.yml

    Or set DBT_PROFILES_DIR to this directory.

  3. Verify dbt can connect:

    dbt debug

  4. Load the seed data:

    dbt seed

  5. Run your models:

    dbt run

  6. Run tests:

    dbt test

Useful Commands

    dbt seed                    # Load raw_events.csv into database
    dbt run                     # Build all models
    dbt run --select staging    # Build only staging models
    dbt run --select marts      # Build only mart models
    dbt test                    # Run all tests
    dbt compile                 # Compile SQL without running (useful for analyses)
    dbt docs generate           # Generate documentation
    dbt docs serve              # View documentation in browser

Raw Data Issues

The seeds/raw_events.csv contains intentional data quality issues for you to handle:

  • Duplicate records: evt_001 and evt_016 appear twice
  • Missing fields: some events are missing device_id or user_id
  • Inconsistent timestamps: ISO 8601, Unix epoch, US format, European format
  • Invalid data: empty event_id, null device_id, invalid timestamp
  • Case inconsistency: usr_100 vs USR_100
  • Evolving schema: some events have location or device_metadata
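
To illustrate the timestamp issue: DuckDB's try_cast and try_strptime return NULL instead of raising an error, so a coalesce chain can try each format in turn. The column name raw_ts and the format strings below are assumptions; match them to the actual seed values.

    -- Sketch only: normalize mixed-format timestamps into one UTC column.
    select
        coalesce(
            try_cast(raw_ts as timestamp),                               -- ISO 8601
            cast(to_timestamp(try_cast(raw_ts as bigint)) as timestamp), -- Unix epoch (seconds)
            try_strptime(raw_ts, '%m/%d/%Y %H:%M:%S'),                   -- US format
            try_strptime(raw_ts, '%d-%m-%Y %H:%M:%S')                    -- European format
        ) as event_timestamp_utc
    from {{ ref('raw_events') }}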

Evaluation Criteria

  • Correctness (30%): Does the solution work? Are edge cases handled?
  • Data Modeling (25%): Is the schema well-designed for analytics?
  • dbt Best Practices (20%): Proper use of staging/marts, tests, documentation
  • Code Quality (15%): Readable, maintainable SQL and project organization
  • Production Thinking (10%): Monitoring, error handling, scalability considerations

Your Submission

Instructions: Complete the sections below with your work.

How to Run

    # Your commands here

Notes

  • Please do not use cloud services or external accounts
  • Please do not use AI to write your solutions for this project
  • PLEASE DO NOT SPEND MORE THAN 4 HOURS ON THIS PROJECT
  • If you run out of time, document what you would have done next
  • Questions? Email your recruiter or hiring manager

About

SafeStreets take-home assessment for mid-level and senior Data Engineer candidates.
