Skip to content

BuildCircle/dbt-kata

Repository files navigation

StreamFlix Analytics Engineer Tech Test

Welcome to the StreamFlix technical assessment!

Scenario

StreamFlix is a fictional video streaming service. We are looking to better understand our Monthly Recurring Revenue (MRR) and Churn rates. You have been given a raw dbt project and access to our raw data dumps.

The Data

You will find three CSV files in the seeds/ directory:

  1. raw_users: User account information.
    • user_id: Unique identifier (mostly).
    • created_at: Account creation timestamp.
    • country: User's country code.
    • marketing_channel: How the user found us.
  2. raw_subscriptions: Subscription history.
    • subscription_id: Unique identifier.
    • user_id: Foreign key to users.
    • plan_type: Basic, Pro, or Premium.
    • status: 'active' or 'cancelled'.
    • start_date: When the subscription started.
    • end_date: When it ended (NULL if active).
  3. raw_payments: Payment transaction logs.
    • payment_id: Unique identifier.
    • subscription_id: Foreign key to subscriptions.
    • amount: Transaction amount.
    • payment_date: Date of payment.
    • status: 'success' or 'failed'.

Setup & Running

This project is configured to use DuckDB, so you don't need to set up an external database. If you are not confident running locally, use a github workspace with a blank template.

Prerequisites

  • Python 3.8+ installed.

Installation

  1. Clone the repository.
  2. Install the required dependencies:
    pip install -r requirements.txt
    This installs dbt-duckdb for the project and pandas for running Python analysis.

Running the Project

To build the models and run the tests:

dbt build --profiles-dir .

This will:

  1. Load the CSV seeds into a local dbt.duckdb file.
  2. Run your models.
  3. Run your tests.

Your Task

Please use dbt to model this data and answer the following business questions.

1. Data Cleaning & Staging

Create staging models (stg_) to clean the raw data.

  • Note: The data engineering team says the source systems are a bit "messy". Watch out for duplicates, test accounts (User ID 999), and data inconsistencies.

2. Data Modeling

Build the following models in your marts layer:

  • dim_users: A user dimension table showing the user's current subscription status and lifetime value.
  • fct_mrr: A monthly snapshot fact table that shows the MRR (Monthly Recurring Revenue) for each user for each month.
    • Tip: Only successful payments count towards revenue.

3. Analysis

Create a simple analysis (SQL file in analyses/ or a dashboard description) to answer:

  • What is the Churn Rate for the last 3 months?
  • Which Marketing Channel has the highest average Lifetime Value (LTV)?

4. Testing

Add dbt tests to ensure your models are reliable. We expect to see at least:

  • Unique and Not Null tests on primary keys.
  • Accepted values tests where appropriate.

Hint: You can define tests in a YAML file in the models/ directory. For example:

version: 2
models:
  - name: dim_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null

Running Analysis with Python

You can query the DuckDB database directly using Python to perform advanced analysis or generate reports.

Running the Analysis Script

A script run_analysis.py is included. It executes compiled SQL files generated by dbt. This allows you to write your analysis in dbt/SQL (in the analyses/ directory) and run them via Python.

  1. Compile your dbt project:
    dbt compile
    This generates the executable SQL in target/compiled/streamflix/analyses/.
  2. Run the python script with the filename:
    python run_analysis.py new_users_last_month.sql

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages