Using Hop to transform data and write to another database #6308
Unanswered
HaruspexSan
asked this question in
Q&A
Replies: 1 comment
-
|
@harrisward created a sample repository that loads data from the MySQL Sakila database to a small data warehouse, you'll find some great examples there. To aggregate: Group By transform. This requires sorted data. For small data sets that fit entirely in memory, you can use the Memory Group By transform. For the deleted users: either exclude those users from your input query or with a Filter Rows transform. Your questions:
|
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hello Apache-Hop Community,
Disclaimer: I am new to Apache Hop and using it for my Master's thesis. I appreciate any guidance!
The Goal: I am building a pipeline to populate a BI fact table from a production Postgres database.
Source: A Users table containing raw transactional data (Granularity: 1 row per user).
Fields: ID, created_at, deleted_at, status etc.
Target: A Daily_Stats BI table (Granularity: 1 row per day).
Desired Fields: Date, NewUsersThatDay, TotalUserCount (Cumulative).
The Problem: I am struggling to transform the row-level user data into the aggregated daily format within a single pipeline.
Especially with the issue of different numbers of input fields per stream part.
Furthermore, users with the deleted_at field not being null be deleted after 30 days. Which means I need to add the functionality that after the status changes from active to deleted, i need to reduce the running count by 1 as the user would otherwise be counted in the "rolling count" eventhough it was deleted.
So my questions are:
Is all of this possible in one big pipeline? Or do I have to split it up into several?
How do i use scripting or SQL or any other transform to calculate the rolling count, especially because i need the last rolling count as input from the database and table i will be writing my output to?
How do i calculate the NewUsersThatDay field as i have to aggregate not within a row but with the rows from that day?
Thank you so much in advance
Beta Was this translation helpful? Give feedback.
All reactions