diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/_index.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/_index.md index 381228043b..09de667c8a 100644 --- a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/_index.md @@ -1,19 +1,24 @@ --- -title: Deploy ClickHouse on Google Cloud C4A Arm virtual machines +title: Build a Real-Time Analytics Pipeline with ClickHouse on Google Cloud Axion (Arm-based C4A VMs) -minutes_to_complete: 30 +minutes_to_complete: 50 -who_is_this_for: This is an introductory topic for developers deploying and optimizing ClickHouse on Arm-based Linux environments using Google Cloud C4A virtual machines powered by Axion processors, to evaluate ClickHouse performance and behaviour on Arm-based infrastructure. +who_is_this_for: This learning path is intended for software developers, data engineers, and platform engineers who want to build and benchmark a real-time analytics pipeline using ClickHouse on Linux/Arm64 environments, specifically Google Cloud C4A virtual machines powered by Axion processors. learning_objectives: - - Provision an Arm-based SUSE SLES virtual machine on Google Cloud using C4A instances powered by Axion processors - - Install and start a ClickHouse server on a SUSE Arm64 (C4A) virtual machine - - Verify ClickHouse functionality by connecting to the server and running basic insert and query operations - - Run baseline ClickHouse performance tests to produce throughput and query latency results for evaluating Arm-based deployments on Google Cloud + - Provision an Arm-based SUSE SLES virtual machine on Google Cloud using C4A (Axion processors) + - Configure Google Cloud Pub/Sub for real-time log ingestion + - Deploy and validate ClickHouse on a SUSE Linux Arm64 (Axion) VM + - Build a streaming ETL pipeline using Apache Beam and Google Dataflow + - Ingest real-time Pub/Sub data into ClickHouse using Dataflow + - Validate end-to-end data flow from Pub/Sub to ClickHouse + - Perform baseline and analytical query benchmarking on ClickHouse running on Arm64 + - Measure and report query latency (including p95) on Axion processors prerequisites: - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled - Basic familiarity with [ClickHouse](https://clickhouse.com/) + - Basic understanding of databases and SQL author: Pareena Verma @@ -27,7 +32,10 @@ armips: tools_software_languages: - ClickHouse - - clickhouse-benchmark + - Apache Beam + - Google Dataflow + - Google Cloud Pub/Sub + - Python 3.11 operatingsystems: - Linux diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/baseline.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/baseline.md index 4a6b558c1b..cb3667a96c 100644 --- a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/baseline.md +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/baseline.md @@ -1,6 +1,7 @@ --- title: Establish a ClickHouse baseline on Arm -weight: 5 +weight: 7 + ### FIXED, DO NOT MODIFY layout: learningpathall diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/benchmarking.md index 2b3ea56389..7dac0ff0f1 100644 --- a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/benchmarking.md +++ 
b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/benchmarking.md @@ -1,286 +1,260 @@ --- -title: Benchmark ClickHouse performance -weight: 6 +title: ClickHouse Benchmarking on Google Axion (Arm) +weight: 9 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Prepare for benchmarking -ClickHouse provides an official benchmarking utility called `clickhouse-benchmark`, which is included in the ClickHouse installation. This tool measures query throughput and latency. +## ClickHouse Benchmarking on Axion Processors +This phase benchmarks **query latency on ClickHouse running on Google Axion (Arm64)**. +The goal is to measure **repeatable query latency** with a focus on **p95 latency**, using data ingested via the real-time Dataflow pipeline. +## Prepare ClickHouse for Accurate Latency Measurement -## Run benchmark tests +### Disable Query Cache +ClickHouse may serve repeated queries from its query cache, which can artificially reduce latency numbers. To ensure that every query is fully executed, the query cache is disabled. -You can benchmark different aspects of ClickHouse performance, including read queries, aggregations, concurrent workloads, and insert operations. +Run this **inside the ClickHouse client**: +```sql +SET use_query_cache = 0; +``` +This ensures every query is executed fully and not served from cache. -### Verify the benchmarking tool exists -Confirm that `clickhouse-benchmark` is installed and available on the system before running performance tests: +### Validate Dataset Size +Ensures enough data is present to produce meaningful latency results. ```console -which clickhouse-benchmark +SELECT count(*) FROM realtime.logs; ``` - -The output is similar to: - +You should see an output similar to: ```output -/usr/bin/clickhouse-benchmark + ┌─count()─┐ +1. │ 5000013 │ -- 5.00 million + └─────────┘ ``` -### Prepare benchmark database and table - -Create a test database and table: - -```console -clickhouse client -``` +If data volume is low, generate additional rows (optional): ```sql -CREATE DATABASE IF NOT EXISTS bench; -USE bench; - -CREATE TABLE IF NOT EXISTS hits -( - event_time DateTime, - user_id UInt64, - url String -) -ENGINE = MergeTree -ORDER BY (event_time, user_id); +INSERT INTO realtime.logs +SELECT + now() - number, + concat('service-', toString(number % 10)), + 'INFO', + 'benchmark message' +FROM numbers(1000000); ``` -The output is similar to: +You should see an output similar to: ```output -Query id: 83485bc4-ad93-4dfc-bafe-c0e2a45c1b34 +Query id: 8fcbefab-fa40-4124-8f23-516fca2b8fdd Ok. -0 rows in set. Elapsed: 0.005 sec. -``` - -Exit the client: - -```console -exit +1000000 rows in set. Elapsed: 0.058 sec. Processed 1.00 million rows, 8.00 MB (17.15 million rows/s., 137.20 MB/s.) +Peak memory usage: 106.54 MiB. ``` -### Load benchmark data +### Define Benchmark Queries +These queries represent common real-time analytics patterns: -Insert one million sample records into the table: +- **Filtered count** – service-level analytics +- **Time-windowed count** – recent activity +- **Aggregation by service** – grouped analytics -```sql -clickhouse-client --query " -INSERT INTO bench.hits -SELECT - now() - number, - number, - concat('/page/', toString(number % 100)) -FROM numbers(1000000)" -``` +Each query scans and processes millions of rows to stress the execution engine. 
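+
+Before running them, you can optionally check how much data the table holds on disk. This is a small sanity-check sketch using the ClickHouse `system.parts` system table; it assumes the `realtime.logs` table created earlier in this Learning Path:
+
+```sql
+-- Summarize the active data parts backing the benchmark table
+SELECT
+    table,
+    sum(rows) AS total_rows,
+    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
+FROM system.parts
+WHERE database = 'realtime' AND table = 'logs' AND active
+GROUP BY table;
+```
+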
-Verify the data load: +**Query 1 – Filtered Count (Service-level analytics)** ```sql -clickhouse-client --query "SELECT count(*) FROM bench.hits" +SELECT count(*) +FROM realtime.logs +WHERE service = 'service-5'; ``` -The output is similar to: +You should see an output similar to: ```output -1000000 -``` +Query id: cfbab386-7168-42ce-a752-2d5146f68b48 -### Run read query benchmark + ┌─count()─┐ +1. │ 350000 │ + └─────────┘ +1 row in set. Elapsed: 0.013 sec. Processed 6.00 million rows, 74.50 MB (466.81 million rows/s., 5.80 GB/s.) +Peak memory usage: 3.25 MiB. +``` -Measure how fast ClickHouse can scan and count rows using a filter: +**Query 2 – Time-windowed Count (Recent activity)** ```sql -clickhouse-benchmark \ - --host localhost \ - --port 9000 \ - --iterations 10 \ - --concurrency 1 \ - --query "SELECT count(*) FROM bench.hits WHERE url LIKE '/page/%'" +SELECT count(*) +FROM realtime.logs +WHERE event_time >= now() - INTERVAL 10 MINUTE; ``` -The output is similar to: +You should see an output similar to: ```output -Loaded 1 queries. - -Queries executed: 10 (100%). - -localhost:9000, queries: 10, QPS: 63.167, RPS: 63167346.434, MiB/s: 957.833, result RPS: 63.167, result MiB/s: 0.000. - -0% 0.003 sec. -10% 0.003 sec. -20% 0.003 sec. -30% 0.004 sec. -40% 0.004 sec. -50% 0.004 sec. -60% 0.004 sec. -70% 0.004 sec. -80% 0.004 sec. -90% 0.004 sec. -95% 0.005 sec. -99% 0.005 sec. -99.9% 0.005 sec. -99.99% 0.005 sec. +Query id: 7654746b-3068-4663-a5c6-6944d9c2d2b9 + ┌─count()─┐ +1. │ 572 │ + └─────────┘ +1 row in set. Elapsed: 0.003 sec. ``` -### Run aggregation query benchmark - -Test the performance of grouping and aggregation operations: +**Query 3 – Aggregation by Service** ```sql -clickhouse-benchmark \ - --host localhost \ - --port 9000 \ - --iterations 10 \ - --concurrency 2 \ - --query " - SELECT - url, +SELECT + service, count(*) AS total - FROM bench.hits - GROUP BY url - " +FROM realtime.logs +GROUP BY service +ORDER BY total DESC; ``` -The output is similar to: +You should see an output similar to: ```output -Queries executed: 10 (100%). - -localhost:9000, queries: 10, QPS: 67.152, RPS: 67151788.647, MiB/s: 1018.251, result RPS: 6715.179, result MiB/s: 0.153. - -0% 0.005 sec. -10% 0.005 sec. -20% 0.005 sec. -30% 0.007 sec. -40% 0.007 sec. -50% 0.007 sec. -60% 0.007 sec. -70% 0.007 sec. -80% 0.007 sec. -90% 0.007 sec. -95% 0.008 sec. -99% 0.008 sec. -99.9% 0.008 sec. -99.99% 0.008 sec. +Query id: c48c0d30-0ef6-4fb9-bbb9-815a509a5f91 + + ┌─service────┬──total─┐ + 1. │ service-6 │ 350000 │ + 2. │ service-1 │ 350000 │ + 3. │ service-0 │ 350000 │ + 4. │ service-7 │ 350000 │ + 5. │ service-3 │ 350000 │ + 6. │ service-4 │ 350000 │ + 7. │ service-5 │ 350000 │ + 8. │ service-2 │ 350000 │ + 9. │ service-9 │ 350000 │ +10. │ service-8 │ 350000 │ +11. │ service-10 │ 250000 │ +12. │ service-15 │ 250000 │ +13. │ service-16 │ 250000 │ +14. │ service-13 │ 250000 │ +15. │ service-18 │ 250000 │ +16. │ service-17 │ 250000 │ +17. │ service-19 │ 250000 │ +18. │ service-12 │ 250000 │ +19. │ service-11 │ 250000 │ +20. │ service-14 │ 250000 │ +21. │ api │ 12 │ +22. │ local │ 1 │ + └────────────┴────────┘ +22 rows in set. Elapsed: 0.011 sec. Processed 6.00 million rows, 74.50 MB (527.10 million rows/s., 6.54 GB/s.) +Peak memory usage: 7.18 MiB. 
``` -### Run concurrent read workload benchmark - -Run multiple queries simultaneously to evaluate how ClickHouse handles higher user load: +### Run Repeatable Latency Measurements +To calculate reliable latency metrics, the same query is executed multiple times(10) using `clickhouse-client --time`. ```sql -clickhouse-benchmark \ - --host localhost \ - --port 9000 \ - --iterations 20 \ - --concurrency 8 \ - --query " - SELECT count(*) - FROM bench.hits - WHERE user_id % 10 = 0 - " +clickhouse-client --time --query " +SELECT count(*) +FROM realtime.logs +WHERE service = 'service-5'; +" ``` -The output is similar to: +You should see an output similar to: ```output -Loaded 1 queries. - -Queries executed: 20 (100%). - -localhost:9000, queries: 20, QPS: 99.723, RPS: 99723096.882, MiB/s: 760.827, result RPS: 99.723, result MiB/s: 0.001. - -0% 0.012 sec. -10% 0.012 sec. -20% 0.013 sec. -30% 0.017 sec. -40% 0.020 sec. -50% 0.029 sec. -60% 0.029 sec. -70% 0.038 sec. -80% 0.051 sec. -90% 0.062 sec. -95% 0.063 sec. -99% 0.078 sec. -99.9% 0.078 sec. -99.99% 0.078 sec. +350000 +0.009 +350000 +0.009 +350000 +0.009 +350000 +0.011 +350000 +0.010 +350000 +0.0011 +350000 +0.009 +350000 +0.009 +350000 +0.009 +350000 +0.011 ``` +**Each run prints:** -### Measure insert performance +- Query result (row count) +- Execution time (seconds) +- Output has row count + time mixed. We only need the time values. -Measure bulk data ingestion speed and write latency: +Edit your file: -```sql -clickhouse-benchmark \ - --iterations 5 \ - --concurrency 4 \ - --query " - INSERT INTO bench.hits - SELECT - now(), - rand64(), - '/benchmark' - FROM numbers(500000) - " +```console +vi latency-results.txt ``` -The output is similar to: -```output -Queries executed: 5 (100%). - -localhost:9000, queries: 5, QPS: 20.935, RPS: 10467305.309, MiB/s: 79.859, result RPS: 0.000, result MiB/s: 0.000. - -0% 0.060 sec. -10% 0.060 sec. -20% 0.060 sec. -30% 0.060 sec. -40% 0.068 sec. -50% 0.068 sec. -60% 0.068 sec. -70% 0.069 sec. -80% 0.069 sec. -90% 0.073 sec. -95% 0.073 sec. -99% 0.073 sec. -99.9% 0.073 sec. -99.99% 0.073 sec. +Only the latency values are required for statistical analysis. Row counts are removed. + +```txt +0.009 +0.009 +0.009 +0.011 +0.010 +0.011 +0.009 +0.009 +0.009 +0.011 ``` -## Understand benchmark metrics +- Clean input for sorting and percentile calculation. +- Remove 350000 lines if they exist. -The benchmarking output includes several key metrics: +**Sort the latency values:** +Latency values are sorted in ascending order to compute percentiles. -- QPS (Queries Per Second): number of complete queries ClickHouse can execute per second. Higher QPS reflects stronger overall query execution capacity. -- RPS (Rows Per Second): number of rows processed every second. Very high RPS values demonstrate ClickHouse's efficiency in scanning large datasets. -- MiB/s (Throughput): data processed per second in mebibytes. High throughput indicates effective CPU, memory, and disk utilization during analytics workloads. -- Latency Percentiles (p50, p95, p99): query response times. p50 is the median latency, while p95 and p99 show tail latency under heavier load, which is critical for understanding performance consistency. -- Iterations: number of times the same query is executed. More iterations improve measurement accuracy and stability. -- Concurrency: number of parallel query clients. Higher concurrency tests ClickHouse's ability to scale under concurrent workloads. 
-- Result RPS / Result MiB/s: size and rate of returned query results. Low values are expected for aggregate queries like `COUNT(*)`. -- Insert Benchmark Metrics: write tests measure ingestion speed and stability. Consistent latency indicates reliable bulk insert performance. +```console +sort -n latency-results.txt +``` +```output +0.009 +0.009 +0.009 +0.009 +0.009 +0.009 +0.010 +0.011 +0.011 +0.011 +``` -## Review the benchmark results +**Calculate p95 latency (manual):** +The p95 latency represents the value under which 95% of query executions complete. -The table below summarizes baseline read, aggregation, concurrent workload, and insert performance for ClickHouse running on a `c4a-standard-4` (4 vCPU, 16 GB memory) Arm64 virtual machine. +**Formula:** -Use these results as a reference point for this specific configuration. They are intended to support comparison across different instance sizes, configurations, or architectures rather than to represent an absolute performance benchmark. +```pqsql +p95 index = ceil(0.95 × N) +``` + +For 10 samples: +```cpp +ceil(0.95 × 10) = ceil(9.5) = 10 +``` -| Test Category | Test Case | Query / Operation | Iterations | Concurrency | QPS | Rows / sec (RPS) | Throughput (MiB/s) | p50 Latency | p95 Latency | p99 Latency | -| ----------------------- | -------------- | -------------------------------------- | ---------: | ----------: | ----: | ---------------: | -----------------: | ----------: | ----------: | ----------: | -| Read | Filtered COUNT | `COUNT(*) WHERE url LIKE '/page/%'` | 10 | 1 | 63.17 | 63.17 M | 957.83 | 4 ms | 5 ms | 5 ms | -| Read / Aggregate | GROUP BY | `GROUP BY url` | 10 | 2 | 67.15 | 67.15 M | 1018.25 | 7 ms | 8 ms | 8 ms | -| Read (High Concurrency) | Filtered COUNT | `COUNT(*) WHERE user_id % 10 = 0` | 20 | 8 | 99.72 | 99.72 M | 760.83 | 29 ms | 63 ms | 78 ms | -| Write | Bulk Insert | `INSERT SELECT … FROM numbers(500000)` | 5 | 4 | 20.94 | 10.47 M | 79.86 | 68 ms | 73 ms | 73 ms | +The 10th value in the sorted list is your p95 latency. -### Observations +**p95 result** -- Filtered read and aggregation queries processed between 63–67 million rows per second for this dataset and configuration. -- Under higher concurrency (8 parallel clients), the system sustained close to 100 million rows per second, with increased tail latency as expected. -- Aggregation queries using `GROUP BY` achieved over 1 GiB/s of throughput at moderate concurrency. -- Bulk insert tests showed consistent latency across iterations for the tested insert size and concurrency level. +```txt +p95 latency = 0.011 seconds ≈ 11 ms +``` -These results provide a baseline for this environment and can be used to compare alternative configurations, instance sizes, or architectures in subsequent testing. +The ClickHouse query was executed 10 times on a GCP Axion (Arm) VM. Observed p95 query latency was ~11 ms, demonstrating consistently low-latency analytical performance on Arm-based infrastructure. +### Benchmark summary +Results from the earlier run on the `c4a-standard-4` (4 vCPU, 16 GB memory) Arm64 VM in GCP (SUSE): +- ClickHouse on **Google Axion (Arm64)** delivered consistently low query latency, even while scanning ~6 million rows per query. +- Across **10 repeat executions, the p95 latency was ~11 ms**, indicating stable and predictable performance. +- Disabling the query cache ensured true execution latency, not cache-assisted results. +- Analytical queries sustained **500M+ rows/sec throughput** with minimal memory usage. 
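+
+If you want to repeat this measurement without editing files by hand, the following shell sketch automates the loop and the percentile arithmetic. It assumes the same query used above, a local `clickhouse-client`, and that `--time` writes the elapsed seconds to stderr:
+
+```bash
+# Run the query 10 times, collecting only the timing values printed on stderr
+rm -f latency-results.txt
+for i in $(seq 1 10); do
+  clickhouse-client --time \
+    --query "SELECT count(*) FROM realtime.logs WHERE service = 'service-5'" \
+    >/dev/null 2>>latency-results.txt
+done
+
+# Sort the samples and report the ceil(0.95 * N)-th value as the p95 latency
+sort -n latency-results.txt | awk '
+  { v[NR] = $1 }
+  END {
+    idx = int(0.95 * NR); if (idx < 0.95 * NR) idx++
+    printf "p95 latency: %s seconds\n", v[idx]
+  }'
+```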
diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/dataflow-streaming-etl.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/dataflow-streaming-etl.md new file mode 100644 index 0000000000..458e98938e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/dataflow-streaming-etl.md @@ -0,0 +1,279 @@ +--- +title: Dataflow Streaming ETL to ClickHouse +weight: 8 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Dataflow Streaming ETL (Pub/Sub → ClickHouse) +This section implements a real-time streaming ETL pipeline that ingests events from Pub/Sub, processes them using Dataflow (Apache Beam), and writes them into ClickHouse running on a GCP Axion (Arm64) VM. + +## Pipeline Overview +Flow + +```bash +Pub/Sub → Dataflow (Apache Beam) → ClickHouse (Axion VM) +``` + +**Key components:** + +- Pub/Sub: event ingestion +- Dataflow: streaming ETL and transformation +- ClickHouse: real-time analytical storage on Arm64 + +### Install Python 3.11 on the Axion VM +Install Python 3.11 and the required system packages + +```console +sudo zypper refresh +sudo zypper install -y python311 python311-pip python311-devel gcc gcc-c++ +``` + +Verify installation: + +```console +python3.11 --version +pip3.11 --version +``` + +### Create a Python Virtual Environment (Recommended) +Using a virtual environment avoids dependency conflicts with the system Python. + +```console +python3.11 -m venv beam-venv +source beam-venv/bin/activate +``` + +### Install Apache Beam with GCP Support +Install Apache Beam and the required dependencies for Dataflow: + +```console +pip install --upgrade pip +pip install "apache-beam[gcp]" +pip install requests +``` + +Verify Beam installation: + +```console +python -c "import apache_beam; print(apache_beam.__version__)" +``` + +### Prepare ClickHouse for Streaming Ingestion + +Connect to ClickHouse on the Axion VM: + +```console +clickhouse client +``` + +**Creates the target database and table for streaming inserts:** + +```sql +CREATE DATABASE IF NOT EXISTS realtime; + +CREATE TABLE IF NOT EXISTS realtime.logs +( + event_time DateTime, + service String, + level String, + message String +) +ENGINE = MergeTree +ORDER BY event_time; +``` + +Verify the table: + +```sql +SHOW TABLES FROM realtime; +``` +```output +Query id: aa25de9d-c07f-4538-803f-5473744631bc + + ┌─name─┐ +1. │ logs │ + └──────┘ +1 row in set. Elapsed: 0.001 sec. +``` + +**Exit the client:** + +```sql +exit; +``` + +### Validate Pub/Sub (Before Dataflow) +Before running Dataflow, confirm that messages can be published and pulled. 
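+
+First, you can confirm from the VM that the topic and subscription created earlier are visible. This quick check assumes the `logs-topic` and `logs-sub` names used in this Learning Path:
+
+```console
+gcloud pubsub topics describe logs-topic
+gcloud pubsub subscriptions describe logs-sub
+```
+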
+
+**Publish a test message:**
+
+```console
+gcloud pubsub topics publish logs-topic \
+  --message '{"event_time":"2025-12-30 12:00:00","service":"api","level":"INFO","message":"PRE-DATAFLOW TEST"}'
+```
+
+**Pull the message:**
+
+```console
+gcloud pubsub subscriptions pull logs-sub --limit=1 --auto-ack
+```
+
+```output
+┌────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────────────┬──────────────┬────────────┬──────────────────┬────────────┐
+│ DATA                                                                                               │ MESSAGE_ID        │ ORDERING_KEY │ ATTRIBUTES │ DELIVERY_ATTEMPT │ ACK_STATUS │
+├────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────────────┼──────────────┼────────────┼──────────────────┼────────────┤
+│ {"event_time":"2025-12-30 12:00:00","service":"api","level":"INFO","message":"PRE-DATAFLOW TEST"}  │ 17590032987110080 │              │            │                  │ SUCCESS    │
+└────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────┴──────────────┴────────────┴──────────────────┴────────────┘
+```
+
+Successful output confirms:
+
+- The Pub/Sub topic is writable
+- The subscription is readable
+- IAM is functioning correctly
+
+### Create the Dataflow Streaming ETL Script
+
+Create the Dataflow pipeline file. It defines a streaming Beam pipeline that:
+
+- Reads JSON events from Pub/Sub
+- Parses each message
+- Writes rows to ClickHouse over HTTP
+
+```console
+vi dataflow_etl.py
+```
+
+Paste the following streaming pipeline:
+
+```python
+import json
+import apache_beam as beam
+from apache_beam.options.pipeline_options import PipelineOptions
+
+PROJECT_ID = "<PROJECT_ID>"
+SUBSCRIPTION = "projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>"
+CLICKHOUSE_URL = "http://<CLICKHOUSE_VM_IP>:8123"
+
+class ParseMessage(beam.DoFn):
+    """Decode the Pub/Sub payload and parse it as JSON."""
+    def process(self, element):
+        yield json.loads(element.decode("utf-8"))
+
+class WriteToClickHouse(beam.DoFn):
+    """Insert each parsed event into ClickHouse over the HTTP interface."""
+    def process(self, element):
+        import requests  # imported inside the DoFn so it is resolved on the Dataflow workers
+        row = (
+            f"{element['event_time']}\t"
+            f"{element['service']}\t"
+            f"{element['level']}\t"
+            f"{element['message']}\n"
+        )
+        requests.post(
+            CLICKHOUSE_URL,
+            data=row,
+            headers={"Content-Type": "text/tab-separated-values"},
+            params={"query": "INSERT INTO realtime.logs FORMAT TabSeparated"}
+        )
+
+options = PipelineOptions(
+    streaming=True,
+    save_main_session=True
+)
+
+with beam.Pipeline(options=options) as p:
+    (
+        p
+        | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
+        | "Parse JSON" >> beam.ParDo(ParseMessage())
+        | "Write to ClickHouse" >> beam.ParDo(WriteToClickHouse())
+    )
+```
+
+Pipeline logic:
+
+- **ReadFromPubSub** → read streaming messages
+- **ParseMessage** → decode JSON
+- **WriteToClickHouse** → insert into ClickHouse using the TabSeparated format
+
+Replace `<PROJECT_ID>`, `<SUBSCRIPTION_NAME>`, and `<CLICKHOUSE_VM_IP>` with your GCP project ID, Pub/Sub subscription name, and the internal IP address of your ClickHouse VM (the ClickHouse HTTP interface listens on port 8123).
+
+You can run the following commands from your VM to get each required value:
+
+```console
+gcloud config get-value project
+gcloud pubsub subscriptions list
+hostname -I
+```
+
+### Run the Dataflow Streaming Job
+
+This launches the pipeline on managed Dataflow workers:
+
+```console
+python3.11 dataflow_etl.py \
+  --runner=DataflowRunner \
+  --project=<PROJECT_ID> \
+  --region=<REGION> \
+  --temp_location=gs://<BUCKET_NAME>/temp \
+  --streaming
+```
+
+- `<PROJECT_ID>` – Your Google Cloud project ID (e.g. my-project-123)
+- `<REGION>` – Region where Dataflow runs (e.g.
us-central1) +- `` – Existing GCS bucket used for Dataflow temp files + +```output +Autoscaling is enabled for Dataflow Streaming Engine. Workers will scale between 1 and 100 unless maxNumWorkers is specified. +``` + +**This indicates:** + +- Streaming mode is active +- Workers scale automatically + +### End-to-End Validation +Publish live streaming data. + +```console +gcloud pubsub topics publish logs-topic \ + --message '{"event_time":"2025-12-30 13:30:00","service":"api","level":"INFO","message":"FRESH DATAFLOW WORKING"}' +``` + +Verify data in ClickHouse: + +```sql +SELECT * +FROM realtime.logs +ORDER BY event_time DESC +LIMIT 5; +``` + +Output: + +```output +SELECT * +FROM realtime.logs +ORDER BY event_time DESC +LIMIT 5 + +Query id: 74a105d0-2e04-4053-825c-d30e53424d14 + + ┌──────────event_time─┬─service───┬─level─┬─message────────────────┐ +1. │ 2025-12-30 13:30:00 │ api │ INFO │ FRESH DATAFLOW WORKING │ +2. │ 2025-12-30 13:00:00 │ api │ INFO │ DATAFLOW FINAL SUCCESS │ +3. │ 2025-12-30 12:45:00 │ api │ INFO │ FINAL DATAFLOW SUCCESS │ +4. │ 2025-12-30 08:48:35 │ service-0 │ INFO │ benchmark message 0 │ +5. │ 2025-12-30 08:48:34 │ service-1 │ INFO │ benchmark message 1 │ + └─────────────────────┴───────────┴───────┴────────────────────────┘ +```` + +This confirms: + +- Pub/Sub events are streamed continuously +- Dataflow processes data in real time +- ClickHouse ingests data on Axion (Arm64) via HTTP +- The end-to-end real-time pipeline is operational + +This pipeline serves as the foundation for ClickHouse latency benchmarking and real-time analytics on Google Axion. diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/gcp_firewall_setup.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/gcp_firewall_setup.md new file mode 100644 index 0000000000..4e7c866064 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/gcp_firewall_setup.md @@ -0,0 +1,38 @@ +--- +title: Create a Firewall Rule on GCP +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Overview + +In this section, you will create a firewall rule in the Google Cloud Console to allow inbound TCP traffic on port 8123. + +{{% notice Note %}} +For support on GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/). +{{% /notice %}} + +## Create a Firewall Rule in GCP + +To expose the TCP port 8123, create a firewall rule. + +Navigate to the [Google Cloud Console](https://console.cloud.google.com/), go to **VPC Network > Firewall**, and select **Create firewall rule**. + +![Create a firewall rule alt-text#center](images/firewall-rule1.png "Create a firewall rule") + +Set the **Name** of the new rule to "allow-tcp-8123". Select your network that you intend to bind to your VM (default is "autoscaling-net", but your organization might have others). + + +![Create a firewall rule alt-text#center](images/network-rule2.png "Creating the TCP/8123 firewall rule") + +Next, Set **Direction of traffic** to "Ingress". Set **Allow on match** to "Allow" and **Targets** to "Specified target tags". Set **Source IPv4 ranges** to "0.0.0.0/0". + +![Create a firewall rule alt-text#center](images/network-rule3.png "Creating the TCP/8123 firewall rule") + +Finally, select **Specified protocols and ports** under the **Protocols and ports** section. Select the **TCP** checkbox, enter "8123" in the **Ports** text field, and select **Create**. 
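+
+If you prefer the command line, an equivalent rule can be created with the gcloud CLI. This is a sketch under assumptions: the network name, source range, and target tag below are placeholders, so adjust them to match the choices you made in the console:
+
+```console
+gcloud compute firewall-rules create allow-tcp-8123 \
+  --network=default \
+  --direction=INGRESS \
+  --action=ALLOW \
+  --rules=tcp:8123 \
+  --source-ranges=0.0.0.0/0 \
+  --target-tags=clickhouse-server
+```
+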
+ +![Specifying the TCP port to expose alt-text#center](images/network-port.png "Specifying the TCP port to expose") + +The network firewall rule is now created, and you can continue with the VM creation. diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/bucket.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/bucket.png new file mode 100644 index 0000000000..b52ffc8b8d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/bucket.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/firewall-rule1.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/firewall-rule1.png new file mode 100644 index 0000000000..e1ab8aecb5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/firewall-rule1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-port.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-port.png new file mode 100644 index 0000000000..412969a8dc Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-port.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule2.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule2.png new file mode 100644 index 0000000000..2e19a5e176 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule3.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule3.png new file mode 100644 index 0000000000..4c36fdae08 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/network-rule3.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub1.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub1.png new file mode 100644 index 0000000000..aad17a09e6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub2.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub2.png new file mode 100644 index 0000000000..a380949fb6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/pub_sub2.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/roles.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/roles.png new file mode 100644 index 0000000000..3e88f8ab66 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/roles.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/verify_pub_sub.png b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/verify_pub_sub.png new file mode 100644 index 0000000000..5d648d44d7 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/images/verify_pub_sub.png differ diff --git 
a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/installation.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/installation.md index 79e7d4396d..7f66a8cadf 100644 --- a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/installation.md +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/installation.md @@ -1,20 +1,63 @@ --- title: Install ClickHouse -weight: 4 +weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Install ClickHouse on GCP VM +## Install ClickHouse and gcloud CLI on GCP VM +This guide covers the installation of **Google Cloud CLI (gcloud)** and **ClickHouse** on a GCP SUSE Linux Arm64 (Axion C4A) VM. +These tools are required to: +- Interact with GCP services such as **Pub/Sub** and **Dataflow** +- Store and query real-time analytics data efficiently using **ClickHouse on Arm64** -This section shows you how to install and validate ClickHouse on your Google Cloud SUSE Linux Arm64 virtual machine. You’ll install ClickHouse using the official repository, verify the installation, start the server, connect using the client, and configure ClickHouse to run as a systemd service for reliable startup. +### Install Google Cloud CLI (gcloud) +The Google Cloud CLI is required to authenticate with GCP, publish Pub/Sub messages, and submit Dataflow jobs from the VM. -{{% notice Note %}}On some SUSE configurations, the ClickHouse system user and runtime directories might not be created automatically. The following steps ensure ClickHouse has the required paths and permissions.{{% /notice %}} +### Download gcloud SDK (Arm64) -### Install required system packages and add the ClickHouse repository +```console +curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-arm.tar.gz +``` + +### Extract and install gcloud + +```console +tar -xzf google-cloud-cli-linux-arm.tar.gz +cd google-cloud-sdk +./install.sh +``` + +Accept the default options during installation. + +### Initialize gcloud + +```console +source ~/.bashrc +gcloud init +``` + +During initialization: + +- Select the correct project (for example: imperial-time-xxxxxx) +- Choose the default region (for example: us-central1) + +### Verify authentication + +```console +gcloud auth list +``` + +You should see an output similar to: +```output +Credentialed Accounts +ACTIVE ACCOUNT +* -compute@developer.gserviceaccount.com +``` -Refresh system repositories and add the ClickHouse repository: +### Install required system packages and the ClickHouse repo +Refresh system repositories and install basic utilities needed to download and run ClickHouse. ```console sudo zypper refresh @@ -22,24 +65,22 @@ sudo zypper addrepo -r https://packages.clickhouse.com/rpm/clickhouse.repo -g sudo zypper --gpg-auto-import-keys refresh clickhouse-stable ``` -### Install ClickHouse - -Install ClickHouse server and client: +### Install ClickHouse via the ClickHouse repo +Download and install ClickHouse for SuSE systems: ```console sudo zypper install -y clickhouse-server clickhouse-client ``` -This installs the following components: +This installs: -- ClickHouse Server: runs the core database engine and handles data storage, queries, and processing. -- ClickHouse Client: provides a command-line interface to connect to the server and run SQL queries. -- ClickHouse Local: allows running SQL queries on local files without starting a server. 
-- Default configuration files (`/etc/clickhouse-server`): stores server settings such as ports, users, storage paths, and performance tuning options. +- **ClickHouse Server** – Runs the core database engine and handles all data storage, queries, and processing. +- **ClickHouse Client** – Provides a command-line interface to connect to the server and run SQL queries. +- **ClickHouse Local** – Allows running SQL queries on local files without starting a server. +- **Default configuration files (/etc/clickhouse-server)** – Stores server settings such as ports, users, storage paths, and performance tuning options. ### Verify the installed version - -Verify that ClickHouse is installed: +Confirm that all ClickHouse components are installed correctly by checking their versions. ```console clickhouse --version @@ -48,7 +89,7 @@ clickhouse client --version clickhouse local --version ``` -The output is similar to: +You should see an output similar to: ```output ClickHouse local version 25.11.2.24 (official build). ClickHouse server version 25.11.2.24 (official build). @@ -56,8 +97,7 @@ ClickHouse client version 25.11.2.24 (official build). ``` ### Create ClickHouse user and directories - -Create a dedicated system user and required directories for data, logs, and runtime files: +Create a dedicated system user and required directories for data, logs, and runtime files. ```console sudo useradd -r -s /sbin/nologin clickhouse || true @@ -65,8 +105,7 @@ sudo mkdir -p /var/lib/clickhouse sudo mkdir -p /var/log/clickhouse-server sudo mkdir -p /var/run/clickhouse-client ``` - -Set proper ownership: +Set proper ownership so ClickHouse can access these directories. ```console sudo chown -R clickhouse:clickhouse \ @@ -78,31 +117,26 @@ sudo chmod 755 /var/lib/clickhouse \ /var/run/clickhouse-client ``` -### Start ClickHouse server manually - -Run the ClickHouse server in the foreground to confirm the configuration is valid: +### Start ClickHouse Server manually +You can just run the ClickHouse server in the foreground to confirm the configuration is valid. ```console sudo -u clickhouse clickhouse server --config-file=/etc/clickhouse-server/config.xml ``` - Keep this terminal open while testing. -### Connect using ClickHouse client - -Open a new SSH terminal and connect to the ClickHouse server: +### Connect using ClickHouse Client +Open a new SSH terminal and connect to the ClickHouse server. ```console clickhouse client ``` - -Run a test query to confirm connectivity: +Run a test query to confirm connectivity. ```sql SELECT version(); ``` - -The output is similar to: +You should see an output similar to: ```output SELECT version() @@ -115,11 +149,17 @@ Query id: ddd3ff38-c0c6-43c5-8ae1-d9d07af4c372 1 row in set. Elapsed: 0.001 sec. ``` -Close the client SSH terminal and press `Ctrl+C` in the server SSH terminal to stop the manual invocation of ClickHouse. The server may take a few seconds to shut down. +Please close the client SSH terminal and press "ctrl-c" in the server SSH terminal to halt the manual invocation of ClickHouse. FYI, the server may take a few seconds to close down when "ctrl-c" is received. -### Create a systemd service +{{% notice Note %}} +Recent benchmarks show that ClickHouse (v22.5.1.2079-stable) delivers up to 26% performance improvements on Arm-based platforms, such as AWS Graviton3, compared to other architectures, highlighting the efficiency of its vectorized execution engine on modern Arm CPUs. 
+For more details, see [this blog post](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/improve-clickhouse-performance-up-to-26-by-using-aws-graviton3).
-Set up ClickHouse as a system service so it starts automatically on boot:
+The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) lists ClickHouse v22.5.1.2079-stable as the minimum recommended version on Arm platforms.
+{{% /notice %}}
+
+### Create a systemd service
+Set up ClickHouse as a system service so it starts automatically on boot.

```console
sudo tee /etc/systemd/system/clickhouse-server.service <<'EOF'
@@ -140,29 +180,28 @@ LimitNOFILE=1048576
WantedBy=multi-user.target
EOF
```
+**Reload systemd and enable the service:**
-Reload systemd and enable the service:

```console
sudo systemctl daemon-reload
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server
```

{{% notice Note %}}
You might see the following error, which can be safely ignored:

`ln: failed to create symbolic link '/etc/init.d/rc2.d/S50clickhouse-server': No such file or directory`
{{% /notice %}}

-## Verify ClickHouse service
-
-Verify the ClickHouse server is running as a background service:
+### Verify ClickHouse service
+Ensure the ClickHouse server is running correctly as a background service.

```console
sudo systemctl status clickhouse-server
```

-The output is similar to:
+You should see output similar to the following, confirming that the ClickHouse server is running under systemd and ready to accept connections:

```output
● clickhouse-server.service - ClickHouse Server
@@ -177,8 +216,7 @@ The output is similar to:
```

### Final validation
-
-Reconnect to ClickHouse and confirm it's operational:
+Reconnect to ClickHouse and confirm it is operational.

```console
clickhouse client
```

```sql
SELECT version();
```

-The output is similar to:
+You should see an output similar to:

```output
SELECT version()

Query id: ddd3ff38-c0c6-43c5-8ae1-d9d07af4c372

1 row in set. Elapsed: 0.001 sec.
```

-ClickHouse is now installed, configured, and running on SUSE Linux Arm64 with automatic startup enabled.
+ClickHouse and the gcloud CLI are now installed, configured, and validated on the GCP Axion (Arm64) VM.
+The system is ready for Pub/Sub validation and the Dataflow streaming ETL pipeline in the following sections.
diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/instance.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/instance.md
index 0338006801..5bb339cd2f 100644
--- a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/instance.md
@@ -1,6 +1,6 @@
---
-title: Create a Google Axion C4A Arm virtual machine
-weight: 3
+title: Create a Google Axion C4A Arm virtual machine on GCP
+weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
@@ -8,34 +8,36 @@ layout: learningpathall

## Overview

-In this section, you'll provision a Google Axion C4A Arm virtual machine on Google Cloud Platform (GCP) using the `c4a-standard-4` machine type (4 vCPUs, 16 GB memory). This configuration provides a consistent baseline for deploying and evaluating ClickHouse later in the Learning Path.
+In this section, you will learn how to provision a Google Axion C4A Arm virtual machine on Google Cloud Platform (GCP) using the `c4a-standard-4` (4 vCPUs, 16 GB memory) machine type in the Google Cloud Console.

-{{% notice Note %}} For help with GCP setup, see [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}}
+{{% notice Note %}}
+For support on GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).
+{{% /notice %}}

-## Provision a Google Axion C4A Arm virtual machine
+## Provision a Google Axion C4A Arm VM in Google Cloud Console

To create a virtual machine based on the C4A instance type:
-
- Navigate to the [Google Cloud Console](https://console.cloud.google.com/).
-- Go to **Compute Engine > VM Instances** and select **Create Instance**.
+- Go to **Compute Engine > VM Instances** and select **Create Instance**.
- Under **Machine configuration**:
-  - Populate fields such as **Instance name**, **Region**, and **Zone**.
-  - Set **Series** to `C4A`.
-  - Select `c4a-standard-4` for the machine type.
+  - Populate fields such as **Instance name**, **Region**, and **Zone**.
+  - Set **Series** to `C4A`.
+  - Select `c4a-standard-4` for the machine type.
+
+  ![Create a Google Axion C4A Arm virtual machine in the Google Cloud Console with c4a-standard-4 selected alt-text#center](images/gcp-vm.png "Creating a Google Axion C4A Arm virtual machine in Google Cloud Console")
-  ![Create a Google Axion C4A Arm virtual machine in the Google Cloud Console with c4a-standard-4 selected alt-text#center](images/gcp-vm.png "Creating a Google Axion C4A Arm virtual machine in Google Cloud Console")
-- Under **OS and Storage**, select **Change**, then choose an Arm64-based OS image. For this Learning Path, use **SUSE Linux Enterprise Server**.
-- Select **Pay As You Go** for the license type, then click **Select**.
+- Under **OS and Storage**, select **Change**, then choose an Arm64-based OS image. For this Learning Path, use **SUSE Linux Enterprise Server**.
+- If you are using **SUSE Linux Enterprise Server**, select **Pay As You Go** for the license type.
+- After selecting the image and license type, click **Select**.
- Under **Networking**, enable **Allow HTTP traffic**.
- Click **Create** to launch the instance.
+- Once the instance is created, you should see an **SSH** option to the right of the VM in the instance list. Click it to launch an SSH shell into your VM instance:

-Once the instance is created, you should see an **SSH** option to the right of the VM in the list. Click this to open an SSH session in your browser:
-
-![Invoke an SSH session via your browser alt-text#center](images/gcp-ssh.png "Invoke an SSH session into your running VM instance")
+![Invoke an SSH session via your browser alt-text#center](images/gcp-ssh.png "Invoke an SSH session into your running VM instance")

-A terminal window opens, showing a shell connected to your VM:
+A browser window opens with a shell connected to your VM instance:

-![Terminal shell in your VM instance alt-text#center](images/gcp-shell.png "Terminal shell in your VM instance")
+![Terminal Shell in your VM instance alt-text#center](images/gcp-shell.png "Terminal shell in your VM instance")

-Next, you'll install ClickHouse on the running Arm-based virtual machine.
\ No newline at end of file
+Next, you will install the gcloud CLI and ClickHouse on the running Arm-based virtual machine.
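+
+If you prefer to script the VM creation instead of using the console, the following gcloud sketch shows an equivalent flow. The zone, instance name, and SLES Arm64 image family are placeholders (family names vary by SLES release), so list the available images first and adjust the values before running the create command:
+
+```console
+# List the SUSE Arm64 images available to your project
+gcloud compute images list --project=suse-cloud --filter="architecture=ARM64"
+
+# Create the C4A VM (replace the placeholders with your own values)
+gcloud compute instances create clickhouse-axion \
+  --zone=us-central1-a \
+  --machine-type=c4a-standard-4 \
+  --image-project=suse-cloud \
+  --image-family=<SLES_ARM64_IMAGE_FAMILY> \
+  --tags=http-server
+```
+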
diff --git a/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/pub_sub_creation.md b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/pub_sub_creation.md new file mode 100644 index 0000000000..d035e492a7 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/clickhouse-gcp/pub_sub_creation.md @@ -0,0 +1,127 @@ +--- +title: GCP Pub/Sub and IAM Setup for ClickHouse Real-Time Analytics on Axion +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Pub/Sub and IAM Setup on GCP (UI-first) +This section prepares the **Google Cloud messaging and access foundation** required for the real-time analytics pipeline. +It focuses on **Pub/Sub resource creation and IAM roles**, ensuring Dataflow and the Axion VM can securely communicate. + +### Create Pub/Sub Topic +The Pub/Sub topic acts as the **ingestion entry point** for streaming log events. + +1. Open **Google Cloud Console** +2. Navigate to **Pub/Sub → Topics** +3. Click **Create Topic** +4. Enter: + - **Topic ID:** `logs-topic` +5. Leave encryption and retention as the default +6. Click **Create** + +This topic will receive streaming log messages from producers. + +![ GCP onsole alt-text#center](images/pub_sub1.png "Figure 1: Pub/Sub Topic") + +### Create Pub/Sub Subscription + +The subscription allows **Dataflow to pull messages** from the topic. + +1. Open the newly created `logs-topic` +2. Click **Create Subscription** +3. Configure: + - **Subscription ID:** `logs-sub` + - **Delivery type:** Pull + - **Ack deadline:** Default (10 seconds) +4. Click **Create** + +![ GCP onsole alt-text#center](images/pub_sub2.png "Figure 2: Pub/Sub Subscription") + +This subscription will later be referenced by the Dataflow pipeline. + +### Verify Pub/Sub Resources + +Navigate to **Pub/Sub → Topics** and confirm: + +- Topic: `logs-topic` +- Subscription: `logs-sub` + +This confirms the messaging layer is ready. + +![ GCP onsole alt-text#center](images/verify_pub_sub.png "Figure 3: Pub/Sub Resources") + +### Identify Compute Engine Service Account + +Dataflow and the Axion VM both rely on the **Compute Engine default service account**. + +Navigate to: + +**IAM & Admin → IAM** + +Locate the service account in the format: + +```bash +-compute@developer.gserviceaccount.com +``` + +This account will be granted the required permissions. + +### Assign Required IAM Roles + +Grant the following roles to the **Compute Engine default service account**: + +| Role | Purpose | +|----|----| +| Dataflow Admin | Create and manage Dataflow jobs | +| Dataflow Worker | Execute Dataflow workers | +| Pub/Sub Subscriber | Read messages from Pub/Sub | +| Pub/Sub Publisher | Publish test messages | +| Storage Object Admin | Read/write Dataflow temp files | +| Service Account User | Allow service account usage | + +**Steps (UI):** +1. Go to **IAM & Admin → IAM** +2. Click **Grant Access** +3. Add the service account +4. Assign the roles listed above +5. Save + +![ GCP onsole alt-text#center](images/roles.png "Figure 4: Required IAM Roles") + +VM OAuth scopes are limited by default. IAM roles are authoritative. + +### Create GCS Bucket for Dataflow (UI) + +Dataflow requires a Cloud Storage bucket for staging and temp files. + +1. Go to **Cloud Storage → Buckets** +2. Click **Create** +3. Configure: + - **Bucket name:** `imperial-time-463411-q5-dataflow-temp` + - **Location type:** Region + - **Region:** `us-central1` +4. Leave defaults for storage class and access control +5. 
Click **Create** + +![ GCP onsole alt-text#center](images/bucket.png "Figure 5: GCS Bucket") + +### Grant Bucket Access + +Ensure the Compute Engine service account has access to the bucket: + +- Role: **Storage Object Admin** + +This allows Dataflow workers to upload and read job artifacts. + +### Validation Checklist + +Before proceeding, confirm: + +- Pub/Sub topic exists (`logs-topic`) +- Pub/Sub subscription exists (`logs-sub`) +- IAM roles are assigned correctly +- GCS temp bucket is created and accessible + +With Pub/Sub and IAM configured, the environment is now ready for **Axion VM setup and ClickHouse installation** in the next phase.
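+
+You can also verify the storage and IAM items from the command line. This is a minimal sketch with placeholders; substitute your own bucket name, project ID, and project number:
+
+```console
+# Confirm the Dataflow temp bucket exists and is reachable
+gcloud storage buckets describe gs://<BUCKET_NAME>
+
+# List the roles granted to the Compute Engine default service account
+gcloud projects get-iam-policy <PROJECT_ID> \
+  --flatten="bindings[].members" \
+  --filter="bindings.members:<PROJECT_NUMBER>-compute@developer.gserviceaccount.com" \
+  --format="table(bindings.role)"
+```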