Local Spark / Trino / HMS Startup Commands

This document records the actual local commands used to start Spark, Trino, and HMS in the current environment.

1. Shared Environment

Load the shared environment first:

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
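
The exact contents of pixels-delta-env.sh are environment-specific. Based on the variables referenced by the startup commands below, it is expected to export at least something like the following (paths and values are placeholders, not the real contents):

# Sketch only -- actual paths and values differ per machine.
export JAVA11_HOME=/path/to/jdk-11         # used by HMS
export JAVA17_HOME=/path/to/jdk-17         # used by Spark jobs
export JAVA23_HOME=/path/to/jdk-23         # used by Trino
export SPARK_HOME=/path/to/spark
export AWS_ACCESS_KEY_ID=...               # S3 credentials used by HMS and Spark
export AWS_SECRET_ACCESS_KEY=...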

2. Start HMS

Use the existing script:

/home/ubuntu/disk1/opt/run/start-metastore.sh

This script:

  • loads pixels-delta-env.sh
  • sets JAVA_HOME=$JAVA11_HOME
  • exports AWS credentials
  • starts HMS in the background
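
In outline, the script does roughly the following (a sketch based on the points above; $HIVE_HOME and the exact launch command are assumptions):

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA11_HOME"
# credentials are assumed to be defined in the shared env file
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
nohup "$HIVE_HOME/bin/hive" --service metastore \
  > /home/ubuntu/disk1/opt/logs/metastore.out 2>&1 &
echo $! > /home/ubuntu/disk1/opt/run/metastore.pid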

Common checks:

cat /home/ubuntu/disk1/opt/run/metastore.pid
ss -ltn | rg ':9083'
tail -f /home/ubuntu/disk1/opt/logs/metastore.out
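
HMS can take a moment to open its thrift port. A small wait loop (reusing the ss check above) avoids racing Trino against it:

for i in $(seq 1 30); do
  ss -ltn | rg -q ':9083' && break
  sleep 2
done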

3. Start Trino

Use the existing script:

/home/ubuntu/disk1/opt/run/start-trino.sh

This script:

  • loads pixels-delta-env.sh
  • sets JAVA_HOME=$JAVA23_HOME
  • starts Trino in the background
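
Analogously to the metastore script, start-trino.sh is expected to look roughly like this (a sketch; the launcher invocation is an assumption):

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA23_HOME"
nohup /home/ubuntu/disk1/opt/trino-server-466/bin/launcher run \
  > /home/ubuntu/disk1/opt/logs/trino.out 2>&1 &
echo $! > /home/ubuntu/disk1/opt/run/trino.pid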

Common checks:

cat /home/ubuntu/disk1/opt/run/trino.pid
ss -ltn | rg ':8080'
tail -f /home/ubuntu/disk1/opt/logs/trino.out
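
Beyond the port check, the coordinator's info endpoint reports whether Trino has finished starting ("starting" should be false before running queries):

curl -s http://127.0.0.1:8080/v1/info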

Before Trino can query Delta tables on S3, confirm that the delta_lake catalog is configured with the S3 credentials, region, and endpoint.

Live file:

/home/ubuntu/disk1/opt/trino-server-466/etc/catalog/delta_lake.properties

Repository template:

./etc/trino-delta_lake.properties.example

Minimal configuration:

connector.name=delta_lake
hive.metastore.uri=thrift://127.0.0.1:9083
delta.register-table-procedure.enabled=true
delta.enable-non-concurrent-writes=true
fs.native-s3.enabled=true
s3.aws-access-key=YOUR_AWS_ACCESS_KEY_ID
s3.aws-secret-key=YOUR_AWS_SECRET_ACCESS_KEY
s3.region=us-east-2
s3.endpoint=https://s3.us-east-2.amazonaws.com
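
To see how the live catalog file differs from the repository template (the credential, region, and endpoint lines are expected to differ), a quick diff helps:

diff ./etc/trino-delta_lake.properties.example \
  /home/ubuntu/disk1/opt/trino-server-466/etc/catalog/delta_lake.properties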

Restart Trino after editing delta_lake.properties:

pkill -f 'trino-server-466' || true
/home/ubuntu/disk1/opt/run/start-trino.sh

4. Start Spark

This project usually does not run a standalone Spark service; jobs are normally submitted directly with a local master.

Load the environment and switch to Java 17:

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA17_HOME"
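
A quick check that the Java switch took effect:

"$JAVA_HOME/bin/java" -version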

Check the Spark version:

$SPARK_HOME/bin/spark-submit --version

Typical Spark job submission:

$SPARK_HOME/bin/spark-submit \
  --master local[4] \
  --driver-memory 20g \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --class io.pixelsdb.spark.app.PixelsBenchmarkDeltaImportApp \
  /home/ubuntu/disk1/projects/pixels-spark/target/pixels-spark-0.1.jar \
  /home/ubuntu/disk1/hybench_sf10 \
  s3a://home-zinuo/deltalake/hybench_sf10 \
  local[4] \
  customer
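
After an import finishes, one way to check the written Delta data from Spark itself is a spark-sql one-liner against the S3 path. This assumes the Delta Lake jars are already on Spark's classpath and that the table lives in a customer subdirectory under the import target (an assumption about the output layout):

$SPARK_HOME/bin/spark-sql \
  --master local[4] \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  -e "SELECT count(*) FROM delta.\`s3a://home-zinuo/deltalake/hybench_sf10/customer\`"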

5. Query with Trino CLI

After Trino starts, connect to the local coordinator:

/home/ubuntu/disk1/opt/trino-cli/trino --server 127.0.0.1:8080

Run a single SQL statement:

/home/ubuntu/disk1/opt/trino/bin/trino \
  --server 127.0.0.1:8080 \
  --execute "SHOW CATALOGS"

List schemas in the Delta Lake catalog:

/home/ubuntu/disk1/opt/trino/bin/trino \
  --server 127.0.0.1:8080 \
  --execute "SHOW SCHEMAS FROM delta_lake"

List tables in a schema:

/home/ubuntu/disk1/opt/trino/bin/trino \
  --server 127.0.0.1:8080 \
  --execute "SHOW TABLES FROM delta_lake.hybench_sf10"

Query a table directly:

/home/ubuntu/disk1/opt/trino/bin/trino \
  --server 127.0.0.1:8080 \
  --execute "SELECT * FROM delta_lake.hybench_sf10.customer LIMIT 10"

Inside the interactive CLI:

USE delta_lake.hybench_sf10;
SHOW TABLES;
SELECT count(*) FROM customer;
SELECT * FROM savingaccount LIMIT 10;
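
Because delta.register-table-procedure.enabled=true is set in the catalog, a Delta table that exists on S3 but is not yet known to the metastore can be registered from the CLI. The table location below is an assumption based on the import target path:

CALL delta_lake.system.register_table(
  schema_name => 'hybench_sf10',
  table_name => 'customer',
  table_location => 's3a://home-zinuo/deltalake/hybench_sf10/customer');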

6. Optional: Start Spark Standalone

If you do want to run standalone master and worker processes:

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA17_HOME"

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://127.0.0.1:7077

Common checks:

ss -ltn | rg ':7077|:8081'
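
Jobs would then be submitted with --master spark://127.0.0.1:7077 instead of local[4]. To stop the standalone processes afterwards:

$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh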

7. Most Common Startup Commands

For this project, the usual entry points are:

/home/ubuntu/disk1/opt/run/start-metastore.sh
/home/ubuntu/disk1/opt/run/start-trino.sh
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh && export JAVA_HOME="$JAVA17_HOME"

Then run the actual Spark job scripts, for example:

./scripts/run-import-hybench-sf10.sh
./scripts/run-cdc-hybench-sf10.sh
./scripts/status-cdc-hybench-sf10.sh
./scripts/stop-cdc-hybench-sf10.sh

8. Start CDC and Monitoring

Start the local dependency stack first:

./scripts/start-local-cdc-stack.sh

This script checks the following services and starts any that are not already running:

  • HMS
  • Trino
  • Pixels metadata
  • optional Pixels RPC
  • Spark History Server

Start the full sf10 CDC workload:

./scripts/run-cdc-hybench-sf10.sh

This starts one independent Spark CDC job per table.

Start metric collection:

./scripts/collect-cdc-metrics.sh

Run metrics collection for a specific profile:

PROFILE=hybench_sf10 ./scripts/collect-cdc-metrics.sh
PROFILE=hybench_sf1000 ./scripts/collect-cdc-metrics.sh
PROFILE=chbenchmark_w10000 ./scripts/collect-cdc-metrics.sh

PROFILE is case-insensitive and separators are normalized, so for example all of the following are accepted:

  • hybench_sf10
  • HyBench SF1000
  • CHBENCHMARK-WH10000
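
A minimal sketch of the kind of normalization this implies (not the script's actual code): lowercase the value and map spaces and dashes to underscores.

profile="$(printf '%s' "$PROFILE" | tr '[:upper:]' '[:lower:]' | tr ' -' '__')"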

Start the read-only monitoring page:

python3 ./scripts/cdc_web_monitor.py

Default monitoring URL:

http://127.0.0.1:8084

Raw JSON endpoint:

http://127.0.0.1:8084/api/status
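
The JSON endpoint is handy for scripted checks, for example:

curl -s http://127.0.0.1:8084/api/status | python3 -m json.tool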

9. What the Monitor Shows

The monitor shows two kinds of information.

Service status:

  • HMS
  • Trino
  • Pixels Metadata
  • Pixels RPC
  • Spark History

Job status:

  • per-table running / stopped
  • PID
  • per-job CPU%
  • per-job RSS memory
  • uptime
  • latest log summary

Overall system metrics come from collect-cdc-metrics.sh:

  • load1
  • mem_used_mb
  • mem_avail_mb
  • disk_used_pct
  • net_rx_mbps
  • net_tx_mbps
  • disk_read_mbps
  • disk_write_mbps

The System panel at the top therefore shows machine-wide metrics, not metrics for a single Spark process.

Metric file locations:

  • system CSV: /tmp/hybench_sf10_cdc_metrics/system.csv
  • resource CSV: /home/ubuntu/disk1/projects/pixels-spark/data/hybench/sf10/resource/resource_cdc.csv
  • per-table JSON: /tmp/hybench_sf10_cdc_metrics/<table>.json
  • per-table history CSV: /tmp/hybench_sf10_cdc_metrics/<table>.csv

The resource CSV follows the same shape as files such as resource_iceberg.csv, with this header:

time,cpu,jvm_heap,jvm_managed,jvm_direct,jvm_noheap,net_rx_mbps,net_tx_mbps,disk_read_mbps,disk_write_mbps
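
A quick way to summarize a finished run from the resource CSV, for example the average of the cpu column (column 2 in the header above):

awk -F, 'NR > 1 { sum += $2; n++ } END { if (n) print sum / n }' \
  /home/ubuntu/disk1/projects/pixels-spark/data/hybench/sf10/resource/resource_cdc.csv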

By default:

  • cpu: summed CPU across all CDC Spark JVMs
  • jvm_heap: summed used heap across all CDC Spark JVMs
  • jvm_managed: summed -Xmx across all CDC Spark JVMs
  • jvm_direct: currently 0 MiB, because JVM Native Memory Tracking is not enabled
  • jvm_noheap: summed Metaspace + class space usage
  • net_rx_mbps / net_tx_mbps: receive/transmit throughput on the primary network interface, in Mbps
  • disk_read_mbps / disk_write_mbps: read/write throughput on the main disk backing pixels.tmp.root, in Mbps

Optional config:

  • pixels.cdc.network-interface
  • pixels.cdc.disk-device

Both default to auto. In auto mode, the collector picks the default-route interface and the disk backing the mount used by pixels.tmp.root.
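
A sketch of what auto mode amounts to (the collector's actual logic may differ; /path/to/pixels.tmp.root stands for whatever directory that property points at):

# default-route network interface
ip route show default | awk '{ print $5; exit }'
# block device backing the pixels.tmp.root mount
df --output=source /path/to/pixels.tmp.root | tail -1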

Related logs:

  • CDC job logs: /tmp/hybench_sf10_cdc_logs/<table>.log
  • web monitor log: /tmp/hybench_sf10_cdc_web.log

If you want a CLI view of whole-machine CPU and memory, you can also use:

top
htop
pidstat -r -u -d 1