This document records the actual local commands used to start Spark, Trino, and HMS in the current environment.
Load the shared environment first:
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh

Use the existing script:
/home/ubuntu/disk1/opt/run/start-metastore.sh

This script:
- loads pixels-delta-env.sh
- sets JAVA_HOME=$JAVA11_HOME
- exports AWS credentials
- starts HMS in the background
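For orientation, a minimal sketch of what a script like this might contain, assuming HMS is launched via $HIVE_HOME/bin/hive (an assumption; the installed script is authoritative):

# Sketch only; $HIVE_HOME and the exact launch command are assumptions.
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA11_HOME"
# Start HMS detached and record its PID for the checks below.
nohup "$HIVE_HOME/bin/hive" --service metastore \
  > /home/ubuntu/disk1/opt/logs/metastore.out 2>&1 &
echo $! > /home/ubuntu/disk1/opt/run/metastore.pid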
Common checks:
cat /home/ubuntu/disk1/opt/run/metastore.pid
ss -ltn | rg ':9083'
tail -f /home/ubuntu/disk1/opt/logs/metastore.out

Use the existing script:
/home/ubuntu/disk1/opt/run/start-trino.sh

This script:
- loads pixels-delta-env.sh
- sets JAVA_HOME=$JAVA23_HOME
- starts Trino in the background
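The likely shape of this script, assuming it uses Trino's bundled launcher (a sketch; the installed script is authoritative):

# Sketch only; how trino.pid gets recorded is assumed, not shown here.
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA23_HOME"
/home/ubuntu/disk1/opt/trino-server-466/bin/launcher start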
Common checks:
cat /home/ubuntu/disk1/opt/run/trino.pid
ss -ltn | rg ':8080'
tail -f /home/ubuntu/disk1/opt/logs/trino.out

Before Trino can query Delta tables on S3, confirm that the delta_lake catalog has S3 settings.
Live file:
/home/ubuntu/disk1/opt/trino-server-466/etc/catalog/delta_lake.properties

Repository template:
./etc/trino-delta_lake.properties.example

Minimal configuration:
connector.name=delta_lake
hive.metastore.uri=thrift://127.0.0.1:9083
delta.register-table-procedure.enabled=true
delta.enable-non-concurrent-writes=true
fs.native-s3.enabled=true
s3.aws-access-key=YOUR_AWS_ACCESS_KEY_ID
s3.aws-secret-key=YOUR_AWS_SECRET_ACCESS_KEY
s3.region=us-east-2
s3.endpoint=https://s3.us-east-2.amazonaws.com

Restart Trino after editing delta_lake.properties:
pkill -f 'trino-server-466' || true
/home/ubuntu/disk1/opt/run/start-trino.sh
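With delta.register-table-procedure.enabled=true set above, an existing Delta table on S3 can also be attached to the metastore from the Trino CLI if it was never registered. A sketch using the schema, table, and bucket from the examples in this document (it assumes each table sits in a subdirectory named after it; note Trino's native S3 filesystem uses the s3:// scheme):

CALL delta_lake.system.register_table(
    schema_name => 'hybench_sf10',
    table_name => 'customer',
    table_location => 's3://home-zinuo/deltalake/hybench_sf10/customer');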
This project usually does not run a standalone Spark service first. It normally submits jobs directly.

Load the environment and switch to Java 17:
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA17_HOME"

Check the Spark version:
$SPARK_HOME/bin/spark-submit --version

Typical Spark job submission:
$SPARK_HOME/bin/spark-submit \
--master local[4] \
--driver-memory 20g \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--class io.pixelsdb.spark.app.PixelsBenchmarkDeltaImportApp \
/home/ubuntu/disk1/projects/pixels-spark/target/pixels-spark-0.1.jar \
/home/ubuntu/disk1/hybench_sf10 \
s3a://home-zinuo/deltalake/hybench_sf10 \
local[4] \
customer
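The last four arguments are positional and are consumed by the application, not by spark-submit. A plausible reading, inferred from this example rather than verified against the application source:

# /home/ubuntu/disk1/hybench_sf10            local input data directory
# s3a://home-zinuo/deltalake/hybench_sf10    target Delta Lake root on S3
# local[4]                                   master string passed through to the app
# customer                                   table to import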
After Trino starts, connect to the local coordinator:

/home/ubuntu/disk1/opt/trino-cli/trino --server 127.0.0.1:8080

Run a single SQL statement:
/home/ubuntu/disk1/opt/trino/bin/trino \
--server 127.0.0.1:8080 \
--execute "SHOW CATALOGS"List schemas in the Delta Lake catalog:
/home/ubuntu/disk1/opt/trino/bin/trino \
--server 127.0.0.1:8080 \
--execute "SHOW SCHEMAS FROM delta_lake"List tables in a schema:
/home/ubuntu/disk1/opt/trino/bin/trino \
--server 127.0.0.1:8080 \
--execute "SHOW TABLES FROM delta_lake.hybench_sf10"Query a table directly:
/home/ubuntu/disk1/opt/trino/bin/trino \
--server 127.0.0.1:8080 \
--execute "SELECT * FROM delta_lake.hybench_sf10.customer LIMIT 10"Inside the interactive CLI:
USE delta_lake.hybench_sf10;
SHOW TABLES;
SELECT count(*) FROM customer;
SELECT * FROM savingaccount LIMIT 10;
If you really want standalone master / worker processes first:

source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh
export JAVA_HOME="$JAVA17_HOME"
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://127.0.0.1:7077

Common checks:
ss -ltn | rg ':7077|:8081'
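To stop the standalone daemons later, use the matching scripts from the same sbin directory:

$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh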
For this project, the usual entry points are:

/home/ubuntu/disk1/opt/run/start-metastore.sh
/home/ubuntu/disk1/opt/run/start-trino.sh
source /home/ubuntu/disk1/opt/conf/pixels-delta-env.sh && export JAVA_HOME="$JAVA17_HOME"

Then run the actual Spark job scripts, for example:
./scripts/run-import-hybench-sf10.sh
./scripts/run-cdc-hybench-sf10.sh
./scripts/status-cdc-hybench-sf10.sh
./scripts/stop-cdc-hybench-sf10.sh

Start the local dependency stack first:
./scripts/start-local-cdc-stack.sh

This script checks and starts, when needed:
- HMS
- Trino
- Pixels metadata
- optional Pixels RPC
- Spark History Server
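The check-then-start pattern presumably resembles the manual port checks earlier in this document; a hypothetical sketch for the HMS case (the script's actual logic may differ):

# Start HMS only if nothing is listening on its thrift port yet.
if ! ss -ltn | rg -q ':9083'; then
  /home/ubuntu/disk1/opt/run/start-metastore.sh
fi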
Start the full sf10 CDC workload:
./scripts/run-cdc-hybench-sf10.sh

This starts one independent Spark CDC job per table.
Start metric collection:
./scripts/collect-cdc-metrics.sh

Run metrics collection for a specific profile:
PROFILE=hybench_sf10 ./scripts/collect-cdc-metrics.sh
PROFILE=hybench_sf1000 ./scripts/collect-cdc-metrics.sh
PROFILE=chbenchmark_w10000 ./scripts/collect-cdc-metrics.sh

PROFILE is case-insensitive and normalizes separators, for example:
hybench_sf10
HyBench SF1000
CHBENCHMARK-WH10000
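A hypothetical sketch of that normalization (lowercase, separators to underscores; the script's actual rules may differ):

# normalize_profile is an illustrative helper, not part of the repository.
normalize_profile() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr ' -' '__'
}
normalize_profile "HyBench SF1000"   # prints hybench_sf1000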
Start the read-only monitoring page:
python3 ./scripts/cdc_web_monitor.py

Default monitoring URL:
http://127.0.0.1:8084
Raw JSON endpoint:
http://127.0.0.1:8084/api/status
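A quick way to poll the endpoint from the shell (assumes jq is installed; the JSON shape is whatever cdc_web_monitor.py emits):

curl -s http://127.0.0.1:8084/api/status | jq .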
The monitor shows two kinds of information.
Service status:
- HMS
- Trino
- Pixels Metadata
- Pixels RPC
- Spark History
Job status:
- per-table running/stopped
- PID
- per-job CPU%
- per-job RSS memory
- uptime
- latest log summary
Overall system metrics come from collect-cdc-metrics.sh:
load1, mem_used_mb, mem_avail_mb, disk_used_pct, net_rx_mbps, net_tx_mbps, disk_read_mbps, disk_write_mbps
So the System panel at the top shows machine-wide metrics, not just a single Spark process.
Metric file locations:
- system CSV: /tmp/hybench_sf10_cdc_metrics/system.csv
- resource CSV: /home/ubuntu/disk1/projects/pixels-spark/data/hybench/sf10/resource/resource_cdc.csv
- per-table JSON: /tmp/hybench_sf10_cdc_metrics/<table>.json
- per-table history CSV: /tmp/hybench_sf10_cdc_metrics/<table>.csv
The resource CSV follows the same shape as files such as resource_iceberg.csv, with this header:
time,cpu,jvm_heap,jvm_managed,jvm_direct,jvm_noheap,net_rx_mbps,net_tx_mbps,disk_read_mbps,disk_write_mbps

By default:
- cpu: summed CPU across all CDC Spark JVMs
- jvm_heap: summed used heap across all CDC Spark JVMs
- jvm_managed: summed -Xmx across all CDC Spark JVMs
- jvm_direct: currently 0 MiB, because JVM Native Memory Tracking is not enabled
- jvm_noheap: summed Metaspace + class space usage
- net_rx_mbps / net_tx_mbps: receive/transmit throughput on the primary network interface, in Mbps
- disk_read_mbps / disk_write_mbps: read/write throughput on the main disk backing pixels.tmp.root, in Mbps
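For a quick aligned look at the most recent samples, the common column utility works on the resource CSV listed above (the header line scrolls off with tail; drop it to see the full file):

column -s, -t < /home/ubuntu/disk1/projects/pixels-spark/data/hybench/sf10/resource/resource_cdc.csv | tail -n 5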
Optional config:
- pixels.cdc.network-interface
- pixels.cdc.disk-device
Both default to auto. In auto mode, the collector picks the default-route interface and the disk backing the mount used by pixels.tmp.root.
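Roughly how those defaults can be resolved by hand (a sketch of the idea, not the collector's actual code; /tmp stands in for wherever pixels.tmp.root points):

ip route show default        # interface of the default route
df --output=source /tmp      # block device backing the mount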
Related logs:
- CDC job logs: /tmp/hybench_sf10_cdc_logs/<table>.log
- web monitor log: /tmp/hybench_sf10_cdc_web.log
If you want a CLI view of whole-machine CPU and memory, you can also use:
top
htop
pidstat -r -u -d 1
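To narrow pidstat to just the CDC Spark JVMs (assumes the jobs show up as SparkSubmit processes; adjust the pattern if they do not, and note pgrep returns nothing when no jobs are running):

pidstat -r -u -d -p "$(pgrep -d, -f SparkSubmit)" 1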