Merged
34 changes: 23 additions & 11 deletions .github/workflows/rust.yml
@@ -2,24 +2,36 @@ name: Rust

on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]

env:
CARGO_TERM_COLOR: always
PYTHON_VERSION: 3.9
TPCH_SCALING_FACTOR: "1"
TPCH_TEST_PARTITIONS: "1"
TPCH_DATA_PATH: "data"

jobs:
build:

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- name: Install protobuf compiler
shell: bash
run: sudo apt-get install protobuf-compiler
- name: Build Rust code
run: cargo build --verbose
- name: Run tests
run: cargo test --verbose
- uses: actions/checkout@v3
- name: Install protobuf compiler
shell: bash
run: sudo apt-get install protobuf-compiler
- name: Build Rust code
run: cargo build --verbose
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install test dependencies
run: |
python -m pip install --upgrade pip
pip install -r tpch/requirements.txt
- name: Generate test data
run: |
./scripts/gen-test-data.sh
- name: Run tests
run: cargo test --verbose
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ venv
*.so
*.log
results-sf*
data
tpch/tpch-dbgen
37 changes: 31 additions & 6 deletions Cargo.lock

Some generated files are not rendered by default.

7 changes: 6 additions & 1 deletion Cargo.toml
@@ -45,6 +45,11 @@ uuid = "1.2"
rustc_version = "0.4.0"
tonic-build = { version = "0.8", default-features = false, features = ["transport", "prost"] }

[dev-dependencies]
anyhow = "1.0.89"
pretty_assertions = "1.4.0"
regex = "1.11.0"

[lib]
name = "datafusion_ray"
crate-type = ["cdylib", "rlib"]
@@ -54,4 +59,4 @@ name = "datafusion_ray._datafusion_ray_internal"

[profile.release]
codegen-units = 1
lto = true
lto = true
40 changes: 34 additions & 6 deletions README.md
@@ -19,8 +19,8 @@

# DataFusion on Ray

> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
[Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).
> This was originally a research project donated from [ray-sql](https://github.com/datafusion-contrib/ray-sql) to evaluate performing distributed SQL queries from Python, using
> [Ray](https://www.ray.io/) and [DataFusion](https://github.com/apache/arrow-datafusion).

DataFusion Ray is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow](https://arrow.apache.org/), [Apache DataFusion](https://datafusion.apache.org/) and [Ray](https://www.ray.io/).

@@ -33,7 +33,7 @@ DataFusion Ray is a distributed SQL query engine powered by the Rust implementat

## Non Goals

- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
- Re-build the cluster scheduling systems like what [Ballista](https://datafusion.apache.org/ballista/) did.
- Ballista is extremely complex and utilizing Ray feels like it abstracts some of that complexity away.
- DataFusion Ray delegates cluster management to Ray.

@@ -120,10 +120,38 @@ python -m pip install -r requirements-in.txt

Whenever rust code changes (your changes or via `git pull`):

```bash
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
```


## Testing

Running the local Rust tests requires generating the TPC-H test data. This can be
done by running the following command:

```bash
./scripts/gen-test-data.sh
```
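The generator script exits with an error unless `TPCH_SCALING_FACTOR` and `TPCH_TEST_PARTITIONS` are set (see the checks in `scripts/gen-test-data.sh`). These exports mirror the CI defaults from `.github/workflows/rust.yml`; adjust them locally as needed:

```shell
# CI defaults from .github/workflows/rust.yml
export TPCH_SCALING_FACTOR="1"
export TPCH_TEST_PARTITIONS="1"
```

A scaling factor of 1 keeps the generated dataset small enough for quick local test runs.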

Tests compare generated plans against expected plans, which unfortunately contain
the paths to the Parquet tables. The paths committed under version control are
those of a GitHub runner and won't work locally. You can fix them by running the
following command:

```bash
./scripts/replace-expected-plans-paths.sh local-dev
```

When you instead need to regenerate the plans, remove the contents of
`testdata/expected-plans` and re-run the planner tests; the regenerated plans
will contain your local paths. You can fix them before committing by running:

```bash
./scripts/replace-expected-plans-paths.sh pre-ci
```

## Benchmarking
60 changes: 60 additions & 0 deletions scripts/gen-test-data.sh
@@ -0,0 +1,60 @@
#!/bin/bash

set -e

create_directories() {
mkdir -p data
}

clone_and_build_tpch_dbgen() {
if [ -z "$(ls -A tpch/tpch-dbgen)" ]; then
echo "tpch/tpch-dbgen folder is empty. Cloning repository..."
git clone https://github.com/databricks/tpch-dbgen.git tpch/tpch-dbgen
cd tpch/tpch-dbgen
make
cd ../../
else
echo "tpch/tpch-dbgen folder is not empty. Skipping cloning of TPCH dbgen."
fi
}

generate_data() {
cd tpch/tpch-dbgen
if [ "$TPCH_TEST_PARTITIONS" -gt 1 ]; then
for i in $(seq 1 "$TPCH_TEST_PARTITIONS"); do
./dbgen -f -s "$TPCH_SCALING_FACTOR" -C "$TPCH_TEST_PARTITIONS" -S "$i"
done
else
./dbgen -f -s "$TPCH_SCALING_FACTOR"
fi
mv ./*.tbl* ../../data
}

convert_data() {
cd ../../
python -m tpch.tpchgen convert --partitions "$TPCH_TEST_PARTITIONS"
}

main() {
if [ -z "$TPCH_TEST_PARTITIONS" ]; then
echo "Error: TPCH_TEST_PARTITIONS is not set."
exit 1
fi

if [ -z "$TPCH_SCALING_FACTOR" ]; then
echo "Error: TPCH_SCALING_FACTOR is not set."
exit 1
fi

create_directories

if [ -z "$(ls -A data)" ]; then
clone_and_build_tpch_dbgen
generate_data
convert_data
else
echo "Data folder is not empty. Skipping cloning and data generation."
fi
}

main
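The `-C`/`-S` flags used in `generate_data` come from tpch-dbgen: `-C <n>` declares that generation is split into n chunks and `-S <i>` emits only chunk i. A sketch of how the partition loop expands, echoing the commands instead of invoking `dbgen` (which is not built here):

```shell
# Expand the per-partition loop without running dbgen; echo shows each command.
TPCH_SCALING_FACTOR=1
TPCH_TEST_PARTITIONS=3
for i in $(seq 1 "$TPCH_TEST_PARTITIONS"); do
    echo ./dbgen -f -s "$TPCH_SCALING_FACTOR" -C "$TPCH_TEST_PARTITIONS" -S "$i"
done
```

Each partition produces its own `.tbl.<i>` files, which is why the script later moves `./*.tbl*` rather than `./*.tbl`.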
44 changes: 44 additions & 0 deletions scripts/replace-expected-plans-paths.sh
@@ -0,0 +1,44 @@
#!/bin/bash

# This script helps change the path to parquet files in expected plans for
# local development and CI

set -e

if [ "$#" -ne 1 ]; then
echo "Usage: $0 <mode>"
echo "Modes: pre-ci, local-dev"
exit 1
fi

# Assign the parameter to the mode variable
mode=$1

ci_dir="home/runner/work/datafusion-ray/datafusion-ray"
current_dir=$(pwd)
current_dir_no_leading_slash="${current_dir#/}"
expected_plans_dir="./testdata/expected-plans"

# Function to replace paths in files
replace_paths() {
local search=$1
local replace=$2
find "$expected_plans_dir" -type f -exec sed -i "s|$search|$replace|g" {} +
echo "Replaced all occurrences of '$search' with '$replace' in files within '$expected_plans_dir'."
}

# Handle the modes
case $mode in
pre-ci)
replace_paths "$current_dir_no_leading_slash" "$ci_dir"
;;
local-dev)
replace_paths "$ci_dir" "$current_dir_no_leading_slash"
;;
*)
echo "Invalid mode: $mode"
echo "Usage: $0 <mode>"
echo "Modes: pre-ci, local-dev"
exit 1
;;
esac
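A quick way to see what `replace_paths` does is to run the same `find`/`sed` pipeline against a throwaway directory. The plan line and the local path below are illustrative stand-ins; the real script operates on `testdata/expected-plans` and uses `$(pwd)` with the leading slash stripped:

```shell
# Demo of the local-dev substitution; plan text and local path are illustrative.
ci_dir="home/runner/work/datafusion-ray/datafusion-ray"
local_dir="tmp/demo-checkout"   # stand-in for $(pwd) minus the leading slash

mkdir -p demo-plans
echo "ParquetExec: file=/$ci_dir/data/lineitem.parquet" > demo-plans/q1.txt

# Same invocation the script's replace_paths function performs
find demo-plans -type f -exec sed -i "s|$ci_dir|$local_dir|g" {} +

cat demo-plans/q1.txt
```

Because the substitution uses `|` as the sed delimiter, the slashes inside both paths need no escaping, which is the same design choice the script makes.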