2 changes: 2 additions & 0 deletions README.md
@@ -34,6 +34,8 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
Spark ecosystem without requiring any code changes.

Comet also accelerates Apache Iceberg when performing Parquet scans from Spark.

[Apache DataFusion]: https://datafusion.apache.org

# Benefits of Using Comet
4 changes: 2 additions & 2 deletions docs/source/contributor-guide/benchmarking_aws_ec2.md
@@ -179,8 +179,8 @@ $SPARK_HOME/bin/spark-submit \
Install Comet JAR from Maven:

```shell
wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.9.0/comet-spark-spark3.5_2.12-0.9.0.jar -P $SPARK_HOME/jars
export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.9.0.jar
```
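
The benchmark command itself falls outside this hunk; as a rough sketch, the Comet-specific portion of such a `spark-submit` typically looks like the following (the benchmark application JAR and its `--data` argument are omitted, and the off-heap size is illustrative, not prescriptive):

```shell
# Sketch: standard settings for enabling Comet in spark-submit; the
# benchmark application and its arguments would follow these options.
$SPARK_HOME/bin/spark-submit \
    --jars $COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.exec.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g
```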

Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -43,7 +43,6 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
Comet Overview <user-guide/overview>
Installing Comet <user-guide/installation>
Building From Source <user-guide/source>
Kubernetes Guide <user-guide/kubernetes>
Supported Data Sources <user-guide/datasources>
Supported Data Types <user-guide/datatypes>
Supported Operators <user-guide/operators>
@@ -52,6 +51,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
Compatibility Guide <user-guide/compatibility>
Tuning Guide <user-guide/tuning>
Metrics Guide <user-guide/metrics>
Iceberg Guide <user-guide/iceberg>
Kubernetes Guide <user-guide/kubernetes>

.. _toc.contributor-guide-links:
.. toctree::
228 changes: 112 additions & 116 deletions docs/source/user-guide/compatibility.md

Large diffs are not rendered by default.

8 changes: 7 additions & 1 deletion docs/source/user-guide/datasources.md
@@ -28,6 +28,12 @@ in the schema are supported. When this option is not enabled, the scan will fall
enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into Arrow format, allowing native
execution to happen after that, but the process may not be efficient.
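
As a minimal sketch, enabling the conversion might look like this (the config name is taken from this guide; the plugin setup is the usual Comet configuration):

```shell
# Sketch: convert Parquet data to Arrow eagerly for scans that Comet
# cannot execute natively (config name from this guide).
$SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.convert.parquet.enabled=true
```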

### Apache Iceberg

Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for more information.

[Iceberg Guide]: iceberg.md

### CSV

Comet does not provide native CSV scan, but when `spark.comet.convert.csv.enabled` is enabled, data is immediately
@@ -88,7 +94,7 @@ root
| |-- lastName: string (nullable = true)
| |-- ageInYears: integer (nullable = true)

25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.7.0 initialized
25/01/30 16:50:43 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
== Physical Plan ==
* CometColumnarToRow (2)
+- CometNativeScan: (1)
6 changes: 4 additions & 2 deletions docs/source/user-guide/datatypes.md
@@ -39,5 +39,7 @@ The following Spark data types are currently available:
- Timestamp
- TimestampNTZ
- Null
- Struct
- Array
- Complex Types
- Struct
- Array
- Map
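
For illustration only (plain Spark API, nothing Comet-specific), a schema exercising the complex types listed above might look like:

```scala
// Sketch: a Spark schema using the complex types listed above
// (Struct, Array, Map). Field names are arbitrary examples.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StructType(Seq(
    StructField("first", StringType),
    StructField("last", StringType)
  ))),
  StructField("scores", ArrayType(IntegerType)),
  StructField("attributes", MapType(StringType, StringType))
))
```
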
36 changes: 15 additions & 21 deletions docs/source/user-guide/iceberg.md
@@ -44,22 +44,13 @@ export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.10.0-SNAPSHOT.ja

## Build Iceberg

Clone the Iceberg repository.
Clone the Iceberg repository and apply the code changes needed by Comet:

```shell
git clone git@github.com:apache/iceberg.git
```

It will be necessary to make some small changes to Iceberg:

- Update Gradle files to change Comet version to `0.10.0-SNAPSHOT`.
- Replace `import org.apache.comet.shaded.arrow.c.CometSchemaImporter;` with `import org.apache.comet.CometSchemaImporter;`
- Modify `SparkBatchQueryScan` so that it implements the `SupportsComet` interface
- Stop shading Parquet by commenting out the following lines in the iceberg-spark build:

```
// relocate 'org.apache.parquet', 'org.apache.iceberg.shaded.org.apache.parquet'
// relocate 'shaded.parquet', 'org.apache.iceberg.shaded.org.apache.parquet.shaded'
cd iceberg
git checkout apache-iceberg-1.8.1
git apply ../datafusion-comet/dev/diffs/iceberg/1.8.1.diff
```

Perform a clean build
@@ -74,7 +65,7 @@ Perform a clean build
Set `ICEBERG_JAR` environment variable.

```shell
export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.10.0-SNAPSHOT.jar
export ICEBERG_JAR=`pwd`/spark/v3.5/spark-runtime/build/libs/iceberg-spark-runtime-3.5_2.12-1.9.0-SNAPSHOT.jar
```

Launch Spark Shell:
@@ -93,7 +84,7 @@ $SPARK_HOME/bin/spark-shell \
--conf spark.sql.iceberg.parquet.reader-type=COMET \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g
--conf spark.memory.offHeap.size=2g
```

Create an Iceberg table. Note that Comet will not accelerate this part.
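
The exact statements are collapsed in this hunk; an illustrative sketch (the table and column names match the output below, while the types and row values are assumptions):

```
scala> spark.sql("CREATE TABLE t1 (c0 INT, c1 INT) USING iceberg")
scala> spark.sql("INSERT INTO t1 SELECT CAST(id AS INT), CAST(id AS INT) FROM range(100)")
```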
@@ -113,12 +104,6 @@ This should produce the following output:

```
scala> spark.sql(s"SELECT * from t1").show()
25/04/28 07:29:37 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
25/04/28 07:29:37 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
CollectLimit
+- Project [COMET: toprettystring is not supported]
+- CometScanWrapper

+---+---+
| c0| c1|
+---+---+
Expand All @@ -145,3 +130,12 @@ scala> spark.sql(s"SELECT * from t1").show()
+---+---+
only showing top 20 rows
```

Confirm that the query was accelerated by Comet:

```
scala> spark.sql(s"SELECT * from t1").explain()
== Physical Plan ==
*(1) CometColumnarToRow
+- CometBatchScan spark_catalog.default.t1[c0#26, c1#27] spark_catalog.default.t1 (branch=null) [filters=, groupedBy=] RuntimeFilters: []
```
2 changes: 1 addition & 1 deletion docs/source/user-guide/installation.md
@@ -51,7 +51,7 @@ use only and should not be used in production yet.

| Spark Version | Java Version | Scala Version | Comet Tests in CI | Spark SQL Tests in CI |
| -------------- | ------------ | ------------- | ----------------- |-----------------------|
| 4.0.0-preview1 | 17 | 2.13 | Yes | Yes |
| 4.0.0 | 17 | 2.13 | Yes | Yes |

Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
Cloud Service Providers.
8 changes: 0 additions & 8 deletions docs/source/user-guide/overview.md
@@ -30,14 +30,6 @@ The following diagram provides an overview of Comet's architecture.

![Comet Overview](../_static/images/comet-overview.png)

Comet aims to support:

- a native Parquet implementation, including both reader and writer
- full implementation of Spark operators, including
Filter/Project/Aggregation/Join/Exchange etc.
- full implementation of Spark built-in expressions.
- a UDF framework for users to migrate their existing UDF to native

## Architecture

The following diagram shows how Comet integrates with Apache Spark.
2 changes: 1 addition & 1 deletion docs/source/user-guide/source.md
@@ -27,7 +27,7 @@ Official source releases can be downloaded from https://dist.apache.org/repos/di

```console
# Pick the latest version
export COMET_VERSION=0.7.0
export COMET_VERSION=0.9.0
# Download the tarball
curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
# Unpack
12 changes: 0 additions & 12 deletions docs/source/user-guide/tuning.md
@@ -21,18 +21,6 @@ under the License.

Comet provides some tuning options to help you get the best performance from your queries.

## Parquet Scans

Comet currently has three distinct implementations of the Parquet scan operator. The configuration property
`spark.comet.scan.impl` is used to select an implementation. These scans are described in the
[Compatibility Guide].

[Compatibility Guide]: compatibility.md

When using `native_datafusion` or `native_iceberg_compat`, there are known performance issues when pushing filters
down to Parquet scans. Until this issue is resolved, performance can be improved by setting
`spark.sql.parquet.filterPushdown=false`.
Member Author, commenting on lines -24 to -34:

> We no longer need this section because of the new config COMET_RESPECT_PARQUET_FILTER_PUSHDOWN and users should not need to know the details of this.


## Configuring Tokio Runtime

Comet uses a global tokio runtime per executor process using tokio's defaults of one worker thread per core and a