2 changes: 1 addition & 1 deletion rust/arrow/src/json/writer.rs
@@ -329,7 +329,7 @@ fn set_column_for_json_rows(
}

/// Converts an arrow [`RecordBatch`] into a `Vec` of Serde JSON
-/// [`serde_json::map::JsonMap`]s (objects)
+/// [`JsonMap`]s (objects)
pub fn record_batches_to_json_rows(
batches: &[RecordBatch],
) -> Vec<JsonMap<String, Value>> {
63 changes: 63 additions & 0 deletions rust/datafusion/README.md
@@ -58,6 +58,69 @@ Here are some of the projects known to use DataFusion:

(if you know of another project, please submit a PR to add a link!)

## Example Usage

Run a SQL query against data stored in a CSV:

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the CSV file as a table named "example"
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

Use the DataFrame API to process data stored in a CSV:

> **@returnString** (Contributor, Mar 17, 2021): Nitpicking: this might be a little bit fun with API churn, e.g. I believe the input-expr ownership work you've recently opened would change these from slices to `Vec`s, and we don't have a way to catch that automatically like we do for the in-crate docs (am I right in thinking that `cargo test` runs all doctests?). Edit: to be clear, I don't think it's a reason to not do it, just curious if anyone has ideas for how to prevent doc drift :)
>
> **Author:** This is a good point @returnString. The way I justified the danger of drift to myself was: the main use case of this documentation (the overview) is likely to help readers answer "should I even bother to try and use this crate?". Once they decide to actually use the crate they will look at the real docs on docs.rs (from which they can copy/paste). For the purpose of an example of "what does this library do", I felt even a slightly out-of-date example might be valuable. Or maybe I am just trying to pad my github stats ;) But in all seriousness I am not committed to this PR. If it isn't a good idea I can just close it.
>
> **@returnString:** That makes sense to me; agreed that (personally, at least) I'll give less consideration to projects without simple README examples. It balloons the scope of this PR quite a lot so I'm not saying this is a good idea, but I just did a bit of digging and it looks like people have gone through this particular problem before: https://blog.guillaume-gomez.fr/articles/2019-04-13+Keeping+Rust+projects%27+README.md+code+examples+up-to-date
> And the end result of that is https://crates.io/crates/doc-comment, which looks like it'll wire up any rust-tagged code blocks in external files as doctests, optionally only for `#[cfg(test)]`. If it's useful, I could log a followup task to integrate that and take a look at it myself?
>
> **Author:** @returnString https://crates.io/crates/doc-comment looks super awesome -- I think that would be most helpful.

```rust
use datafusion::prelude::*;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // create the dataframe
    let mut ctx = ExecutionContext::new();
    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;

    let df = df.filter(col("a").lt_eq(col("b")))?
        .aggregate(&[col("a")], &[min(col("b"))])?
        .limit(100)?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}
```

> **Contributor** (on the `collect` call): Just noticed when using this as a test for #9749 that we're collecting twice here.
>
> **Author:** lol "testing for the win!"

Both of these examples will produce:

```text
+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2 |
+---+--------+
```
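As discussed in the review thread above, one way to keep README examples from drifting out of sync with the API is the `doc-comment` crate, which registers rust-tagged code blocks in external Markdown files as doctests. A minimal sketch, assuming `doc-comment` is added as a dev-dependency of the `datafusion` crate (the module name below is illustrative):

```rust
// Sketch only: wires the ```rust blocks in this README up as doctests,
// so `cargo test` fails when the examples drift from the real API.
// Assumes `doc-comment` is listed under [dev-dependencies] in Cargo.toml.
#[cfg(doctest)]
mod readme_doctests {
    // The path is relative to the crate root (rust/datafusion here).
    doc_comment::doctest!("../README.md");
}
```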



## Using DataFusion as a library

29 changes: 28 additions & 1 deletion rust/datafusion/src/lib.rs
@@ -31,7 +31,8 @@
//! as well as a query optimizer and execution engine capable of parallel execution
//! against partitioned data sources (CSV and Parquet) using threads.
//!
-//! Below is an example of how to execute a query against a CSV using [`DataFrames`](dataframe::DataFrame):
+//! Below is an example of how to execute a query against data stored
+//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
//!
//! ```rust
//! # use datafusion::prelude::*;
@@ -52,6 +53,19 @@
//!
//! // execute the plan
//! let results: Vec<RecordBatch> = df.collect().await?;
//!
//! // format the results
//! let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?;
//!
//! let expected = vec![
//! "+---+--------+",
//! "| a | MIN(b) |",
//! "+---+--------+",
//! "| 1 | 2 |",
//! "+---+--------+"
//! ];
//!
//! assert_eq!(pretty_results.trim().lines().collect::<Vec<_>>(), expected);
//! # Ok(())
//! # }
//! ```
@@ -74,6 +88,19 @@
//!
//! // execute the plan
//! let results: Vec<RecordBatch> = df.collect().await?;
//!
//! // format the results
//! let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?;
//!
//! let expected = vec![
//! "+---+--------+",
//! "| a | MIN(b) |",
//! "+---+--------+",
//! "| 1 | 2 |",
//! "+---+--------+"
//! ];
//!
//! assert_eq!(pretty_results.trim().lines().collect::<Vec<_>>(), expected);
//! # Ok(())
//! # }
//! ```
5 changes: 3 additions & 2 deletions rust/datafusion/src/physical_plan/regex_expressions.rs
@@ -54,8 +54,9 @@ fn regex_replace_posix_groups(replacement: &str) -> String {
.into_owned()
}

-/// Replaces substring(s) matching a POSIX regular expression
-/// regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM'
+/// Replaces substring(s) matching a POSIX regular expression.
+///
+/// example: `regexp_replace('Thomas', '.[mN]a.', 'M') = 'ThM'`
pub fn regexp_replace<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
// creating Regex is expensive so create hashmap for memoization
let mut patterns: HashMap<String, Regex> = HashMap::new();
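For context on the `regex_replace_posix_groups` helper named in the hunk header: it rewrites POSIX-style backreferences like `\1` in the replacement string into the `${1}` form that the Rust `regex` crate expects. The dependency-free sketch below illustrates that rewrite; the function name and the character-walking approach are illustrative assumptions, not the crate's actual implementation (which performs the substitution with a regex):

```rust
// Hypothetical std-only sketch: rewrite POSIX backreferences such as `\1`
// into the `${1}` capture-group syntax understood by the `regex` crate.
fn posix_groups_to_dollar(replacement: &str) -> String {
    let mut out = String::with_capacity(replacement.len());
    let mut chars = replacement.chars().peekable();
    while let Some(c) = chars.next() {
        // A backslash followed by digits is a POSIX group reference.
        if c == '\\' && chars.peek().map_or(false, |n| n.is_ascii_digit()) {
            out.push_str("${");
            while let Some(&d) = chars.peek() {
                if d.is_ascii_digit() {
                    out.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            out.push('}');
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(posix_groups_to_dollar(r"\1-\2"), "${1}-${2}");
    assert_eq!(posix_groups_to_dollar("no groups here"), "no groups here");
    println!("{}", posix_groups_to_dollar(r"abc\12x"));
}
```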