[Datasets] Update docs for drop_columns and fix typos by c21 · Pull Request #26317 · ray-project/ray

c21 · 2022-07-06T05:49:43Z

Why are these changes needed?

We added the drop_columns() API to Datasets in #26200, so this PR updates the documentation (doc/source/data/examples/nyc_taxi_basic_processing.ipynb) to use the new API. It also fixes some minor typos found while proofreading the Datasets documentation.

Related issue number

Closes #26113

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

c21 · 2022-07-06T05:53:39Z

doc/source/data/performance-tips.rst

 ~~~~~~~~~~~~~~~~~~~

-Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)


"Selection pushdown" is confusing in the data world; to me it normally suggests projection. Other systems (such as Spark, Presto, and Parquet) consistently use "filter pushdown". We also use "filter pushdown" in other places, e.g. here
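
To make the terminology concrete, here is a minimal sketch in plain Python. It is not Ray's implementation and does not call any Ray API: `scan_rows` and the `TRIPS` sample data are hypothetical names invented for illustration. Projection pushdown limits which columns are read, while filter pushdown (relational-algebra "selection") limits which rows are read.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Sequence

Row = Dict[str, Any]


def scan_rows(
    source: Iterable[Row],
    columns: Optional[Sequence[str]] = None,
    row_filter: Optional[Callable[[Row], bool]] = None,
) -> Iterator[Row]:
    """Hypothetical toy reader that applies both pushdowns while scanning."""
    for row in source:
        # Filter pushdown: drop non-matching rows before they leave the reader.
        # The filter runs first so it can use columns that are not projected.
        if row_filter is not None and not row_filter(row):
            continue
        # Projection pushdown: keep only the requested columns.
        if columns is not None:
            row = {name: row[name] for name in columns}
        yield row


# Made-up sample rows, loosely modeled on taxi trip records.
TRIPS = [
    {"trip_id": 1, "passenger_count": 1, "fare": 7.5},
    {"trip_id": 2, "passenger_count": 3, "fare": 12.0},
    {"trip_id": 3, "passenger_count": 2, "fare": 30.25},
]

projected = list(scan_rows(TRIPS, columns=["trip_id", "fare"]))
filtered = list(scan_rows(TRIPS, row_filter=lambda r: r["fare"] > 10))
both = list(
    scan_rows(TRIPS, columns=["trip_id"], row_filter=lambda r: r["fare"] > 10)
)
```

In this sketch `projected` keeps all three rows with two columns each, `filtered` keeps two full rows, and `both` keeps only the `trip_id` of the two matching rows.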

jianoaix

nit: there seems to be a typo in the API comment (closing bracket): https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/dataset.py?L564

jianoaix · 2022-07-06T16:48:10Z

doc/source/data/performance-tips.rst

 ~~~~~~~~~~~~~~~~~~~

-Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)


Yeah, "selection" means different things in SQL vs. relational algebra. "Filter" seems like a good choice.

c21 · 2022-07-06T17:28:51Z

nit: there seems to be a typo in the API comment (closing bracket): https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/dataset.py?L564

@jianoaix - ah good catch, fixed it.

jianoaix · 2022-07-07T23:47:17Z

@clarkzinzow review or merge?

clarkzinzow

LGTM, and many thanks for the drivebys!

clarkzinzow · 2022-07-08T00:16:01Z

doc/source/data/creating-datasets.rst

  :ref:`tensor data guide <datasets_tensor_support>` for more information on working
  with tensors in Datasets. Although this simple example demonstrates reading a single
-  file, note that Datasets can also read directories of JSON files, with one tensor
+  file, note that Datasets can also read directories of NumPy files, with one tensor


Nice catch!

clarkzinzow · 2022-07-08T00:17:01Z

doc/source/data/performance-tips.rst

 ~~~~~~~~~~~~~~~~~~~

-Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)


Agreed here!

c21 · 2022-07-08T00:20:47Z

Thank you @clarkzinzow and @jianoaix for the review!

Update docs for drop_columns and fix typos

b4344bb

c21 requested review from clarkzinzow, ericl, jianoaix, jjyao, maxpumperla and scv119 as code owners July 6, 2022 05:49

c21 assigned ericl, clarkzinzow and jianoaix Jul 6, 2022

c21 commented Jul 6, 2022

jianoaix approved these changes Jul 6, 2022

Fix the typo in API doc

4b383ab

jianoaix approved these changes Jul 7, 2022

clarkzinzow approved these changes Jul 8, 2022

clarkzinzow merged commit 4e674b6 into ray-project:master Jul 8, 2022

c21 deleted the doc branch July 8, 2022 00:20

clarkzinzow merged 2 commits into ray-project:master from c21:doc
