[Datasets] Update docs for drop_columns and fix typos#26317

Merged
clarkzinzow merged 2 commits into ray-project:master from c21:doc
Jul 8, 2022

Conversation

c21 (Contributor) commented Jul 6, 2022

Why are these changes needed?

We added the drop_columns() API to Datasets in #26200, so this PR updates the documentation to use the new API in doc/source/data/examples/nyc_taxi_basic_processing.ipynb. It also fixes some minor typos found while proofreading the Datasets documentation.
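For readers unfamiliar with the operation, the following is a minimal sketch of what dropping columns means, using plain Python dicts as a stand-in for dataset rows. The helper `drop_columns` and the sample rows are hypothetical illustrations, not Ray's implementation; with Ray Datasets the operation is exposed as a method on the dataset itself, as added in #26200.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def drop_columns(rows: Iterable[Row], cols: List[str]) -> List[Row]:
    """Return new rows without the named columns (hypothetical helper)."""
    to_drop = set(cols)
    # Build new dicts so the input rows are left unmodified.
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Made-up sample rows, loosely modeled on a taxi-trip table.
trips = [
    {"trip_id": 1, "passenger_count": 2, "store_and_fwd_flag": "N"},
    {"trip_id": 2, "passenger_count": 1, "store_and_fwd_flag": "Y"},
]

slim = drop_columns(trips, ["store_and_fwd_flag"])
print(slim)
```

The sketch returns new rows rather than mutating its input, which matches the usual expectation that a dataset transformation produces a new dataset.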

Related issue number

Closes #26113

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

~~~~~~~~~~~~~~~~~~~

- Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+ Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)
c21 (Contributor Author) commented

"Selection pushdown" is confusing in the data world, since to me "selection" normally means projection. Other systems (such as Spark, Presto, and Parquet) always use "filter pushdown". We also use "filter pushdown" in other places, e.g. here

jianoaix (Contributor) commented

Yeah, "selection" means different things in SQL vs. relational algebra. Using "filter" seems like a good choice.

clarkzinzow (Contributor) commented

Agreed here!
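The terminology clash in this thread can be illustrated with a small plain-Python sketch. In relational algebra, selection keeps the rows that satisfy a predicate, while projection keeps a subset of columns (which is what the column list of a SQL SELECT does). "Filter pushdown" refers to the row-filtering kind, applied while the data is being read. The toy reader `read_rows`, its parameters, and the sample rows are hypothetical and are not part of Ray or Parquet.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def read_rows(
    source: Iterable[Row],
    columns: Optional[List[str]] = None,
    row_filter: Optional[Callable[[Row], bool]] = None,
) -> Iterator[Row]:
    """Toy reader that applies a row filter and a column list during the scan."""
    for row in source:
        if row_filter is not None and not row_filter(row):
            continue  # filter pushdown: the row is skipped while scanning
        if columns is not None:
            row = {c: row[c] for c in columns}  # projection pushdown: keep listed columns
        yield row


# Made-up sample data.
sample = [
    {"id": 1, "fare": 5.0, "tip": 0.0},
    {"id": 2, "fare": 12.5, "tip": 2.0},
    {"id": 3, "fare": 30.0, "tip": 6.0},
]

# Row-wise: relational-algebra "selection", i.e. a filter.
filtered = list(read_rows(sample, row_filter=lambda r: r["fare"] > 10))

# Column-wise: projection, i.e. what a SQL SELECT column list does.
projected = list(read_rows(sample, columns=["id", "fare"]))

print(filtered)
print(projected)
```

Because "selection" can be read either way depending on background, naming the row-wise case "filter" avoids the ambiguity.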

jianoaix (Contributor) left a comment

~~~~~~~~~~~~~~~~~~~

- Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+ Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)
Yeah, "selection" means different things in SQL vs. relational algebra. Using "filter" seems like a good choice.

c21 (Contributor Author) commented Jul 6, 2022

nit: there seems to be a typo in the API comment (closing bracket): https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/dataset.py?L564

@jianoaix - ah good catch, fixed it.

jianoaix (Contributor) commented Jul 7, 2022

@clarkzinzow review or merge?

clarkzinzow (Contributor) left a comment

LGTM, and many thanks for the drivebys!

:ref:`tensor data guide <datasets_tensor_support>` for more information on working
with tensors in Datasets. Although this simple example demonstrates reading a single
- file, note that Datasets can also read directories of JSON files, with one tensor
+ file, note that Datasets can also read directories of NumPy files, with one tensor

Nice catch!

~~~~~~~~~~~~~~~~~~~

- Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown)
+ Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (filter pushdown)

Agreed here!

clarkzinzow merged commit 4e674b6 into ray-project:master on Jul 8, 2022
c21 (Contributor Author) commented Jul 8, 2022

Thank you @clarkzinzow and @jianoaix for the review!

c21 deleted the doc branch July 8, 2022 00:20
truelegion47 pushed a commit to truelegion47/ray that referenced this pull request Jul 9, 2022
* master: (42 commits)
  [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408)
  [Doc] Fix docs feedback button (ray-project#26402)
  [core][1/2] Improve liveness check in GCS  (ray-project#26405)
  [RLlib] Checkpoint and restore connectors. (ray-project#26253)
  [Workflow] Minor refactoring of workflow exceptions (ray-project#26398)
  [workflow] Workflow queue (ray-project#24697)
  [RLlib] Minor simplification of code. (ray-project#26312)
  [AIR] Update TensorflowPredictor to new API (ray-project#26215)
  [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304)
  [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314)
  [Doc] Fix rate-the-docs plugin (ray-project#26384)
  [Docs] [Serve] Has a consistent landing page style (ray-project#26029)
  [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297)
  [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805)
  [tune] Fix `SyncerCallback` having a size limit (ray-project#26371)
  [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351)
  [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330)
  [Doc] Fix actor example (ray-project#26381)
  Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366)
  [Datasets] Update docs for drop_columns and fix typos (ray-project#26317)
  ...
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
)

We added drop_columns() API to datasets in ray-project#26200, so updating documentation here to use the new API - doc/source/data/examples/nyc_taxi_basic_processing.ipynb. In addition, fixing some minor typos after proofreading the datasets documentation.

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

Development

Successfully merging this pull request may close these issues.

[Datasets] Add drop_column() to Dataset

4 participants