ParticleSpecies: Read to dask.dataframe #935

Merged
ax3l merged 4 commits into openPMD:dev from ax3l:topic-dask
Mar 17, 2021

Conversation

@ax3l (Member) commented Feb 25, 2021

Add a method that reads a particle species into a dask.dataframe.

Feel the power 🔥

Cheers to @dmitry-ganyushin for helping with this!


# example2: momentum histogram
h, bins = da.histogram(df["momentum_z"], bins=50, range=[-8.0e-23, 8.0e-23])
# weights=df["weighting"]
@ax3l (Member, Author) commented Feb 25, 2021

Some issue if I pass this argument: I get an error along the lines of "Series has no attribute chunks" deep inside Dask... Not sure if it refers to our Series, though - I think not 😅

Also, let's ask the RAPIDS team tomorrow if we can also generate 2D and ND histograms. That would be tremendously helpful, but I cannot spot such a function in the docs.
Update: opened as dask/dask#7307

@ax3l (Member, Author)

It's a dask.dataframe.core.Series that does not have the .chunks attribute...

@ax3l (Member, Author)

Adding .to_dask_array() fixes this.

Comment on lines +67 to +69
# TODO: implement available_chunks for constant record components
# and fall back to a single, big chunk here
if chunks is None:
@ax3l (Member, Author)

@franzpoeschel I tried to query available_chunks from a constant BaseRecordComponent and realized this throws a backend error.

Probably the cleanest way for us to handle this would be to check for constant() in the frontend and return the full extent as a single chunk in that case - what do you think?
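A hedged sketch of that frontend fallback (not openPMD-api's actual implementation; `_Chunk` and the attributes assumed on `rc` are stand-ins for the real ChunkInfo and record component):

```python
from dataclasses import dataclass

@dataclass
class _Chunk:
    # stand-in for openPMD-api's ChunkInfo (offset/extent per dimension)
    offset: list
    extent: list

def safe_available_chunks(rc):
    """Return backend chunks, or one full-extent chunk for constants."""
    if rc.constant:
        # constant components have no backend chunks (querying them
        # throws a backend error), so fall back to a single chunk
        # spanning the whole dataset
        return [_Chunk(offset=[0] * len(rc.shape), extent=list(rc.shape))]
    return rc.available_chunks()

# quick demonstration with a fake constant record component
class _ConstantRC:
    constant = True
    shape = (1000,)

chunks = safe_available_chunks(_ConstantRC())
```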

Contributor

Ah good catch, yeah that's probably the best solution.

@ax3l (Member, Author) commented Mar 12, 2021

Implemented in #942 🎉


for k_r, r in particle_species.items():
for k_rc, rc in r.items():
if not rc.constant:
chunks = rc.available_chunks()
Contributor

This assumes that chunks are equal across components. What happens if they're not? Will things just be less efficient, or will they not work? In the latter case, we should probably guard against it and throw an error.

@ax3l (Member, Author) commented Mar 3, 2021

Yep, will just be less efficient.
(Also very unlikely.)
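If one did want the guard the reviewer suggests, it could look roughly like this sketch (a judgment call, not PR code; the fake classes only mimic the species -> record -> record-component nesting iterated over above):

```python
def _signature(chunks):
    # hashable summary of one component's chunk decomposition
    return tuple((tuple(c.offset), tuple(c.extent)) for c in chunks)

def chunks_agree(particle_species):
    """True if all non-constant components report the same chunks."""
    signatures = set()
    for r in particle_species.values():
        for rc in r.values():
            if not rc.constant:
                signatures.add(_signature(rc.available_chunks()))
    return len(signatures) <= 1

# fakes standing in for openPMD record components and chunk infos
class _Chunk:
    def __init__(self, offset, extent):
        self.offset = offset
        self.extent = extent

class _RC:
    constant = False
    def __init__(self, chunks):
        self._chunks = chunks
    def available_chunks(self):
        return self._chunks

uniform = {"position": {"x": _RC([_Chunk([0], [100])]),
                        "y": _RC([_Chunk([0], [100])])}}
mixed = {"position": {"x": _RC([_Chunk([0], [100])]),
                      "y": _RC([_Chunk([0], [50]), _Chunk([50], [50])])}}
```

A caller could raise or warn when `chunks_agree(...)` returns False instead of silently picking the last component's chunks.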

"implemented, use pandas dataframes.")

def read_chunk(species, chunk):
stride = np.s_[chunk.offset[0]:chunk.extent[0]]
Contributor

Similarly, this assumes that we are dealing with particle data (and hence 1D). Is this checked? We do have a Python class <openPMD.ParticleSpecies>, so this could theoretically be guarded against.
(Or are those lines enough checking?):

ParticleSpecies.to_df = particles_to_dataframe  # noqa
ParticleSpecies.to_dask = particles_to_daskdataframe  # noqa

@ax3l (Member, Author) commented Mar 3, 2021

Good idea to check that the chunk is 1D, yep.

Since this is implemented as species and ParticleSpecies, there should be little chance to accidentally pass in a field. Particle arrays are always 1D.
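A minimal sketch of that 1D guard (assumption: `chunk` carries per-dimension `offset`/`extent` lists as in openPMD-api's ChunkInfo; the function name is illustrative):

```python
from types import SimpleNamespace

def check_chunk_is_1d(chunk):
    # particle record components are always one-dimensional; anything
    # else suggests a mesh/field was passed in by mistake
    if len(chunk.offset) != 1 or len(chunk.extent) != 1:
        raise RuntimeError(
            "expected a 1D particle chunk, got offset={} extent={}".format(
                chunk.offset, chunk.extent))

# a 1D chunk passes silently
check_chunk_is_1d(SimpleNamespace(offset=[0], extent=[100]))
```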

Contributor

Would suggest moving this to a module-level function instead of a closure. While closures should work, there's extra overhead in pickling closures vs. module-level functions.
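A quick illustration of the point (a sketch; run as a script): plain pickle serializes a module-level function cheaply by reference, but cannot pickle a closure at all, so dask falls back to cloudpickle, which ships the closure by value at extra cost:

```python
import pickle

def read_chunk_toplevel(species, chunk):
    # module-level: pickle stores just a reference to this name
    return species, chunk

def make_reader():
    state = []
    def read_chunk_closure(species, chunk):
        # closure over `state`: plain pickle refuses local objects
        state.append(chunk)
        return species, chunk
    return read_chunk_closure

payload = pickle.dumps(read_chunk_toplevel)  # serialized by reference

try:
    pickle.dumps(make_reader())
    closure_ok = True
except Exception:
    closure_ok = False  # plain pickle rejects the closure
```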

@ax3l (Member, Author)

Thank you for the review & continued guidance!
Fixed in #951

lgtm-com bot commented Mar 11, 2021

This pull request introduces 1 alert when merging 8c5c367 into 6e3b8b2 - view on LGTM.com

new alerts:

  • 1 for Unused import

# example1: average momentum in z
print("<momentum_z>={}".format(df["momentum_z"].mean().compute()))

# example2: momentum histogram
@ax3l (Member, Author) commented Mar 12, 2021

ax3l and others added 3 commits March 16, 2021 21:50
Add a method that reads a particle species into a `dask.dataframe`.

Feel the power 🔥

Co-authored-by: Dmitry Ganyushin <ganyushin@gmail.com>

lgtm-com bot commented Mar 17, 2021

This pull request introduces 1 alert when merging 62a5df4 into 24058e0 - view on LGTM.com

new alerts:

  • 1 for Unused import

If all records are constant, use one large chunk.

lgtm-com bot commented Mar 17, 2021

This pull request introduces 1 alert when merging 0f5090a into 24058e0 - view on LGTM.com

new alerts:

  • 1 for Unused import

@ax3l (Member, Author) commented Mar 17, 2021

I think I found what we need for meshes: https://docs.dask.org/en/latest/array-api.html?highlight=from_array#other-functions

dask.array.from_array(x, chunks='auto', name=None, lock=False, asarray=None, fancy=True, getitem=None, meta=None, inline_array=False)

Create a dask array from something that looks like an array. Input must have a .shape, .ndim, .dtype and support numpy-style slicing.

Parameters:
  x : array_like
  chunks : int, tuple
    How to chunk the array. Must be one of the following forms:
    * A blocksize like 1000.
    * A blockshape like (1000, 1000).
    * Explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400)).
    * A size in bytes, like "100 MiB", which will choose a uniform block-like shape.
    * The word "auto", which acts like the above but uses the configuration value array.chunk-size for the chunk size.
    -1 or None as a blocksize indicates the full size of the corresponding dimension.
  ...
  asarray : bool, optional
    If True, call np.asarray on chunks to convert them to numpy arrays. If False, chunks are passed through unchanged. If None (default), use True if the __array_function__ method is undefined.
  ...
  fancy : bool, optional
    If x doesn't support fancy indexing (e.g. indexing with lists or arrays), set this to False. Default is True.
  ...

Not entirely sure yet how to tell it that it needs to call our .flush() once it really reads those blocks, assuming that the read is delayed... I think we need to write a Record_Component wrapper for Dask that calls .flush() on .asarray()...
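One way such a wrapper could look, as a hedged sketch (class and attribute names are assumptions, not a final API): expose `.shape`, `.ndim` and `.dtype` plus numpy-style slicing, and call `series.flush()` inside `__getitem__` so the deferred openPMD read completes when dask actually pulls a block. A plain numpy array and a no-op `flush()` stand in for the real record component and Series here:

```python
import numpy as np
import dask.array as da

class DaskRecordComponent:
    """Array-like view for da.from_array, flushing on block reads."""

    def __init__(self, record_component, series):
        self.rc = record_component
        self.series = series

    @property
    def shape(self):
        return tuple(self.rc.shape)

    @property
    def ndim(self):
        return len(self.rc.shape)

    @property
    def dtype(self):
        return self.rc.dtype

    def __getitem__(self, idx):
        chunk = self.rc[idx]   # with openPMD-api this queues the load
        self.series.flush()    # ...and this performs the actual read
        return np.asarray(chunk)

class _NoopSeries:
    # stand-in for openPMD.Series in this demo
    def flush(self):
        pass

wrapped = DaskRecordComponent(np.arange(10.0), _NoopSeries())
darr = da.from_array(wrapped, chunks=5)
total = float(darr.sum().compute())  # 45.0 for the values 0..9
```

Per the from_array docs quoted above, this satisfies the required protocol (.shape, .ndim, .dtype, numpy-style slicing), so dask only triggers the flush when a block is materialized.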

@ax3l ax3l merged commit 15a98c7 into openPMD:dev Mar 17, 2021
@ax3l ax3l deleted the topic-dask branch March 17, 2021 07:48
@ax3l ax3l mentioned this pull request Apr 8, 2021

Labels: api: new additions to the API, frontend: Python3