ParticleSpecies: Read to dask.dataframe #935

Merged
ax3l merged 4 commits into openPMD:dev from ax3l:topic-dask
Mar 17, 2021

Conversation

@ax3l (Member) commented Feb 25, 2021

Add a method that reads a particle species into a dask.dataframe.

Feel the power 🔥

Cheers to @dmitry-ganyushin for helping with this!


# example2: momentum histogram
h, bins = da.histogram(df["momentum_z"], bins=50, range=[-8.0e-23, 8.0e-23])
# weights=df["weighting"]
@ax3l (Member, Author) commented Feb 25, 2021

Some issue if I pass this argument: I get an error along the lines of "Series has no attribute chunks" deep inside Dask... Not sure if it refers to our Series, though - I think not 😅

Also, let's ask the RAPIDS team tomorrow if we can also generate 2D and ND histograms. That would be tremendously helpful, but I cannot spot such a function in the docs.
Update: opened as dask/dask#7307

@ax3l (Member, Author)

It's a dask.dataframe.core.Series that does not have the .chunks attribute...

@ax3l (Member, Author)

Adding .to_dask_array() fixes this.

Comment on lines +67 to +69
# TODO: implement available_chunks for constant record components
# and fall back to a single, big chunk here
if chunks is None:
@ax3l (Member, Author)

@franzpoeschel I tried to query available_chunks from a constant BaseRecordComponent and realized this throws a backend error.

Probably the cleanest way for us to handle this would be to check for constant() in the frontend and return the full extent as a single chunk in that case - what do you think?
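A hedged sketch of that frontend fallback (not openPMD-api's actual implementation; `_Chunk` and the attributes assumed on `rc` are stand-ins for the real ChunkInfo and record component):

```python
from dataclasses import dataclass

@dataclass
class _Chunk:
    # stand-in for openPMD-api's ChunkInfo (offset/extent per dimension)
    offset: list
    extent: list

def safe_available_chunks(rc):
    """Return backend chunks, or one full-extent chunk for constants."""
    if rc.constant:
        # constant components have no backend chunks (querying them
        # throws a backend error), so fall back to a single chunk
        # spanning the whole dataset
        return [_Chunk(offset=[0] * len(rc.shape), extent=list(rc.shape))]
    return rc.available_chunks()

# quick demonstration with a fake constant record component
class _ConstantRC:
    constant = True
    shape = (1000,)

chunks = safe_available_chunks(_ConstantRC())
```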

Contributor

Ah good catch, yeah that's probably the best solution.

@ax3l (Member, Author) commented Mar 12, 2021

Implemented in #942 🎉


for k_r, r in particle_species.items():
for k_rc, rc in r.items():
if not rc.constant:
chunks = rc.available_chunks()
Contributor

This assumes that chunks are equal across components. What happens if they're not? Will things just be less efficient, or will they not work? In the latter case, we should probably guard against it and throw an error.

@ax3l (Member, Author) commented Mar 3, 2021

Yep, will just be less efficient.
(Also very unlikely.)
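If one did want the guard the reviewer suggests, it could look roughly like this sketch (a judgment call, not PR code; the fake classes only mimic the species -> record -> record-component nesting iterated over above):

```python
def _signature(chunks):
    # hashable summary of one component's chunk decomposition
    return tuple((tuple(c.offset), tuple(c.extent)) for c in chunks)

def chunks_agree(particle_species):
    """True if all non-constant components report the same chunks."""
    signatures = set()
    for r in particle_species.values():
        for rc in r.values():
            if not rc.constant:
                signatures.add(_signature(rc.available_chunks()))
    return len(signatures) <= 1

# fakes standing in for openPMD record components and chunk infos
class _Chunk:
    def __init__(self, offset, extent):
        self.offset = offset
        self.extent = extent

class _RC:
    constant = False
    def __init__(self, chunks):
        self._chunks = chunks
    def available_chunks(self):
        return self._chunks

uniform = {"position": {"x": _RC([_Chunk([0], [100])]),
                        "y": _RC([_Chunk([0], [100])])}}
mixed = {"position": {"x": _RC([_Chunk([0], [100])]),
                      "y": _RC([_Chunk([0], [50]), _Chunk([50], [50])])}}
```

A caller could raise or warn when `chunks_agree(...)` returns False instead of silently picking the last component's chunks.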

"implemented, use pandas dataframes.")

def read_chunk(species, chunk):
stride = np.s_[chunk.offset[0]:chunk.extent[0]]
Contributor

Similarly, this assumes that we are dealing with particle data (and hence 1D). Is this checked? We do have a Python class <openPMD.ParticleSpecies>, so this could theoretically be guarded against.
(Or are those lines enough checking?):

ParticleSpecies.to_df = particles_to_dataframe  # noqa
ParticleSpecies.to_dask = particles_to_daskdataframe  # noqa

@ax3l (Member, Author) commented Mar 3, 2021

Good idea to check that the chunk is 1D, yep.

Since this is implemented as species and ParticleSpecies, there should be little chance to accidentally pass in a field. Particle arrays are always 1D.
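A minimal sketch of that 1D guard (assumption: `chunk` carries per-dimension `offset`/`extent` lists as in openPMD-api's ChunkInfo; the function name is illustrative):

```python
from types import SimpleNamespace

def check_chunk_is_1d(chunk):
    # particle record components are always one-dimensional; anything
    # else suggests a mesh/field was passed in by mistake
    if len(chunk.offset) != 1 or len(chunk.extent) != 1:
        raise RuntimeError(
            "expected a 1D particle chunk, got offset={} extent={}".format(
                chunk.offset, chunk.extent))

# a 1D chunk passes silently
check_chunk_is_1d(SimpleNamespace(offset=[0], extent=[100]))
```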

Contributor

Would suggest moving this to a module-level function instead of a closure. While closures should work, there's extra overhead in pickling closures vs. module-level functions.
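A quick illustration of the point (a sketch; run as a script): plain pickle serializes a module-level function cheaply by reference, but cannot pickle a closure at all, so dask falls back to cloudpickle, which ships the closure by value at extra cost:

```python
import pickle

def read_chunk_toplevel(species, chunk):
    # module-level: pickle stores just a reference to this name
    return species, chunk

def make_reader():
    state = []
    def read_chunk_closure(species, chunk):
        # closure over `state`: plain pickle refuses local objects
        state.append(chunk)
        return species, chunk
    return read_chunk_closure

payload = pickle.dumps(read_chunk_toplevel)  # serialized by reference

try:
    pickle.dumps(make_reader())
    closure_ok = True
except Exception:
    closure_ok = False  # plain pickle rejects the closure
```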

@ax3l (Member, Author)

Thank you for the review & continued guidance!
Fixed in #951

lgtm-com bot commented Mar 11, 2021

This pull request introduces 1 alert when merging 8c5c367 into 6e3b8b2 - view on LGTM.com

new alerts:

  • 1 for Unused import

# example1: average momentum in z
print("<momentum_z>={}".format(df["momentum_z"].mean().compute()))

# example2: momentum histogram
@ax3l (Member, Author) commented Mar 12, 2021

ax3l and others added 3 commits March 16, 2021 21:50
Add a method that reads a particle species into a `dask.dataframe`.

Feel the power 🔥

Co-authored-by: Dmitry Ganyushin <ganyushin@gmail.com>

lgtm-com bot commented Mar 17, 2021

This pull request introduces 1 alert when merging 62a5df4 into 24058e0 - view on LGTM.com

new alerts:

  • 1 for Unused import

If all records are constant, use one large chunk.

lgtm-com bot commented Mar 17, 2021

This pull request introduces 1 alert when merging 0f5090a into 24058e0 - view on LGTM.com

new alerts:

  • 1 for Unused import

@ax3l (Member, Author) commented Mar 17, 2021

I think I found what we need for meshes: https://docs.dask.org/en/latest/array-api.html?highlight=from_array#other-functions

dask.array.from_array(x, chunks='auto', name=None, lock=False, asarray=None, fancy=True, getitem=None, meta=None, inline_array=False)

Create a dask array from something that looks like an array. Input must have a .shape, .ndim, .dtype and support numpy-style slicing.

Parameters:
  x : array_like
  chunks : int, tuple
    How to chunk the array. Must be one of the following forms:
    * A blocksize like 1000.
    * A blockshape like (1000, 1000).
    * Explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400)).
    * A size in bytes, like "100 MiB", which will choose a uniform block-like shape.
    * The word "auto", which acts like the above but uses the configuration value array.chunk-size for the chunk size.
    -1 or None as a blocksize indicates the full size of the corresponding dimension.
  ...
  asarray : bool, optional
    If True, call np.asarray on chunks to convert them to numpy arrays. If False, chunks are passed through unchanged. If None (default), use True if the __array_function__ method is undefined.
  ...
  fancy : bool, optional
    If x doesn't support fancy indexing (e.g. indexing with lists or arrays), set this to False. Default is True.
  ...

Not entirely sure yet how to tell it that it needs to call our .flush() once it really reads those blocks, assuming that the read is delayed... I think we need to write a Record_Component wrapper for Dask that calls .flush() on .asarray()...
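One way such a wrapper could look, as a hedged sketch (class and attribute names are assumptions, not a final API): expose `.shape`, `.ndim` and `.dtype` plus numpy-style slicing, and call `series.flush()` inside `__getitem__` so the deferred openPMD read completes when dask actually pulls a block. A plain numpy array and a no-op `flush()` stand in for the real record component and Series here:

```python
import numpy as np
import dask.array as da

class DaskRecordComponent:
    """Array-like view for da.from_array, flushing on block reads."""

    def __init__(self, record_component, series):
        self.rc = record_component
        self.series = series

    @property
    def shape(self):
        return tuple(self.rc.shape)

    @property
    def ndim(self):
        return len(self.rc.shape)

    @property
    def dtype(self):
        return self.rc.dtype

    def __getitem__(self, idx):
        chunk = self.rc[idx]   # with openPMD-api this queues the load
        self.series.flush()    # ...and this performs the actual read
        return np.asarray(chunk)

class _NoopSeries:
    # stand-in for openPMD.Series in this demo
    def flush(self):
        pass

wrapped = DaskRecordComponent(np.arange(10.0), _NoopSeries())
darr = da.from_array(wrapped, chunks=5)
total = float(darr.sum().compute())  # 45.0 for the values 0..9
```

Per the from_array docs quoted above, this satisfies the required protocol (.shape, .ndim, .dtype, numpy-style slicing), so dask only triggers the flush when a block is materialized.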

@ax3l ax3l merged commit 15a98c7 into openPMD:dev Mar 17, 2021
@ax3l ax3l deleted the topic-dask branch March 17, 2021 07:48
@ax3l ax3l mentioned this pull request Apr 8, 2021

Labels: api: new additions to the API, frontend: Python3