Skip to content

Conversation

@grlee77
Copy link
Contributor

@grlee77 grlee77 commented Nov 17, 2021

This PR contains changes to storage.py and meta.py for v3 support. I split this out from the full v3 branch and will have follow up PRs adding v3 support to the Array and Group classes as well as the higher level routines.

This PR defines StoreV3 versions of most existing Store types (no N5Store or ABSStore, currently). This will need some rounds of revision/feedback and there are a number of "TODO" comments in code where I was unsure about desired behavior of some aspect.

For the test cases, I tried to avoid a lot of the duplication by subclassing the v2 test classes, just overriding methods where it was necessary to make changes. These v3 classes are currently in a separate file, but could instead just be appended to the existing file if that is preferred. Another possible approach is to try keeping just the existing classes, but add parameterization based on the Zarr version.

For test coverage, there are a handful of functions like rmdir that support StoreLike inputs that only have test coverage via higher-level functions which are not yet included in this PR, so there will currently be some reduction in test coverage unless that code is removed or we add test cases.

[Description of PR]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Define the StoreV3 class and create v3 versions of most existing stores

Add a test_storage_v3.py with test classes inheriting from their v2
counterparts. Only a subset of methods involving differences in v3
behavior were overridden.
@codecov
Copy link

codecov bot commented Nov 17, 2021

Codecov Report

Merging #874 (ba09ed5) into master (5f7e2b6) will decrease coverage by 1.92%.
The diff coverage is 80.66%.

@@            Coverage Diff             @@
##           master     #874      +/-   ##
==========================================
- Coverage   99.94%   98.02%   -1.93%     
==========================================
  Files          32       33       +1     
  Lines       11216    12359    +1143     
==========================================
+ Hits        11210    12115     +905     
- Misses          6      244     +238     
Impacted Files Coverage Δ
zarr/_storage/store.py 69.58% <52.51%> (-30.42%) ⬇️
zarr/tests/test_storage.py 99.92% <66.66%> (-0.08%) ⬇️
zarr/meta.py 84.93% <72.02%> (-15.07%) ⬇️
zarr/storage.py 92.90% <74.10%> (-7.10%) ⬇️
zarr/tests/test_storage_v3.py 96.80% <96.80%> (ø)
zarr/tests/util.py 100.00% <100.00%> (ø)

@joshmoore
Copy link
Member

@grlee77 : while you hammer out the actions fun, is there anything in particular you could use feedback on?

@grlee77
Copy link
Contributor Author

grlee77 commented Nov 30, 2021

while you hammer out the actions fun, is there anything in particular you could use feedback on?

I will try to summarize a bit here later today

@grlee77
Copy link
Contributor Author

grlee77 commented Dec 2, 2021

Probably my biggest question regarding this PR is how to handle extended data types. I have not done this in a way consistent with the spec yet. Specifically there is a description of how data_type metadata should handle this description here. The issue is that we do not have a URI ("value must be a URI that identifies the protocol extension") yet that we can point to?

The _v3_core_type dict in meta.yml contains all the dtypes in the core v3 specification. The other types we need for various tests are:

  • complex (NumPy dtype.kind == 'c')
  • DateTime (NumPy dtype.kind in 'mM')
  • NumPy object arrays (NumPy dtype.kind == 'O')
  • NumPy structured arrays (NumPy dtype.kind == 'V')
  • unicode arrays (NumPy dtype.kind == 'U')
  • bytes arrays (NumPy dtype.kind == 'S')

The unicode and byte array cases are currently not covered in our current Zarr test suite, but they are used in the Xarray test suite downstream.

Does each of these need to have its own separate extension URI that we can point to in the data_type metadata or can we just pace a nonexistent placeholder URI for these for now? Otherwise, I can see this taking quite a while to resolve...

I don't think we should be introducing all of the ZARR_V3_ALLOW_... environment variables near the top of meta.yml in this PR and think we should probably remove those. I was using that initially in testing, but doubt we want all of those as public.

@grlee77
Copy link
Contributor Author

grlee77 commented Dec 2, 2021

Actually, I see now that there are indeed already (draft) extensions for two of the above here:
https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/extensions.html

Let me update this PR to use data_type extensions properly

@joshmoore
Copy link
Member

Let me update this PR to use data_type extensions properly

Sounds good. I do think those URIs are fairly key to the extension mechanism, so before we get too far we should have the most core ones well-defined as good examples.

cc: @alimanfoo

@pep8speaks
Copy link

pep8speaks commented Dec 15, 2021

Hello @grlee77! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-21 11:49:10 UTC

@grlee77 grlee77 force-pushed the v3-store-part1 branch 2 times, most recently from c24305c to 9950a4d Compare December 15, 2021 00:49
@grlee77
Copy link
Contributor Author

grlee77 commented Dec 15, 2021

Sounds good. I do think those URIs are fairly key to the extension mechanism, so before we get too far we should have the most core ones well-defined as good examples.

I have not worked on any new URI text yet, but I did update meta.yml here to read and write data_type metadata with the format described in the v3 spec for extensions. Specifically, for the core dtypes, the JSON just has a string such as >f8, but for any extended type the entry is a dict with 3 keys: extension, type and fallback as specified in this section of the spec. The spec just states that it must be "an object containing the names extension, type and fallback" so a dict seemed an easy JSON-compatible way to store that.

The newly introduced get_extended_dtype_info function converts a numpy dtype's string representation to a corresponding dict of extension info. For complex and datetime64/timedelta64 types these point to the existing URI stubs. There are 4 other types that don't have a URL to point to at the moment 1.) object arrays, 2.) structured arrays, 3.) bytestring arrays, 4.) unicode arrays. Do you think those should be 4 separate extensions or should bytestring and unicode be combined into one?

@grlee77
Copy link
Contributor Author

grlee77 commented Dec 15, 2021

Also, are any of these cases where we should be specifying a fallback?. An array of complex128 can be read as a double-length array of type float64 for instance with real and imaginary values interleaved?

Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a dict seemed an easy JSON-compatible way to store that.

nods

should bytestring and unicode be combined into one?

I don't know the use case(s) for bytestring. Have you seen anything mentioned?

are any of these cases where we should be specifying a fallback?. An array of complex128 can be read as a double-length array of type float64 for instance with real and imaginary values interleaved?

Is the fallback the extent of "extension subclassing"? If so, it would be good to have examples. So far it's always been time->long or similar. The complex -> 2 floats would certainly test the concept further.

cc: @DennisHeimbigner @alimanfoo

from typing import cast, Union, Any, List, Mapping as MappingType, Optional

ZARR_FORMAT = 2
ZARR_FORMAT_v3 = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is realistic, but I could see also adding ZARR_FORMAT_V2 such that ZARR_FORMAT=ZARR_FORMAT_V2 defines the current default with the hope of swapping that at some (well-defined?) point in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be fine to add that. not sure how this is used

currently ZARR_FORMAT_v3 is only used once to set the ZARR_FORMAT attribute in the Metadata3 class so we could just remove it if there is not another use for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the other ZARR_FORMAT is used mostly in test cases, but also when setting metadata in n5.py


@staticmethod
def _valid_key(key: str) -> bool:
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ docs!


# FLOAT_FILLS = {"NaN": np.nan, "Infinity": np.PINF, "-Infinity": np.NINF}

_default_entry_point_metadata_v3 = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a feeling how much this is going to grow? i.e. if we should plan for this to have its own module?

@grlee77
Copy link
Contributor Author

grlee77 commented Dec 17, 2021

I don't know the use case(s) for bytestring. Have you seen anything mentioned?

IIRC there is some test case in Xarray that currently tests zarr v2 with both unicode and bytestring arrays, but we don't currently have tests for either of those in this repository!

@DennisHeimbigner
Copy link

The usual case for bytestring is to hold something like an image.

grlee77 and others added 2 commits December 17, 2021 18:21
@grlee77
Copy link
Contributor Author

grlee77 commented Mar 4, 2022

closing, these commits along with additional fixes and tests are incorporated in #898

@grlee77 grlee77 closed this Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants