
JSON codec reshapes string arrays #76

@jeromekelleher

This is carrying on from zarr-developers/zarr-python#258

I've tried to come up with a minimal example, but it's tricky to illustrate without showing the context. Here is an interaction with zarr with some instrumentation in the encode/decode methods for json.

import numcodecs
import zarr

z = zarr.empty(2, dtype=object, object_codec=numcodecs.JSON(), chunks=(1,))
z[0] = ["11"]
z[1] = ["1", "1"]

print(z[:]) # Borks

output:

INPUT: (1,)
INPUT: (1,)
OUTPUT: (1, 1)
OUTPUT: (1, 2)
Traceback (most recent call last):
  File "dev.py", line 34, in <module>
    print(z[:]) # Borks
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1608, in _chunk_getitem
    chunk = self._decode_chunk(cdata)
  File "/home/jk/.local/lib/python3.5/site-packages/zarr/core.py", line 1751, in _decode_chunk
    chunk = chunk.reshape(self._chunks, order=self._order)
ValueError: cannot reshape array of size 2 into shape (1,)

The INPUT lines are the shapes of the input arrays to encode and the OUTPUT lines are the corresponding output shapes of the arrays from decode.

Problem description

When calling numpy.array([["s1", "s2"], ["s3", "s4"]], dtype=object), numpy is quite aggressive about reshaping the array to store things more efficiently: it infers a 2-D array of shape (2, 2) rather than a 1-D array holding two lists.
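To make the reshaping concrete, here is a minimal sketch that uses only numpy and the standard json module. The round trip through tolist() and numpy.array() is an assumption standing in for what the codec does internally; it is not a copy of the numcodecs implementation.

```python
import json

import numpy as np

# numpy infers a 2-D array from equal-length nested lists, even with dtype=object.
nested = np.array([["s1", "s2"], ["s3", "s4"]], dtype=object)
print(nested.shape)  # (2, 2)

# Build a 1-element chunk holding a single list, as z[1] = ["1", "1"] would.
chunk = np.empty(1, dtype=object)
chunk[0] = ["1", "1"]
print(chunk.shape)  # (1,)

# Assumed round trip: array -> nested Python lists -> JSON text -> array.
payload = json.dumps(chunk.tolist())
decoded = np.array(json.loads(payload), dtype=object)
print(decoded.shape)  # (1, 2)
```

The decoded array has two elements, so it can no longer be reshaped to the chunk shape (1,), which matches the ValueError in the traceback above.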

I've played around with this a fair bit, and I think the only options are to

  1. Drop the numpy dependency in the encoding and decoding steps for JSON (i.e., don't include the dtype in the JSON encoding), and provide the supplied argument directly to the JSON encoder (and conversely, return the value of json.loads() directly from decode).

  2. Also encode the input array shape in the JSON encoding.

Both of these options are ugly because they break backward compatibility. I'll make a PR for demonstrating option 2 in a minute for discussion.
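As a rough illustration of option 2, the sketch below stores the shape next to the flattened items and rebuilds the array element by element on decode. The function names encode_with_shape and decode_with_shape and the layout of the JSON document are hypothetical, chosen only for this example; they are not the format used by numcodecs or by the PR mentioned above.

```python
import json

import numpy as np


def encode_with_shape(arr):
    """Hypothetical encoder: keep the items, dtype and original shape."""
    arr = np.asarray(arr)
    doc = {
        "items": arr.ravel().tolist(),
        "dtype": arr.dtype.str,
        "shape": list(arr.shape),
    }
    return json.dumps(doc)


def decode_with_shape(payload):
    """Hypothetical decoder: refill a flat array, then restore the shape."""
    doc = json.loads(payload)
    out = np.empty(len(doc["items"]), dtype=doc["dtype"])
    # Assign one element at a time so numpy never inspects nested lists.
    for i, item in enumerate(doc["items"]):
        out[i] = item
    return out.reshape(doc["shape"])


chunk = np.empty(1, dtype=object)
chunk[0] = ["1", "1"]
decoded = decode_with_shape(encode_with_shape(chunk))
print(decoded.shape)  # (1,)
```

Filling a preallocated array element by element sidesteps the shape inference in numpy.array(), at the cost of a Python-level loop and a changed on-disk format.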
