Chardata first working on feature branch #6975
pp-mo wants to merge 57 commits into SciTools:FEATURE_chardata from
Conversation
…Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; adding parts of old investigation; add temporary notes.
…or overlength writes.
…t_cf_var_data' function.
Looking pretty good. 👍🏼
I've got a few comments, questions and suggestions.
I have not looked at the tests yet - just the main Iris code. I thought it was worth submitting the review at this point so you can see the comments. I'll take a look at the tests next.
Also - remind me - what are we doing in the case where data is stored as a netCDF string type - i.e. the variable length string type? At the moment that just loads in as an object array in numpy. Were we just leaving that as-is? We can't write that kind of datatype in Iris.
E.g. a netcdf file like this:
netcdf varlen_str_nc {
dimensions:
len = 4 ;
variables:
string strarr(len) ;
data:
strarr = "a", "bb", "ccc", "dddd"
}
    # string width, depending on the (read) encoding used.
    encoding = self.read_encoding
    if "utf-16" in encoding:
        # Each char needs at least 2 bytes -- including a terminator char
AFAIK, there is no "terminator char"
UTF-16 and UTF-32 both start with an endian indicator (the byte-order mark), which is the 2-byte pattern 0xFF, 0xFE.
The order of those two bytes encodes the endian of the Unicode string.
The UTF-32 encoding has this padded out to 4 bytes.
You can see this by encoding an empty string:
"".encode("utf-16")
>> b'\xff \xfe'
"".encode("utf-32")
>> b'\xff \xfe \x00 \x00' # same, but padded with zeros for second two bytes.
The explicit -be and -le (big/little endian) versions of these encodings don't require this endian marker at the beginning of the string, as the endian is explicit. Similarly, there is no need for it in utf-8, as the endian doesn't matter (single-byte units).
So, I think the comment should say something like:

- # Each char needs at least 2 bytes -- including a terminator char
+ # Each char needs at least 2 bytes, but first two bytes are used for endian encoding
and something similar for UTF32.
I guess the only issue here is that with UTF-8 and UTF-16 encodings, if there are non-ascii characters in the string we will always over-estimate the string length. I can't see how we can avoid this, though.
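A quick standalone check of the BOM behaviour described above (nothing Iris-specific; the plain "utf-16"/"utf-32" codecs emit the marker in the platform's native byte order, so only its length is shown as certain here):

```python
# Plain utf-16/utf-32 prepend an endian marker (BOM); the endian-explicit
# -le/-be variants, and utf-8, do not.
for codec in ("utf-16", "utf-16-le", "utf-16-be", "utf-32", "utf-8"):
    encoded = "".encode(codec)
    print(f"{codec}: {len(encoded)} byte(s) for an empty string -> {encoded!r}")
```

Running this shows 2 bytes of BOM for utf-16, 4 for utf-32, and zero for all the others.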
    self.perform_decoding = perform_decoding
    yield
    self.perform_decoding = old_setting
Does this need putting in a try...finally block to ensure it is reset if an exception occurs?
-    self.perform_decoding = perform_decoding
-    yield
-    self.perform_decoding = old_setting
+    try:
+        self.perform_decoding = perform_decoding
+        yield
+    finally:
+        self.perform_decoding = old_setting
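A minimal standalone sketch of why the try/finally matters (the `Flag` class here is a hypothetical stand-in for the wrapper under review, not the Iris code):

```python
from contextlib import contextmanager


class Flag:
    # Hypothetical stand-in for the wrapper class with a settable flag.
    def __init__(self):
        self.perform_decoding = True

    @contextmanager
    def decoding(self, perform_decoding):
        old_setting = self.perform_decoding
        try:
            self.perform_decoding = perform_decoding
            yield
        finally:
            # Runs even if the with-block raises, so the flag is always restored.
            self.perform_decoding = old_setting


flag = Flag()
try:
    with flag.decoding(False):
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(flag.perform_decoding)  # True: restored despite the exception
```

Without the try/finally, the exception would propagate out of the `yield` and the restore line would never run, leaving the flag stuck at the temporary value.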
    @property
    def dimensions(self):
        dimensions = self._contained_instance.dimensions
        is_chardata = np.issubdtype(self._contained_instance.dtype, np.bytes_)
We make this comparison several times in this class, better as a property?
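A sketch of what such a property could look like, assuming only that the wrapper holds a `_contained_instance` with a numpy-compatible dtype (class and attribute names are illustrative, not the actual Iris code):

```python
import numpy as np


class VariableWrapper:
    # Minimal illustrative stand-in for the wrapper class.
    def __init__(self, contained_instance):
        self._contained_instance = contained_instance

    @property
    def is_chardata(self):
        # One place to test for byte-character data, instead of repeating
        # the dtype check in every method that needs it.
        return np.issubdtype(self._contained_instance.dtype, np.bytes_)


chars = VariableWrapper(np.array([b"abc"], dtype="S3"))
nums = VariableWrapper(np.array([1.0]))
print(chars.is_chardata, nums.is_chardata)  # True False
```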
| """ | ||
| with _GLOBAL_NETCDF4_LOCK: | ||
| new_group = self._contained_instance.createGroup(*args, **kwargs) | ||
| return GroupWrapper.from_existing(new_group) |
Can you remind me why the GroupWrapper class now needs to be accessed indirectly via self.__class__?
I understand why you replace the VARIABLE_WRAPPER references, e.g. on line 222, but I can't quite see why the GroupWrapper needed changing here.
        return cf_mesh_name

    def _set_cf_var_attributes(self, cf_var, element):
        from iris.cube import Cube
How is ruff not complaining about this inline import?!
        if long_name is not None:
            _setncattr(cf_var, "long_name", long_name)
        # NB this bit is a nasty hack to preserve existing behaviour through a refactor:
        # The attributes for Coords are created in the order units, standard_name,
Why does this matter?
(I understand that it does, but I don't understand why!)
    encoding = element.attributes.get("_Encoding", "ascii")
    # TODO: this can fail -- use a sensible warning + default?
    encoding = codecs.lookup(encoding).name
    if encoding == "utf-32":
Handle the be and le variants of UTF32 here, i.e. utf-32-be and utf-32-le?
I don't know how often you would see explicit endian encodings in the wild (probably hardly ever?), but we could code defensively here?
| if encoding == "utf-32": | |
| if "utf-32" in encoding: |
Of course in the explicit endian versions the extra 4 bytes is not needed (that's for encoding the endian when using plain utf-32)
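For reference, `codecs.lookup` normalizes encoding aliases to a canonical lowercase name, so the `in` test above would catch the endian-explicit variants (a standalone check, independent of the Iris code):

```python
import codecs

# codecs.lookup normalizes aliases (case, underscores) to a canonical
# name, so a substring test on that name catches utf-32, utf-32-le and
# utf-32-be together.
for alias in ("UTF32", "utf_32", "utf-32-be", "utf-32-le"):
    name = codecs.lookup(alias).name
    print(f"{alias} -> {name} | matches: {'utf-32' in name}")
```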
    string_dimension_depth //= 4
    encoding = element.attributes.get("_Encoding", "ascii")
    # TODO: this can fail -- use a sensible warning + default?
    encoding = codecs.lookup(encoding).name
try block around this lookup?
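One possible shape for that (a hedged sketch: the helper name and the warning text are illustrative, and falling back to "ascii" simply mirrors the existing default):

```python
import codecs
import warnings


def safe_encoding_name(encoding, default="ascii"):
    # Hypothetical helper: codecs.lookup raises LookupError for an
    # unrecognised codec name, so fall back to a default with a warning.
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        warnings.warn(f"Unknown _Encoding {encoding!r}; assuming {default!r}.")
        return default


print(safe_encoding_name("UTF8"))         # utf-8
print(safe_encoding_name("not-a-codec"))  # ascii (after a warning)
```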
| if encoding == "utf-32": | ||
| # UTF-32 is a special case -- always 4 exactly bytes per char, plus 4 | ||
| string_dimension_depth += 4 | ||
| else: |
Are we not going to offer utf8 and utf16 here?
Also, UTF-8 saving only works if you don't have any non-ascii characters in your data; otherwise the calculated string length is too small.
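The UTF-8 sizing problem above is easy to demonstrate in isolation: a width computed as number-of-characters under-sizes the char dimension as soon as any character needs more than one byte.

```python
# Non-ascii characters take more than one byte each in UTF-8, so a
# character count under-estimates the required byte width.
for text in ("plain", "café", "naïve £"):
    encoded = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(encoded)} bytes")
```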
Successor to #6898
Now targeting (new) FEATURE_chardata feature branch in main repo
TODO: please check that any remaining unresolved issues on #6898 are now resolved here