
Chardata first working on feature branch#6975

Open
pp-mo wants to merge 57 commits into SciTools:FEATURE_chardata from pp-mo:chardata_plus_encoded_datasets

Conversation

@pp-mo
Member

@pp-mo pp-mo commented Mar 11, 2026

Successor to #6898
Now targeting the (new) FEATURE_chardata feature branch in the main repo.

TODO: please check that any remaining unresolved issues on #6898 are now resolved here

pp-mo added 30 commits January 19, 2026 11:49
…Mostly working?

Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; add in parts of old investigation; add temporary notes.
@pp-mo pp-mo requested a review from ukmo-ccbunney March 11, 2026 01:35
@pp-mo pp-mo changed the title Chardata plus encoded datasets Chardata first working on feature branch Mar 11, 2026
Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


Looking pretty good. 👍🏼
I've got a few comments, questions and suggestions.

I have not looked at the tests yet - just the main Iris code. I thought it was worth submitting the review at this point so you can see the comments. I'll take a look at the tests next.

Also - remind me - what are we doing in the case where data is stored as a netCDF string type - i.e. the variable length string type? At the moment that just loads in as an object array in numpy. Were we just leaving that as-is? We can't write that kind of datatype in Iris.

E.g. a netcdf file like this:

netcdf varlen_str_nc {
dimensions:
	len = 4 ;
variables:
	string strarr(len) ;
data:

 strarr = "a", "bb", "ccc", "dddd" ;
}
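For reference, that vlen string data comes back from netCDF4-python as a numpy object array, which falls outside the fixed-width "S"-dtype case the chardata handling targets. A numpy-only sketch of the distinction (no netCDF file needed; the example arrays are mine, not from the PR):

```python
import numpy as np

# Variable-length strings have no fixed itemsize, so numpy holds them as
# Python objects -- this is what netCDF4-python returns for vlen string data.
vlen_arr = np.array(["a", "bb", "ccc", "dddd"], dtype=object)

# Fixed-width char data, by contrast, is a bytes ("S") dtype of known width.
char_arr = np.array([b"a", b"bb", b"ccc", b"dddd"], dtype="S4")

assert vlen_arr.dtype == object
assert char_arr.dtype == np.dtype("S4")
# The np.bytes_ check used in the PR distinguishes the two cases.
assert np.issubdtype(char_arr.dtype, np.bytes_)
assert not np.issubdtype(vlen_arr.dtype, np.bytes_)
```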

# string width, depending on the (read) encoding used.
encoding = self.read_encoding
if "utf-16" in encoding:
# Each char needs at least 2 bytes -- including a terminator char
Contributor


AFAIK, there is no "terminator char".
UTF-16 and UTF-32 both start with a byte-order mark (BOM): the 2-byte pattern 0xFF, 0xFE (or 0xFE, 0xFF).
The order of those two bytes encodes the endianness of the Unicode string.
The UTF-32 encoding has this padded out to 4 bytes.
You can see this by encoding an empty string:

"".encode("utf-16")
>>  b'\xff\xfe'

"".encode("utf-32")
>> b'\xff\xfe\x00\x00'   # same, but padded with zeros in the second two bytes

The explicit -be and -le (big/little endian) versions of these encodings don't need the BOM at the beginning of the string, as the endianness is explicit. Similarly, there is no need for it in utf-8, where endianness doesn't matter (it is byte-oriented).
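This is easy to check directly; a minimal sketch using `codecs.BOM_UTF16`/`codecs.BOM_UTF32` (which hold the native-order BOM bytes, so the checks are platform-independent):

```python
import codecs

# The endian-agnostic codecs prepend a byte-order mark (BOM); the explicit
# -le/-be variants do not, and utf-8 is byte-oriented so it needs none.
bom16 = "".encode("utf-16")
bom32 = "".encode("utf-32")
assert bom16 == codecs.BOM_UTF16 and len(bom16) == 2
assert bom32 == codecs.BOM_UTF32 and len(bom32) == 4
assert "".encode("utf-16-le") == b""
assert "".encode("utf-32-be") == b""
assert "".encode("utf-8") == b""

# So an n-char BMP string costs 2 * (n + 1) bytes in plain utf-16:
# n chars at 2 bytes each, plus the 2-byte BOM (not a terminator).
assert len("abc".encode("utf-16")) == 2 * (3 + 1)
```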

So, I think the comment should say something like:

Suggested change
# Each char needs at least 2 bytes -- including a terminator char
# Each char needs at least 2 bytes, but first two bytes are used for endian encoding

and something similar for UTF32.

I guess the only issue here is that with UTF8 and UTF-16 encodings, if there are non-ascii characters in the string we will always over-estimate the string length. I can't see how we can avoid this though.
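A quick illustration of why that over-estimate is hard to avoid: character counts and byte counts diverge as soon as non-ascii characters appear, so any fixed bytes-per-char assumption must over-shoot for some strings (the example string here is mine, not from the PR):

```python
s = "d\u00e9f"  # "déf": 3 characters, one of them non-ascii
n_chars = len(s)
n_utf8 = len(s.encode("utf-8"))       # 4 bytes: "é" takes two in utf-8
n_utf16 = len(s.encode("utf-16-le"))  # 6 bytes: 2 per BMP char, no BOM
assert (n_chars, n_utf8, n_utf16) == (3, 4, 6)

# Sizing for the utf-8 worst case (4 bytes/char) always fits, but
# over-estimates by nearly 3x for mostly-ascii data like this.
worst_case_width = n_chars * 4
assert worst_case_width >= n_utf8
```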

Comment on lines +245 to +247
self.perform_decoding = perform_decoding
yield
self.perform_decoding = old_setting
Contributor


Does this need putting in a try...finally block to ensure it is reset if an exception occurs?

Suggested change
self.perform_decoding = perform_decoding
yield
self.perform_decoding = old_setting
try:
self.perform_decoding = perform_decoding
yield
finally:
self.perform_decoding = old_setting
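For context, the suggested `try...finally` is the standard shape for a state-restoring `contextlib.contextmanager`; a standalone sketch (the class and method names here are illustrative, not the actual Iris ones):

```python
from contextlib import contextmanager


class DecodingState:
    """Illustrative stand-in for the wrapper class under review."""

    def __init__(self):
        self.perform_decoding = True

    @contextmanager
    def decoding(self, perform_decoding):
        # Save, set, and always restore -- even if the body raises.
        old_setting = self.perform_decoding
        try:
            self.perform_decoding = perform_decoding
            yield
        finally:
            self.perform_decoding = old_setting


state = DecodingState()
try:
    with state.decoding(False):
        raise RuntimeError("failure inside the context")
except RuntimeError:
    pass
assert state.perform_decoding is True  # reset despite the exception
```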

@property
def dimensions(self):
dimensions = self._contained_instance.dimensions
is_chardata = np.issubdtype(self._contained_instance.dtype, np.bytes_)
Contributor


We make this comparison several times in this class; would it be better as a property?
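A sketch of what that property might look like (a hypothetical stand-in class, not the real Iris wrapper):

```python
import numpy as np


class VariableWrapperSketch:
    """Illustrative stand-in -- not the actual Iris variable wrapper."""

    def __init__(self, dtype):
        self._dtype = np.dtype(dtype)

    @property
    def is_chardata(self):
        # Single definition of the repeated np.bytes_ check.
        return np.issubdtype(self._dtype, np.bytes_)


assert VariableWrapperSketch("S4").is_chardata
assert not VariableWrapperSketch("f8").is_chardata
```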

"""
with _GLOBAL_NETCDF4_LOCK:
new_group = self._contained_instance.createGroup(*args, **kwargs)
return GroupWrapper.from_existing(new_group)
Contributor


Can you remind me why the GroupWrapper class now needs to be accessed indirectly via self.__class__?

I understand why you replace the VARIABLE_WRAPPER references, e.g. on line 222, but I can't quite see why the GroupWrapper needed changing here.

return cf_mesh_name

def _set_cf_var_attributes(self, cf_var, element):
from iris.cube import Cube
Contributor


How is ruff not complaining about this inline import?!

if long_name is not None:
_setncattr(cf_var, "long_name", long_name)
# NB this bit is a nasty hack to preserve existing behaviour through a refactor:
# The attributes for Coords are created in the order units, standard_name,
Contributor


Why does this matter?
(I understand that it does, but I don't understand why!)

encoding = element.attributes.get("_Encoding", "ascii")
# TODO: this can fail -- use a sensible warning + default?
encoding = codecs.lookup(encoding).name
if encoding == "utf-32":
Contributor


Handle the be and le variants of UTF32 here, i.e. utf-32-be and utf-32-le?
I don't know how often you would see explicit endian encodings in the wild (probably hardly ever?), but we could code defensively here?

Suggested change
if encoding == "utf-32":
if "utf-32" in encoding:

Of course, in the explicit-endian versions the extra 4 bytes are not needed (they encode the endianness when using plain utf-32).
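For what it's worth, `codecs.lookup(...).name` already canonicalises aliases, but the explicit-endian variants keep distinct canonical names, so an equality test against "utf-32" misses them while the suggested membership test catches all three:

```python
import codecs

# Aliases collapse to one canonical name...
assert codecs.lookup("UTF32").name == "utf-32"
# ...but the explicit-endian variants stay distinct, so == "utf-32" misses them.
assert codecs.lookup("utf-32-be").name == "utf-32-be"
assert codecs.lookup("utf-32-le").name == "utf-32-le"

for name in ("utf-32", "utf-32-be", "utf-32-le"):
    assert "utf-32" in name  # the suggested membership test catches all three
```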

string_dimension_depth //= 4
encoding = element.attributes.get("_Encoding", "ascii")
# TODO: this can fail -- use a sensible warning + default?
encoding = codecs.lookup(encoding).name
Contributor


try block around this lookup?
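Something along these lines, perhaps; `codecs.lookup` raises `LookupError` for an unknown codec name (the helper name and fallback choice here are illustrative, not from the PR):

```python
import codecs
import warnings


def safe_encoding_name(encoding, default="ascii"):
    """Illustrative fallback for an unrecognised _Encoding attribute."""
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        warnings.warn(f"Unknown _Encoding {encoding!r}; assuming {default!r}.")
        return default


assert safe_encoding_name("UTF8") == "utf-8"
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    assert safe_encoding_name("not-a-codec") == "ascii"
```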

if encoding == "utf-32":
# UTF-32 is a special case -- always exactly 4 bytes per char, plus 4
string_dimension_depth += 4
else:
Contributor

@ukmo-ccbunney ukmo-ccbunney Mar 11, 2026


Are we not going to offer utf8 and utf16 here?

Also, UTF8 saving only works if you don't have any non-ascii characters in your data; otherwise the calculated string length is too small.
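Concretely (example string is mine): sizing the string dimension as one byte per character under-shoots as soon as a multi-byte utf-8 character appears:

```python
# If the saved string dimension is sized as one byte per character, any
# non-ascii char overflows it when encoded as utf-8.
s = "temp \u00b0C"                     # 7 characters, but "°" needs 2 utf-8 bytes
width_from_chars = len(s)              # 7 -- the "calculated string length"
width_needed = len(s.encode("utf-8"))  # 8 -- too big to fit
assert width_from_chars == 7
assert width_needed == 8
```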
