Chardata first working on feature branch #6975
pp-mo wants to merge 57 commits into SciTools:FEATURE_chardata from
Conversation
…Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
Rename; adding parts of old investigation; add temporary notes.
…or overlength writes.
…t_cf_var_data' function.
Looking pretty good. 👍🏼
I've got a few comments, questions and suggestions.
I have not looked at the tests yet - just the main Iris code. I thought it was worth submitting the review at this point so you can see the comments. I'll take a look at the tests next.
Also - remind me - what are we doing in the case where data is stored as a netCDF string type - i.e. the variable length string type? At the moment that just loads in as an object array in numpy. Were we just leaving that as-is? We can't write that kind of datatype in Iris.
E.g. a netcdf file like this:
netcdf varlen_str_nc {
dimensions:
len = 4 ;
variables:
string strarr(len) ;
data:
strarr = "a", "bb", "ccc", "dddd"
}
    # string width, depending on the (read) encoding used.
    encoding = self.read_encoding
    if "utf-16" in encoding:
        # Each char needs at least 2 bytes -- including a terminator char
AFAIK, there is no "terminator char"
UTF-16 and UTF-32 both start with an endian indicator (the byte-order mark), which is the 2-byte pattern 0xFF, 0xFE.
The order of those two bytes encodes the endian of the Unicode string.
The UTF-32 encoding has this padded out to 4 bytes.
You can see this by encoding an empty string:
"".encode("utf-16")
>> b'\xff \xfe'
"".encode("utf-32")
>> b'\xff \xfe \x00 \x00' # same, but padded with zeros for second two bytes.
The explicit -be and -le (big/little endian) versions of these encodings don't require this endian marker at the beginning of the string, as the endian is explicit. Similarly, there is no need for it in utf-8, as the endian doesn't matter (single-byte units).
So, I think the comment should say something like:

- # Each char needs at least 2 bytes -- including a terminator char
+ # Each char needs at least 2 bytes, but first two bytes are used for endian encoding
and something similar for UTF32.
I guess the only issue here is that with UTF-8 and UTF-16 encodings, if there are non-ascii characters in the string we will always over-estimate the string length. I can't see how we can avoid this, though.
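A quick standalone check of the BOM behaviour described above (nothing Iris-specific; the plain "utf-16"/"utf-32" codecs emit the marker in the platform's native byte order, so only its length is shown as certain here):

```python
# Plain utf-16/utf-32 prepend an endian marker (BOM); the endian-explicit
# -le/-be variants, and utf-8, do not.
for codec in ("utf-16", "utf-16-le", "utf-16-be", "utf-32", "utf-8"):
    encoded = "".encode(codec)
    print(f"{codec}: {len(encoded)} byte(s) for an empty string -> {encoded!r}")
```

Running this shows 2 bytes of BOM for utf-16, 4 for utf-32, and zero for all the others.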
    self.perform_decoding = perform_decoding
    yield
    self.perform_decoding = old_setting
Does this need putting in a try...finally block to ensure it is reset if an exception occurs?
-    self.perform_decoding = perform_decoding
-    yield
-    self.perform_decoding = old_setting
+    try:
+        self.perform_decoding = perform_decoding
+        yield
+    finally:
+        self.perform_decoding = old_setting
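A minimal standalone sketch of why the try/finally matters (the `Flag` class here is a hypothetical stand-in for the wrapper under review, not the Iris code):

```python
from contextlib import contextmanager


class Flag:
    # Hypothetical stand-in for the wrapper class with a settable flag.
    def __init__(self):
        self.perform_decoding = True

    @contextmanager
    def decoding(self, perform_decoding):
        old_setting = self.perform_decoding
        try:
            self.perform_decoding = perform_decoding
            yield
        finally:
            # Runs even if the with-block raises, so the flag is always restored.
            self.perform_decoding = old_setting


flag = Flag()
try:
    with flag.decoding(False):
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(flag.perform_decoding)  # True: restored despite the exception
```

Without the try/finally, the exception would propagate out of the `yield` and the restore line would never run, leaving the flag stuck at the temporary value.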
    @property
    def dimensions(self):
        dimensions = self._contained_instance.dimensions
        is_chardata = np.issubdtype(self._contained_instance.dtype, np.bytes_)
We make this comparison several times in this class, better as a property?
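A sketch of what such a property could look like, assuming only that the wrapper holds a `_contained_instance` with a numpy-compatible dtype (class and attribute names are illustrative, not the actual Iris code):

```python
import numpy as np


class VariableWrapper:
    # Minimal illustrative stand-in for the wrapper class.
    def __init__(self, contained_instance):
        self._contained_instance = contained_instance

    @property
    def is_chardata(self):
        # One place to test for byte-character data, instead of repeating
        # the dtype check in every method that needs it.
        return np.issubdtype(self._contained_instance.dtype, np.bytes_)


chars = VariableWrapper(np.array([b"abc"], dtype="S3"))
nums = VariableWrapper(np.array([1.0]))
print(chars.is_chardata, nums.is_chardata)  # True False
```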
| """ | ||
| with _GLOBAL_NETCDF4_LOCK: | ||
| new_group = self._contained_instance.createGroup(*args, **kwargs) | ||
| return GroupWrapper.from_existing(new_group) |
Can you remind me why the GroupWrapper class now needs to be accessed indirectly via self.__class__?
I understand why you replace the VARIABLE_WRAPPER references, e.g. on line 222, but I can't quite see why the GroupWrapper needed changing here.
        return cf_mesh_name

    def _set_cf_var_attributes(self, cf_var, element):
        from iris.cube import Cube
How is ruff not complaining about this inline import?!
        if long_name is not None:
            _setncattr(cf_var, "long_name", long_name)
        # NB this bit is a nasty hack to preserve existing behaviour through a refactor:
        # The attributes for Coords are created in the order units, standard_name,
Why does this matter?
(I understand that it does, but I don't understand why!)
    encoding = element.attributes.get("_Encoding", "ascii")
    # TODO: this can fail -- use a sensible warning + default?
    encoding = codecs.lookup(encoding).name
    if encoding == "utf-32":
Handle the be and le variants of UTF32 here, i.e. utf-32-be and utf-32-le?
I don't know how often you would see explicit endian encodings in the wild (probably hardly ever?), but we could code defensively here?
| if encoding == "utf-32": | |
| if "utf-32" in encoding: |
Of course in the explicit endian versions the extra 4 bytes is not needed (that's for encoding the endian when using plain utf-32)
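For reference, `codecs.lookup` normalizes encoding aliases to a canonical lowercase name, so the `in` test above would catch the endian-explicit variants (a standalone check, independent of the Iris code):

```python
import codecs

# codecs.lookup normalizes aliases (case, underscores) to a canonical
# name, so a substring test on that name catches utf-32, utf-32-le and
# utf-32-be together.
for alias in ("UTF32", "utf_32", "utf-32-be", "utf-32-le"):
    name = codecs.lookup(alias).name
    print(f"{alias} -> {name} | matches: {'utf-32' in name}")
```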
    string_dimension_depth //= 4
    encoding = element.attributes.get("_Encoding", "ascii")
    # TODO: this can fail -- use a sensible warning + default?
    encoding = codecs.lookup(encoding).name
try block around this lookup?
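One possible shape for that (a hedged sketch: the helper name and the warning text are illustrative, and falling back to "ascii" simply mirrors the existing default):

```python
import codecs
import warnings


def safe_encoding_name(encoding, default="ascii"):
    # Hypothetical helper: codecs.lookup raises LookupError for an
    # unrecognised codec name, so fall back to a default with a warning.
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        warnings.warn(f"Unknown _Encoding {encoding!r}; assuming {default!r}.")
        return default


print(safe_encoding_name("UTF8"))         # utf-8
print(safe_encoding_name("not-a-codec"))  # ascii (after a warning)
```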
| if encoding == "utf-32": | ||
| # UTF-32 is a special case -- always 4 exactly bytes per char, plus 4 | ||
| string_dimension_depth += 4 | ||
| else: |
Are we not going to offer utf8 and utf16 here?
Also, UTF-8 saving only works if you don't have any non-ascii characters in your data; otherwise the calculated string length is too small.
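The UTF-8 sizing problem above is easy to demonstrate in isolation: a width computed as number-of-characters under-sizes the char dimension as soon as any character needs more than one byte.

```python
# Non-ascii characters take more than one byte each in UTF-8, so a
# character count under-estimates the required byte width.
for text in ("plain", "café", "naïve £"):
    encoded = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars -> {len(encoded)} bytes")
```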
Successor to #6898
Now targeting (new) FEATURE_chardata feature branch in main repo
TODO: please check that any remaining unresolved issues on #6898 are now resolved here