Read fixed size strings without padding characters by FlyingSamson · Pull Request #692 · ess-dmsc/h5cpp

FlyingSamson · 2026-03-20T13:07:06Z

This would be one way to overcome the current issue that when reading fixed-size strings into std::strings all padding characters are also returned.

This will certainly introduce a certain overhead compared to not trimming the read string (I did not perform any benchmarks though) which need to be contemplated.
On the other hand, in cases where the caller actually wants the trimmed string, forcing him to trim the string himself after it has been created might incur even larger costs.

Resolves #661

Also relevant for #215 and #224

FlyingSamson · 2026-03-20T14:44:02Z


    AND_GIVEN("an instance of a smaller string") {
      std::string write = "hell";
-      THEN("we have to resize the string to the appropriate size") {


Not entirely sure, why this was required in the first place.

From what I understand the code will write shorter strings just fine. This line merely ensured that in the REQUIRE_THAT below write also contained trailing \0 characters.

ggoneiESS · 2026-03-20T17:29:34Z

I don't quite have the mental capacity to deal with this on a Friday evening, but I would refer you to my comments on a different PR:

ess-dmsc/streaming-data-types#102 (comment)

variable-length datasets cannot be compressed
the data no longer exists contiguously (it necessarily becomes an array of pointers to strings, rather than just raw data)
And (academic but technical arguments)

heap storage requires more space than regular 'raw data' storage (i.e. how the HDF5 object exists in memory)
general reduction in I/O efficiency because it requires individual write operations for each data element rather than one write per dataset chunk (actually, chunking isn't allowed at all)
Performance is definitely at a premium V storage.

I found this via the HDF5 clinic - https://steven-varga.ca/blog/hdf5-fixed-vs-variable-benchmark/ and it provides a CPP file. It might be possible to incorporate into a filewriter test.

Surely if it's like to be a different length wouldn't it be better to use VarLengthString instead anyway?

yuelongyu

Hi @FlyingSamson,

Thanks for the PR.

The datatype::StringPad::NullTerm: I agree, it could be removed. However, the padding part, I think it would be good to stay as it is. It might be not the behavior use expected, user writes the string with space padding intentionally and reads it back with different length. For example, in some protocol, it requires fixed length of commands, so padding is needed. If a user save such a command in file and read it back, the user would expect the command is same as what it's written there.

jkotan

@FlyingSamson, thank you very much for your work. Is it possible for you to leave in the PR on the part with '\0' term (without padding part) so we could release it as well with v0.8.0. I don't think removing padding is good think to do. Is there any deeper reason to remove them?

ggoneiESS · 2026-03-24T11:09:56Z

If this isn't a use case for VarLengthString can I suggest we go slightly further and instead rename the proposed from_bufferto e.g. from_buffer_trim and use a flag (default = False) in from_buffer to call the proposed method? Then we don't have to worry about the effects downstream.

Particularly regarding

this will certainly introduce a certain overhead compared to not trimming the read string (I did not perform any benchmarks though) which need to be contemplated.

FlyingSamson · 2026-03-24T11:24:31Z

Thank you all for your feedback.

@ggoneiESS Regarding your comment

Surely if it's like to be a different length wouldn't it be better to use VarLengthString instead anyway?

Unfortunately I am not in control of the file layout containing these strings. In this particular case all attributes in the file are semantically of variable length but stored as fixed size length with length equal to the actual string length + 1 (for the trailing \0).

Regarding the cited comment: I think that reading a dataset into multiple std::strings will inevitably result in an "array of pointers to strings" (except if strings are sufficiently small for SSO).

@yuelongyu, @jkotan I certainly can remove the changes for padded string and only leave the part for null-terminated strings. However, the way I understand the inner workings of to_buffer the user could as well write a std::string of, say, lenght 3 into a fixed size string of length 10 in the hdf5 file without explicitly adding the padding to his std::string. Thus the argument

, user writes the string with space padding intentionally and reads it back with different length. For example, in some protocol, it requires fixed length of commands, so padding is needed. If a user save such a command in file and read it back, the user would expect the command is same as what it's written there.

might not apply universally. But I'm totally fine with leaving it as is.

@ggoneiESS Regarding

If this isn't a use case for VarLengthString can I suggest we go slightly further and instead rename the proposed from_bufferto e.g. from_buffer_trim and use a flag (default = False) in from_buffer to call the proposed method? Then we don't have to worry about the effects downstream.

Sure, I can have a look at that. However, I'm not sure yet, how this would be passed along from higher-level functions or at which point this decision would be made.

ggoneiESS · 2026-03-24T11:33:37Z

Possibly just around src/h5cpp/node/dataset.hpp:814 and src/h5cpp/attribute/attribute.hpp:272

yuelongyu · 2026-03-24T13:28:09Z

HI @FlyingSamson ,

Regarding the padding, I agree with you, this doesn't apply universally, actually removing or keeping padding is a debatable thing here, that's why I would prefer to keep the current behavior.

FlyingSamson · 2026-03-24T14:03:35Z

If we indeed pass this down from the user to from_buffer, should we keep the option also for trimming the padded strings?

That is, give the user the option to also trim strings with null-charcter or space padding.

jkotan · 2026-03-24T15:10:50Z

Hi @FlyingSamson and @ggoneiESS , sure for me we could keep removing of padding characters but not as a default behavior, i.e. we could create overloaded from_buffer method or with another name or with some addition default parameters (if it works with traits).
The point is the removing padding can remove during the reading then it was writer e.g. when written string ends with padding characters. E.g. you write " A B " with ' ' padding and then you can get after reading only " A B" (so you loose the information)

Of course we could skip removing padding

This reverts commit bd9ae04.

FlyingSamson · 2026-03-27T14:11:48Z

Unfortunately, I did not finish the implementation for datasets as well yet and will be out of office until mid of April. I will continue working on this when I'm back. Just so that you do not wonder if I do not reply immediately during the next weeks.

But please feel free to leave additional comments on the existing implementation if you have any.

FlyingSamson · 2026-04-20T10:29:26Z

Implementing the different code path for reading attributes into std::string was relatively straight forward. However, looking at the current implementation within node/dataset.hpp, I'm not quite sure how to best implements the same thing for datasets.

My main concern is, that while T in case of attributes was std::string it is now some container of something that might be std::string, making writing a explicit specialization like in the attribute case impossible without knowledge about how to retrieve the underlying type of the container.
On top adding the bool trim argument to all read functions would yield a lot of overloads, because they already have a argument with default value for dtpl.

I wonder if one should leave (at least) dataset as is and let users handle this on their own by writing a thin wrapper around whichever container they wish to use and implement FixedLengthStringTrait for that wrapper. At least I think that this should work, e.g., something like

struct TrimmedFixedLengthStringDataset {
  std::vector<std::string> data;
};
template<>
struct FixedLengthStringTrait<TrimmedFixedLengthStringDataset>{
  // define DataType
  // define BufferType
  // implement to_buffer
  // implement from_buffer
};

Obviously one could argue the same for reading attributes into std::string.

I guess we could even provide some template

template<typename T>
TrimmedFixedLengthStringT{
  T data;
};

and provide overloads for FixedLengthStringTrait for, e.g., std::vector<std::string>

template<>
struct FixedLengthStringTrait<TrimmedFixedLengthStringT<std::vector<std::string>>>{
  // implement to_buffer as before and from_buffer with trimming
};

Any thoughts on how I should proceed?

jkotan

Hello @FlyingSamson, your implementation of trimming read attributes looks OK. I'm also not sure what is the best way to implement trimming for datasets. I propose to dedicate the PR only for the attribute part and if anyone has idea and will how to do we can create another PR

FlyingSamson · 2026-04-24T07:24:32Z

That sounds reasonable to me. Then I will remove the two files regarding fixed size datasets again.

FlyingSamson added 5 commits March 20, 2026 10:43

Create test file with fixed-size string attributes

314643e

Implement tests describing expected behavior

672c546

Read only non-padding non-nullterminator chars into std::strings

fe32ffa

Fix other test depending on old behavior

bd9ae04

Make it c++11 compatible again

8d47494

FlyingSamson commented Mar 20, 2026

View reviewed changes

jkotan requested review from ggoneiESS, jkotan and yuelongyu March 20, 2026 16:35

yuelongyu reviewed Mar 23, 2026

View reviewed changes

jkotan requested changes Mar 24, 2026

View reviewed changes

jkotan mentioned this pull request Mar 26, 2026

Release 0.8.0 #698

Open

FlyingSamson added 2 commits March 27, 2026 13:58

Introduce function to switch on or off trimming

dc61ff6

Revert "Fix other test depending on old behavior"

187d480

This reverts commit bd9ae04.

jkotan self-requested a review April 2, 2026 11:35

jkotan approved these changes Apr 23, 2026

View reviewed changes

jkotan requested a review from yuelongyu April 23, 2026 13:42

yuelongyu approved these changes Apr 24, 2026

View reviewed changes

FlyingSamson force-pushed the fixed-size-strings branch from f16bb35 to 187d480 Compare April 24, 2026 07:27

jkotan merged commit cb4e81f into ess-dmsc:master Apr 27, 2026
37 checks passed

Conversation

FlyingSamson commented Mar 20, 2026

Uh oh!

FlyingSamson Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

ggoneiESS commented Mar 20, 2026

Uh oh!

yuelongyu left a comment

Choose a reason for hiding this comment

Uh oh!

jkotan left a comment

Choose a reason for hiding this comment

Uh oh!

ggoneiESS commented Mar 24, 2026

Uh oh!

FlyingSamson commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggoneiESS commented Mar 24, 2026

Uh oh!

yuelongyu commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FlyingSamson commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkotan commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FlyingSamson commented Mar 27, 2026

Uh oh!

FlyingSamson commented Apr 20, 2026

Uh oh!

jkotan left a comment

Choose a reason for hiding this comment

Uh oh!

FlyingSamson commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

FlyingSamson commented Mar 24, 2026 •

edited

Loading

yuelongyu commented Mar 24, 2026 •

edited

Loading

FlyingSamson commented Mar 24, 2026 •

edited

Loading

jkotan commented Mar 24, 2026 •

edited

Loading