Skip to content

Read fixed size strings without padding characters#692

Merged
jkotan merged 7 commits intoess-dmsc:masterfrom
FlyingSamson:fixed-size-strings
Apr 27, 2026
Merged

Read fixed size strings without padding characters#692
jkotan merged 7 commits intoess-dmsc:masterfrom
FlyingSamson:fixed-size-strings

Conversation

@FlyingSamson
Copy link
Copy Markdown
Contributor

This would be one way to overcome the current issue that when reading fixed-size strings into std::strings all padding characters are also returned.

This will certainly introduce a certain overhead compared to not trimming the read string (I did not perform any benchmarks though) which need to be contemplated.
On the other hand, in cases where the caller actually wants the trimmed string, forcing him to trim the string himself after it has been created might incur even larger costs.

Resolves #661

Also relevant for #215 and #224


AND_GIVEN("an instance of a smaller string") {
std::string write = "hell";
THEN("we have to resize the string to the appropriate size") {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure, why this was required in the first place.

From what I understand the code will write shorter strings just fine. This line merely ensured that in the REQUIRE_THAT below write also contained trailing \0 characters.

@ggoneiESS
Copy link
Copy Markdown
Member

I don't quite have the mental capacity to deal with this on a Friday evening, but I would refer you to my comments on a different PR:

ess-dmsc/streaming-data-types#102 (comment)

variable-length datasets cannot be compressed
the data no longer exists contiguously (it necessarily becomes an array of pointers to strings, rather than just raw data)
And (academic but technical arguments)

heap storage requires more space than regular 'raw data' storage (i.e. how the HDF5 object exists in memory)
general reduction in I/O efficiency because it requires individual write operations for each data element rather than one write per dataset chunk (actually, chunking isn't allowed at all)
Performance is definitely at a premium V storage.

I found this via the HDF5 clinic - https://steven-varga.ca/blog/hdf5-fixed-vs-variable-benchmark/ and it provides a CPP file. It might be possible to incorporate into a filewriter test.

Surely if it's like to be a different length wouldn't it be better to use VarLengthString instead anyway?

Copy link
Copy Markdown
Collaborator

@yuelongyu yuelongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @FlyingSamson,

Thanks for the PR.

The datatype::StringPad::NullTerm: I agree, it could be removed. However, the padding part, I think it would be good to stay as it is. It might be not the behavior use expected, user writes the string with space padding intentionally and reads it back with different length. For example, in some protocol, it requires fixed length of commands, so padding is needed. If a user save such a command in file and read it back, the user would expect the command is same as what it's written there.

Copy link
Copy Markdown
Collaborator

@jkotan jkotan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FlyingSamson, thank you very much for your work. Is it possible for you to leave in the PR on the part with '\0' term (without padding part) so we could release it as well with v0.8.0. I don't think removing padding is good think to do. Is there any deeper reason to remove them?

@ggoneiESS
Copy link
Copy Markdown
Member

If this isn't a use case for VarLengthString can I suggest we go slightly further and instead rename the proposed from_bufferto e.g. from_buffer_trim and use a flag (default = False) in from_buffer to call the proposed method? Then we don't have to worry about the effects downstream.

Particularly regarding

this will certainly introduce a certain overhead compared to not trimming the read string (I did not perform any benchmarks though) which need to be contemplated.

@FlyingSamson
Copy link
Copy Markdown
Contributor Author

FlyingSamson commented Mar 24, 2026

Thank you all for your feedback.

@ggoneiESS Regarding your comment

Surely if it's like to be a different length wouldn't it be better to use VarLengthString instead anyway?

Unfortunately I am not in control of the file layout containing these strings. In this particular case all attributes in the file are semantically of variable length but stored as fixed size length with length equal to the actual string length + 1 (for the trailing \0).

Regarding the cited comment: I think that reading a dataset into multiple std::strings will inevitably result in an "array of pointers to strings" (except if strings are sufficiently small for SSO).

@yuelongyu, @jkotan I certainly can remove the changes for padded string and only leave the part for null-terminated strings. However, the way I understand the inner workings of to_buffer the user could as well write a std::string of, say, lenght 3 into a fixed size string of length 10 in the hdf5 file without explicitly adding the padding to his std::string. Thus the argument

, user writes the string with space padding intentionally and reads it back with different length. For example, in some protocol, it requires fixed length of commands, so padding is needed. If a user save such a command in file and read it back, the user would expect the command is same as what it's written there.

might not apply universally. But I'm totally fine with leaving it as is.

@ggoneiESS Regarding

If this isn't a use case for VarLengthString can I suggest we go slightly further and instead rename the proposed from_bufferto e.g. from_buffer_trim and use a flag (default = False) in from_buffer to call the proposed method? Then we don't have to worry about the effects downstream.

Sure, I can have a look at that. However, I'm not sure yet, how this would be passed along from higher-level functions or at which point this decision would be made.

@ggoneiESS
Copy link
Copy Markdown
Member

Possibly just around src/h5cpp/node/dataset.hpp:814 and src/h5cpp/attribute/attribute.hpp:272

@yuelongyu
Copy link
Copy Markdown
Collaborator

yuelongyu commented Mar 24, 2026

HI @FlyingSamson ,

Regarding the padding, I agree with you, this doesn't apply universally, actually removing or keeping padding is a debatable thing here, that's why I would prefer to keep the current behavior.

@FlyingSamson
Copy link
Copy Markdown
Contributor Author

FlyingSamson commented Mar 24, 2026

If we indeed pass this down from the user to from_buffer, should we keep the option also for trimming the padded strings?

That is, give the user the option to also trim strings with null-charcter or space padding.

@jkotan
Copy link
Copy Markdown
Collaborator

jkotan commented Mar 24, 2026

Hi @FlyingSamson and @ggoneiESS , sure for me we could keep removing of padding characters but not as a default behavior, i.e. we could create overloaded from_buffer method or with another name or with some addition default parameters (if it works with traits).
The point is the removing padding can remove during the reading then it was writer e.g. when written string ends with padding characters. E.g. you write " A B " with ' ' padding and then you can get after reading only " A B" (so you loose the information)

Of course we could skip removing padding

@jkotan jkotan mentioned this pull request Mar 26, 2026
@FlyingSamson
Copy link
Copy Markdown
Contributor Author

Unfortunately, I did not finish the implementation for datasets as well yet and will be out of office until mid of April. I will continue working on this when I'm back. Just so that you do not wonder if I do not reply immediately during the next weeks.

But please feel free to leave additional comments on the existing implementation if you have any.

@jkotan jkotan self-requested a review April 2, 2026 11:35
@FlyingSamson
Copy link
Copy Markdown
Contributor Author

Implementing the different code path for reading attributes into std::string was relatively straight forward. However, looking at the current implementation within node/dataset.hpp, I'm not quite sure how to best implements the same thing for datasets.

My main concern is, that while T in case of attributes was std::string it is now some container of something that might be std::string, making writing a explicit specialization like in the attribute case impossible without knowledge about how to retrieve the underlying type of the container.
On top adding the bool trim argument to all read functions would yield a lot of overloads, because they already have a argument with default value for dtpl.

I wonder if one should leave (at least) dataset as is and let users handle this on their own by writing a thin wrapper around whichever container they wish to use and implement FixedLengthStringTrait for that wrapper. At least I think that this should work, e.g., something like

struct TrimmedFixedLengthStringDataset {
  std::vector<std::string> data;
};
template<>
struct FixedLengthStringTrait<TrimmedFixedLengthStringDataset>{
  // define DataType
  // define BufferType
  // implement to_buffer
  // implement from_buffer
};

Obviously one could argue the same for reading attributes into std::string.

I guess we could even provide some template

template<typename T>
TrimmedFixedLengthStringT{
  T data;
};

and provide overloads for FixedLengthStringTrait for, e.g., std::vector<std::string>

template<>
struct FixedLengthStringTrait<TrimmedFixedLengthStringT<std::vector<std::string>>>{
  // implement to_buffer as before and from_buffer with trimming
};

Any thoughts on how I should proceed?

Copy link
Copy Markdown
Collaborator

@jkotan jkotan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @FlyingSamson, your implementation of trimming read attributes looks OK. I'm also not sure what is the best way to implement trimming for datasets. I propose to dedicate the PR only for the attribute part and if anyone has idea and will how to do we can create another PR

@jkotan jkotan requested a review from yuelongyu April 23, 2026 13:42
@FlyingSamson
Copy link
Copy Markdown
Contributor Author

That sounds reasonable to me. Then I will remove the two files regarding fixed size datasets again.

@jkotan jkotan merged commit cb4e81f into ess-dmsc:master Apr 27, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fixed-size string attributes include \0 in resulting std::string

4 participants