Chunked writing + compression proof of concept #673

cstoeckl wants to merge 1 commit into jamesmudd:master from
Conversation
jamesmudd
left a comment
Looks like there are a lot of unrelated changes at the moment, e.g. reverting the copyright year and whitespace changes.
Could you merge master into this and clean these up so the diff is smaller?
Force-pushed from 605b8ce to bce0499
Cleaned up pull request to remove formatting-only changes. Sorry about the confusion, this is my first time collaborating using
jamesmudd
left a comment
Lots of great work here. I would consider breaking it up; here are a few things I consider separate:
- The implementation of btree writing
- The extension of the Filter interface to add writing, and the impl of deflate
- The addition of UFixed - this is a little debatable to me, as Java doesn't have unsigned types, so by definition anything you want to write must be signed. If there is a reason it would be nice to flag unsigned in the file we could consider it, but for a separate discussion IMO.

I would reconsider the use of object header v1 and btree v1. Just forward thinking, I don't really intend to support using object header v1. If you take a look at the spec for the latest format (i.e. object header v2), Appendix C https://support.hdfgroup.org/documentation/hdf5/latest/_f_m_t3.html#AppendixC, then you can see the supported chunk indexes; if you want btree then you would look at btree v2, but first I would look at single chunk.
So great you have got this to work, but I think there is quite a bit to clean up here. Consider breaking the change up into stages. I would probably suggest looking at single chunk or btree v2 first, to avoid the changes required for object header v1. To get something merged we would also want unit tests with decent coverage; there are quite a few examples for writing to follow, including read verification with h5dump to check the file reads with the HDF5 Group lib as well.
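For context, a minimal round-trip test in the style of the existing writing examples might look like the sketch below. It assumes jHDF's writable-file API (HdfFile.write / putDataset); exact packages and names may differ, and the real tests additionally verify the written file with h5dump.

import static org.junit.jupiter.api.Assertions.assertArrayEquals;

import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.Test;
// jHDF imports assumed; exact packages may differ:
import io.jhdf.HdfFile;
import io.jhdf.WritableHdfFile;

class ChunkedWritingTest {

    @Test
    void writtenDatasetReadsBackIdentically() throws Exception {
        Path file = Files.createTempFile("jhdf-chunked", ".hdf5");
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        // Write phase - put a small dataset into a new file.
        try (WritableHdfFile out = HdfFile.write(file)) {
            out.putDataset("ints", data);
        }

        // Read phase - read the dataset back and compare with the input.
        try (HdfFile in = new HdfFile(file)) {
            int[] readBack = (int[]) in.getDatasetByPath("/ints").getData();
            assertArrayEquals(data, readBack);
        }
    }
}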
  */
  byte[] decode(byte[] encodedData, int[] filterData);

+ byte[] encode(byte[] data, int[] filterData);
jamesmudd:

Think we will need a default impl here that throws an exception, as this is public API and people might have implemented custom filters.
cstoeckl:
I've reverted this change and only added the encode method to the DeflatePipelineFilter.
Once you have decided on the right public API for the encode method for all filters, we can adjust this method accordingly.
jamesmudd:

I think the API itself is good, just that we need to make it compatible with existing Filter implementations. If we add a default, e.g.
default byte[] encode(byte[] data, int[] filterData) {
    throw new UnsupportedHdfException(String.format("[%s (%d)] does not support encoding", getName(), getId()));
}
Then all the existing code will compile. The encode implementations can be added gradually as people want to support those filters, but all existing filters only used for reading will keep working.
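For the deflate case specifically, the encode override could be a thin wrapper around java.util.zip.Deflater, roughly as sketched here. Reading the compression level from filterData[0] matches how the HDF5 deflate filter stores its client data, but treat the details as an assumption.

@Override
public byte[] encode(byte[] data, int[] filterData) {
    // First filter client value is the deflate level (assumption); fall
    // back to the zlib default if no client data is present.
    int level = filterData.length > 0 ? filterData[0] : Deflater.DEFAULT_COMPRESSION;
    Deflater deflater = new Deflater(level);
    deflater.setInput(data);
    deflater.finish();

    // Compress into a growable buffer; the 4 KiB chunk size is arbitrary.
    ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
        int written = deflater.deflate(buffer);
        out.write(buffer, 0, written);
    }
    deflater.end();
    return out.toByteArray();
}

(Imports needed: java.io.ByteArrayOutputStream and java.util.zip.Deflater.)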
cstoeckl:

Sounds like a very good approach. It's done and uploaded.
  private static final Logger logger = LoggerFactory.getLogger(FilterManager.class);

- private static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
+ public static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
jamesmudd:

Hoping with some refactoring we can avoid the need to expose this.
cstoeckl:

I noticed that you prefer the getter/setter approach, which does not expose internal variables.
I've added a getFilter method, which keeps ID_TO_FILTER private.
/**
 * Retrieves a filter.
 *
 * @param filterId the ID of the filter to retrieve
 * @return the filter registered under that ID
 * @throws HdfFilterException if the filterId is not valid
 */
public static Filter getFilter(int filterId) {
    Filter filter = ID_TO_FILTER.get(filterId);
    if (filter == null) {
        // Guard so an unknown ID throws the documented exception
        // instead of an NPE on the logging call below.
        throw new HdfFilterException("No filter registered for ID " + filterId);
    }
    logger.info("Retrieved HDF5 filter '{}' with ID '{}'", filter.getName(), filter.getId());
    return filter;
}
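With that in place, callers resolve filters by ID instead of touching the map directly. A hypothetical usage, given that filter ID 1 is deflate in the HDF5 spec:

// Look up deflate (HDF5 filter ID 1) and encode a chunk with level 6,
// using the encode method discussed above (illustrative only).
Filter deflate = FilterManager.getFilter(1);
byte[] compressed = deflate.encode(chunkBytes, new int[] {6});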
// addFilter(new ByteShuffleFilter());
// addFilter(new FletcherChecksumFilter());
// addFilter(new LzfFilter());
// addFilter(new BitShuffleFilter());
// addFilter(new Lz4Filter());
jamesmudd:

Adding the default method should allow this to be reverted.
cstoeckl:
I only commented these extra filters out to simplify my development environment. I'll revert this in my next pull request. As I said, I'm still learning all these tools.
  }
  }

+ public BufferBuilder writeInts(int[] ints) {
jamesmudd:

Think this can be simplified:
public BufferBuilder writeInts(int[] ints) {
    for (int i = 0; i < ints.length; i++) {
        writeInt(ints[i]);
    }
    return this;
}
  }
  }

+ public BufferBuilder writeLongs(long[] longs) {
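The same simplification presumably applies to the writeLongs overload in this hunk, assuming a single-value writeLong counterpart exists alongside writeInt:

public BufferBuilder writeLongs(long[] longs) {
    for (int i = 0; i < longs.length; i++) {
        writeLong(longs[i]);  // delegate to the single-value writer
    }
    return this;
}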
  return signed;
  }

+ public void setSigned(boolean sig) {
jamesmudd:

The design I have up to now is to favour immutable objects, and I think I would like to stick to this where possible. We might need to introduce some kind of DatasetBuilder though, as we would want the ability to specify more options like filters, filter options, and chunk size.
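A hypothetical builder along those lines could expose the extra options as chained setters and produce an immutable object from build(). All names below are illustrative, not existing jHDF API:

// Illustrative sketch only - none of these types/methods exist in jHDF yet.
Dataset dataset = new DatasetBuilder("/data/ints")
        .data(intData)                          // the array to write
        .chunkDimensions(64, 64)                // chunk size
        .filter(deflateFilter, new int[] {6})   // filter + filter options
        .signed(false)                          // replaces the setSigned mutator
        .build();                               // returns an immutable dataset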
cstoeckl:
Understood. Given that you prefer not to use unsigned fixed points, I'll work on removing this feature from the pull request. I'll update the comment once I'm done.
- logger.debug("Reading implicit indexed dataset");
- chunkIndex = new ImplicitChunkIndex(layoutMessage.getAddress(), datasetInfo);
- break;
+ throw new UnsupportedHdfException("Implicit indexing is currently not supported");
jamesmudd:
This looks like a merge issue?
cstoeckl:

I'll look into the issues with the V1 vs. V2 headers. I agree it is preferable to write only the V2 versions, but I'm not sure that all applications that will read the H5 files support the new features. I agree tests are important and valuable, but I'm not familiar with the unit test system. I managed to switch to a V2 Superblock and use V2 ObjectHeaders. I'm not sure it is worth the effort to switch to the V2 BTree, since I don't have any examples/code that needs it. The edits are pushed to GitHub.
Force-pushed from b390964 to 55f3a1a
Force-pushed from 55f3a1a to 343354a