detect NetCDF and HDF5 files based on content #9117 #9152
Conversation
@pdurbin I've tested using example files from the net. Detection works with the standard extension, but with a different extension it fails. Is there some variation in the formats where the magic number test isn't valid? Any ideas? I tried examples from https://www.unidata.ucar.edu/software/netcdf/examples/files.html and https://people.sc.fsu.edu/~jburkardt/data/hdf/hdf.html . I can try more from different sources.

OK, I've found some that work when the extension changes, but some fail. Not sure if there is something optional in the format such that, in the field, folks produce non-conforming files. It looks like the starting bytes of the working files contain the expected magic numbers, while the failing ones do not, even though they came from the same place.

Update: I've done some more checking into the format and the library. My understanding of this issue and code is that we are simply invoking a file open using their library, and that results in a file type identification. For whatever reason, some of the example files I used for testing were not identifiable. I checked the magic numbers and got mixed results. According to the library docs, the magic number is typically the first 4 bytes, e.g. CDF followed by a version number. You can see it from the Unix command line:

od -An -c -N4 sresa1b_ncar_ccsm3-example.nc
od -An -c -N4 NEONDSTowerTemperatureDatahdf5.chk

However, according to the docs, the HDF format may have extra bytes before the magic number. Also, under "inferring format type" there is a set of checks in order of precedence, with file contents (presumably the magic number) first.

So, in conclusion, it is unclear to me how to absolutely validate either a NetCDF or HDF5 file for testing, as there is a range of versions and content. And even if some files fail to be identified, we are relying on a library provided by the owners of the project. I had mixed experience with uploading sample files.
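The `od` commands above amount to reading and printing a file's leading bytes. For reference, a minimal Java sketch of the same check (the class name and output format are my own; this is not code from the PR):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class MagicDump {
    // Print the first n bytes of a file, similar to `od -An -c -Nn file`:
    // printable ASCII bytes as characters, everything else as \xNN hex.
    static String firstBytes(String path, int n) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            byte[] buf = new byte[n];
            int read = f.read(buf);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < read; i++) {
                int b = buf[i] & 0xFF;
                sb.append(b >= 32 && b < 127 ? String.valueOf((char) b)
                                             : String.format("\\x%02x", b));
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. a classic NetCDF 3 file prints "CDF\x01"
        if (args.length > 0) {
            System.out.println(firstBytes(args[0], 4));
        }
    }
}
```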
I could try taking a small sample in a more controlled manner: check the first 4 bytes for conforming magic numbers, then check identification to see whether any files with legitimate magic numbers are not identified. That would provide some level of confidence in this identification mechanism. https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html#classic_format_spec

It appears that netCDF-4 is a specially formatted HDF5, i.e. a subset of the HDF5 spec. It won't read HDF5 files that don't adhere to those rules, though I'm not sure how much of a difference there might be in practice: https://docs.unidata.ucar.edu/netcdf-c/current/interoperability_hdf5.html and https://www.scivision.dev/netcdf4-vs-hdf5/#:~:text=The%20main%20idea%20behind%20NetCDF4,to%20NetCDF4%20ease%20of%20use.

So I'm guessing the library we're using may not work with some native HDF5 files but will with those compatible with the netCDF rules. Unexpectedly, when I downloaded a sampling of HDF and netCDF files from the Harvard repo, all of the HDF and netCDF files with the HDF magic number were identified when the extension changed, but none of the netCDF files with the CDF magic number were. The netCDF files that had the HDF magic number were identified as "network common data", whereas the HDF files were identified as "hierarchical data".

I'm done experimenting. I can only say the library integration works and identifies some files by content, but I'm not sure which ones as a general rule, or how many. For reference, searched Harvard Dataverse for:
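One reason a naive 4-byte check can miss valid HDF5 files is that, per the HDF5 file format spec, the 8-byte superblock signature (\x89 H D F \r \n \x1a \n) may sit at byte 0 or at offsets 512, 1024, 2048, and so on, doubling each time (this is the "extra bytes before the magic number" case mentioned above). A sketch of a scan that accounts for this, under those spec assumptions, and not the library's actual code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class Hdf5Probe {
    // 8-byte HDF5 superblock signature: \x89 H D F \r \n \x1a \n
    static final byte[] SIG = {(byte) 0x89, 'H', 'D', 'F', 0x0d, 0x0a, 0x1a, 0x0a};

    // Return the offset of the superblock signature, or -1 if not found.
    // The signature may sit at byte 0 or at 512, 1024, 2048, ... (doubling).
    static long findSuperblock(String path) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long offset = 0;
            byte[] buf = new byte[SIG.length];
            while (offset + SIG.length <= f.length()) {
                f.seek(offset);
                f.readFully(buf);
                if (Arrays.equals(buf, SIG)) return offset;
                offset = (offset == 0) ? 512 : offset * 2;
            }
            return -1;
        }
    }
}
```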
@kcondon thanks. In my testing, I only discovered the following three "file ID types" (NETCDF, NetCDF-4, and HDF5), detected from the NetCDF and HDF5 files I found. I'd be happy to test with additional files. As usual, you've been more thorough than I have! Let's please compare notes (and files!) when we get a chance. I've checked a few small files into the branch/pull request for testing, but most of the files I have are too big for that.
Also fixed the test so it doesn't rely on the file extension ".nc".
@kcondon as we discussed offline, I had a typo in the code such that "classic" NetCDF files (pre version 4) were not being detected correctly. I pushed a fix in 03188e7. I also tried to better document in the release note what people should expect. Here's what it says now:

"NetCDF and HDF5 files are now detected based on their content rather than just their file extension. Both "classic" NetCDF 3 files and more modern NetCDF 4 files are detected based on content. Detection for HDF4 files is only done through the file extension ".hdf", as before."
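For context on why "classic" files are a separate case: NetCDF 3 files start with the ASCII bytes CDF plus a version byte, while NetCDF 4 files are HDF5 containers and share the HDF5 signature, so magic bytes alone cannot tell NetCDF-4 apart from generic HDF5. A rough sketch of that distinction (illustrative only; the PR delegates real detection to netcdf-java):

```java
public class FormatGuess {
    // Rough classification by leading bytes. Illustrative only -- the PR
    // relies on netcdf-java for actual detection, which inspects content
    // beyond the magic number.
    static String classify(byte[] head) {
        if (head.length >= 4 && head[0] == 'C' && head[1] == 'D' && head[2] == 'F') {
            if (head[3] == 1) return "NetCDF classic";       // CDF\x01
            if (head[3] == 2) return "NetCDF 64-bit offset"; // CDF\x02
        }
        byte[] sig = {(byte) 0x89, 'H', 'D', 'F', 0x0d, 0x0a, 0x1a, 0x0a};
        if (head.length >= 8) {
            boolean hdf5 = true;
            for (int i = 0; i < 8; i++) {
                if (head[i] != sig[i]) hdf5 = false;
            }
            // A NetCDF-4 file is an HDF5 container, so the signature alone
            // cannot distinguish NetCDF-4 from plain HDF5.
            if (hdf5) return "HDF5 (possibly NetCDF-4)";
        }
        return "unknown";
    }
}
```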
What this PR does / why we need it:
Detection of NetCDF and HDF5 files is currently based on file extension rather than content. This PR adds content-based detection.
Which issue(s) this PR closes:
Special notes for your reviewer:
The Java library I'm adding (https://github.com/Unidata/netcdf-java) comes from the makers of NetCDF.
My thought was to introduce it here simply to detect NetCDF and HDF5 files; in subsequent pull requests we can make more use of it, such as for this issue:
Suggestions on how to test this:
Rename files to have no or non-standard file extensions to see if they are detected properly.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: