detect NetCDF and HDF5 files based on content #9117 #9152

Merged
kcondon merged 3 commits into develop from
9117-file-type-detection
Nov 22, 2022

pdurbin (Member) commented Nov 8, 2022

What this PR does / why we need it:

Detection of NetCDF and HDF5 files is currently based on file extensions rather than content.

Which issue(s) this PR closes:

Special notes for your reviewer:

The Java library I'm adding ( https://github.com/Unidata/netcdf-java ) comes from the makers of NetCDF.

My thought was to introduce it here to simply detect NetCDF and HDF5 files and in subsequent pull requests we can make more use of it, such as this issue:

Suggestions on how to test this:

Rename files to have no or non-standard file extensions to see if they are detected properly.
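
To prepare such test files in bulk, a small helper along these lines could be used. This is only a sketch: the class name `ExtensionScrambler` and the method `copyWithExtension` are hypothetical and not part of this pull request. It copies a file into a target directory with its extension replaced or, when an empty string is passed, removed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of this PR): makes copies of sample files
// with a different extension, or none, so content-based detection can be tested.
class ExtensionScrambler {

    /**
     * Copies source into targetDir, replacing its extension with newExtension.
     * Pass "" to drop the extension entirely. Returns the path of the copy.
     */
    static Path copyWithExtension(Path source, Path targetDir, String newExtension)
            throws IOException {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // A leading dot (for example ".hidden") is treated as part of the name.
        String base = dot > 0 ? name.substring(0, dot) : name;
        String newName = newExtension.isEmpty() ? base : base + "." + newExtension;
        return Files.copy(source, targetDir.resolve(newName),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For example, `ExtensionScrambler.copyWithExtension(Path.of("sample.nc"), Path.of("out"), "")` would write `out/sample`, which can then be uploaded to check whether it is still detected.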

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

coveralls commented Nov 8, 2022

Coverage Status

Coverage increased (+0.002%) to 19.976% when pulling 224887b on 9117-file-type-detection into 948c0d0 on develop.

sekmiller removed their assignment Nov 15, 2022
kcondon self-assigned this Nov 15, 2022
kcondon (Contributor) commented Nov 15, 2022

@pdurbin I've tested with example files from the net. Detection works when a file keeps its standard extension but fails when I change the extension. Is there some variation in the formats where the magic number test isn't valid? Any ideas? I tried examples from https://www.unidata.ucar.edu/software/netcdf/examples/files.html and https://people.sc.fsu.edu/~jburkardt/data/hdf/hdf.html . I can try more from different sources.

OK, I've found some files that are still detected when the extension changes, but others fail. I'm not sure whether something in the format is optional, so that people in the field produce non-complying files. The working files start with the expected magic numbers; the failing ones do not, even though they came from the same place.

Update: I've looked further into the format and the lib. My understanding of this issue and the code is that we simply open the file with their library, and that yields a file type identification. For whatever reason, some of the example files I used for testing were not identifiable. I checked the magic numbers and got mixed results. According to the lib docs, the magic number is typically the first 4 bytes, such as CDF followed by a version number. You can see it from the Unix command line:

od -An -c -N4 sresa1b_ncar_ccsm3-example.nc
C D F 001

od -An -c -N4 NEONDSTowerTemperatureDatahdf5.chk
211 H D F

However, according to the docs, the HDF format may have extra bytes before the magic number. Also, the section on inferring the format type lists a set of checks in order of precedence, with file contents (presumably the magic number, etc.) coming first.
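
The magic number checks above can be sketched in Java roughly as follows. This is not the code in this pull request, which delegates detection to the netcdf-java library; the class name `MagicNumberSniffer` and the returned labels are made up for illustration. It assumes the classic NetCDF header is CDF followed by a version byte of 1, 2, or 5, and that the 8-byte HDF5 signature may sit at offset 0 or, when a user block precedes it, at 512, 1024, 2048, and so on.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative sketch only; labels and class name are hypothetical.
class MagicNumberSniffer {

    // 8-byte HDF5 signature: \211 H D F \r \n \032 \n
    private static final byte[] HDF5_SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1A, '\n'};

    /** Returns "NETCDF_CLASSIC", "HDF5", or "UNKNOWN". */
    static String sniff(Path file) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            byte[] head = new byte[8];

            // Classic NetCDF: 'C' 'D' 'F' followed by a version byte (1, 2, or 5).
            if (length >= 4) {
                in.seek(0);
                in.readFully(head, 0, 4);
                if (head[0] == 'C' && head[1] == 'D' && head[2] == 'F'
                        && (head[3] == 1 || head[3] == 2 || head[3] == 5)) {
                    return "NETCDF_CLASSIC";
                }
            }

            // HDF5: try offset 0, then 512, 1024, 2048, ... (doubling each time).
            for (long offset = 0; offset + 8 <= length;
                    offset = (offset == 0) ? 512 : offset * 2) {
                in.seek(offset);
                in.readFully(head, 0, 8);
                if (Arrays.equals(head, HDF5_SIGNATURE)) {
                    return "HDF5";
                }
            }
            return "UNKNOWN";
        }
    }
}
```

Because NetCDF-4 files are stored as HDF5, a check like this reports them as HDF5; telling the two apart requires looking further into the file, which is what the library does.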

So, in conclusion, it is unclear to me how to conclusively validate either a NetCDF or an HDF5 file for testing, since there is a range of versions and content. Even if some files fail to be identified, we are relying on a library provided by the owners of the project. I had mixed experience with uploading sample files. I could take a small sample in a more controlled manner: check the first 4 bytes for conforming magic numbers, then check identification to see whether any files with legitimate magic numbers are not identified. That would provide some level of confidence in this identification mechanism.
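
The controlled check described above could be automated with something like the following sketch. Everything here is hypothetical: `DetectionAudit` is not part of this pull request, and the `detector` parameter is a stand-in for whatever performs the real identification (for example, a wrapper around the library call). The method lists the sample files whose first 4 bytes look like a NetCDF or HDF magic number but which the detector fails to identify.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical audit helper: cross-checks magic numbers against a detector.
class DetectionAudit {

    /** True if the file starts with "CDF" or with "\211HDF". */
    static boolean hasKnownMagic(Path file) throws IOException {
        byte[] head = new byte[4];
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            if (in.length() < 4) {
                return false;
            }
            in.readFully(head);
        }
        boolean cdf = head[0] == 'C' && head[1] == 'D' && head[2] == 'F';
        boolean hdf = head[0] == (byte) 0x89 && head[1] == 'H'
                && head[2] == 'D' && head[3] == 'F';
        return cdf || hdf;
    }

    /**
     * Returns the samples that have a known magic number but for which the
     * detector (an assumed stand-in for the real identification) returns empty.
     */
    static List<Path> missedByDetector(List<Path> samples,
            Function<Path, Optional<String>> detector) throws IOException {
        List<Path> missed = new ArrayList<>();
        for (Path sample : samples) {
            if (hasKnownMagic(sample) && !detector.apply(sample).isPresent()) {
                missed.add(sample);
            }
        }
        return missed;
    }
}
```

An empty result on a reasonably varied sample would give some confidence in the mechanism; a non-empty one gives a concrete list of files to investigate.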

https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html#classic_format_spec
https://docs.unidata.ucar.edu/netcdf-c/current/md_internal.html#autotoc_md327

It appears that netCDF supports a specially formatted HDF5, a subset of the HDF5 spec. It won't read HDF5 files that don't adhere to the rules, though I'm not sure how much of a difference there might be: https://docs.unidata.ucar.edu/netcdf-c/current/interoperability_hdf5.html and https://www.scivision.dev/netcdf4-vs-hdf5/#:~:text=The%20main%20idea%20behind%20NetCDF4,to%20NetCDF4%20ease%20of%20use.

So, I'm guessing the lib we're using may not work with some native HDF5 files but will work with those that follow the netCDF rules.

Unexpectedly, when I downloaded a sampling of HDF and netCDF files from the Harvard repo, all the HDF files and the netCDF files with the HDF magic number were identified after the extension was changed, but none of the netCDF files with the CDF magic number were. The netCDF files that had the HDF magic number were identified as Network Common Data Form, whereas the HDF files were identified as Hierarchical Data Format. I'm done experimenting. I can only say that the lib integration works and that it identifies some files by content, but I'm not sure which ones as a general rule, or how many.

For reference, searched Harvard Dataverse for:
fileType:"Network Common Data Form"
"Network Common Data Form"
fileType:"Hierarchical Data Format"
"Hierarchical Data Format"

pdurbin self-assigned this Nov 18, 2022

pdurbin (Member, Author) commented Nov 21, 2022

@kcondon thanks. In my testing, I only discovered the following three "file ID types" (NETCDF, NetCDF-4, and HDF5) that were detected based on NetCDF and HDF5 files I found:

        switch (type) {
            case "NETCDF":
                return "application/netcdf";
            case "NetCDF-4":
                return "application/netcdf";
            case "HDF5":
                return "application/x-hdf5";

I'd be happy to test with additional files. As usual, you've been more thorough than I have! Let's please compare notes (and files!) when we get a chance. I've checked a few small files into the branch/pull request for testing but most of them are too big for this.
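
For anyone who wants to experiment with the mapping outside the codebase, the excerpt above can be completed into a self-contained method roughly like this. It is a sketch, not the code in this branch: the class and method names are invented, and returning an empty `Optional` for null or unrecognized types is an assumption about how unknown values could be handled.

```java
import java.util.Optional;

// Hypothetical standalone version of the mapping excerpt above.
class FileTypeIdMapper {

    /** Maps a "file ID type" string to a MIME type, or empty if unrecognized. */
    static Optional<String> toMimeType(String fileTypeId) {
        if (fileTypeId == null) {
            return Optional.empty();
        }
        switch (fileTypeId) {
            case "NETCDF":   // "classic" NetCDF
            case "NetCDF-4": // both NetCDF variants share one MIME type
                return Optional.of("application/netcdf");
            case "HDF5":
                return Optional.of("application/x-hdf5");
            default:
                // Assumption: unknown types yield no content-based result.
                return Optional.empty();
        }
    }
}
```

A caller could then fall back to extension-based detection whenever the result is empty.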

pdurbin (Member, Author) commented Nov 22, 2022

@kcondon as we discussed offline, I had a typo in the code such that "classic" NetCDF files (pre-version 4) were not being detected correctly. I pushed a fix in 03188e7.

I also tried to better document in the release note what people should expect. Here's what it says now:

"NetCDF and HDF5 files are now detected based on their content rather than just their file extension.

Both "classic" NetCDF 3 files and more modern NetCDF 4 files are detected based on content.

Detection for HDF4 files is only done through the file extension ".hdf", as before."

pdurbin removed their assignment Nov 22, 2022
kcondon merged commit 6b1ffa7 into develop Nov 22, 2022
kcondon deleted the 9117-file-type-detection branch November 22, 2022 19:35
pdurbin added this to the 5.13 milestone Nov 22, 2022
Development

Successfully merging this pull request may close these issues.

Improve file type detection of NetCDF and HDF5

4 participants