Skip to content

Conversation

@lidel
Copy link
Member

@lidel lidel commented Jan 17, 2026

Adds building blocks for reproducible CID generation across IPFS implementations, based on IPIP-499

Users of boxo can now

  • Choose how directory size is estimated for HAMT sharding: legacy link-based (SizeEstimationLinks), accurate block-based (SizeEstimationBlock), or disable size-based thresholds entirely
    (SizeEstimationDisabled)
  • Apply predefined import settings with UnixFS_v0_2015 or UnixFS_v1_2025 profiles from IPIP-499 via ApplyGlobals() and CidBuilder()
  • Resolve all (not just roots) symlinks to their target content during file traversal with SerialFileOptions.DereferenceSymlinks
  • Get consistent HAMT behavior with JS implementation (threshold comparison changed from >= to >) – 6707376

Related: IPIP-499, used by kubo#11148

lidel added 2 commits January 16, 2026 23:42
add configurable size estimation modes for determining when to switch
between BasicDirectory and HAMTDirectory:

- SizeEstimationLinks: legacy mode using len(name) + len(CID), default
- SizeEstimationBlock: full serialized dag-pb block size (accurate)
- SizeEstimationDisabled: link-count only via MaxLinks, ignores size

includes:
- HAMTSizeEstimation global for default mode
- WithSizeEstimationMode option for per-directory override
- helper functions for accurate protobuf size calculation

part of IPIP-499 UnixFS CID Profiles implementation.
introduces UnixFSProfile struct with predefined profiles:
- UnixFS_v0_2015: legacy CIDv0 settings (256 KiB chunks, 174 links/node)
- UnixFS_v1_2025: modern CIDv1 settings (1 MiB chunks, 1024 links/node)

profiles control file chunking, DAG width, and HAMT sharding parameters.
ApplyGlobals() sets all relevant global variables at once.

part of IPIP-499 implementation.
@lidel lidel changed the title feat(unixfS): configurable CID Profiles from IPIP-499 feat(unixfs): configurable CID Profiles from IPIP-499 Jan 17, 2026
@codecov
Copy link

codecov bot commented Jan 17, 2026

Codecov Report

❌ Patch coverage is 90.80460% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.35%. Comparing base (63b6a19) to head (56cf0ae).

Files with missing lines Patch % Lines
ipld/unixfs/io/directory.go 89.31% 8 Missing and 6 partials ⚠️
mfs/dir.go 50.00% 1 Missing and 1 partial ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1088      +/-   ##
==========================================
+ Coverage   61.06%   61.35%   +0.28%     
==========================================
  Files         264      265       +1     
  Lines       26221    26353     +132     
==========================================
+ Hits        16013    16169     +156     
+ Misses       8527     8505      -22     
+ Partials     1681     1679       -2     
Files with missing lines Coverage Δ
chunker/parse.go 53.96% <ø> (ø)
files/serialfile.go 77.55% <100.00%> (+12.11%) ⬆️
ipld/unixfs/io/profile.go 100.00% <100.00%> (ø)
mfs/ops.go 46.32% <100.00%> (+0.39%) ⬆️
mfs/dir.go 56.96% <50.00%> (-0.36%) ⬇️
ipld/unixfs/io/directory.go 77.09% <89.31%> (+5.88%) ⬆️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

lidel added 2 commits January 17, 2026 01:32
add SerialFileOptions and NewSerialFileWithOptions to control whether
symlinks are preserved as UnixFS symlink nodes (Data.Type=4) or
dereferenced and replaced with their target content during file
traversal.
Link.Size is already uint64, so the explicit conversions are redundant
and flagged by golangci-lint unconvert check.
HAMT sharding threshold comparison was historically implemented as
`>` in JS and `>=` in Go:

- JS: https://github.com/ipfs/helia/blob/005c2a7/packages/unixfs/src/commands/utils/is-over-shard-threshold.ts#L31
- Go: https://github.com/ipfs/boxo/blob/319662c/ipld/unixfs/io/directory.go#L438

This inconsistency meant a directory exactly at the 256 KiB threshold
would stay basic in JS but convert to HAMT in Go, producing different
CIDs for the same input.

This commit changes Go to use `>` (matching JS), so a directory exactly
at the threshold now stays as a basic (flat) directory. This aligns
cross-implementation behavior for CID determinism per IPIP-499.

Also adds SizeEstimationMode to MkdirOpts so MFS directories respect
the configured estimation mode instead of always using the global default.
@lidel lidel force-pushed the feat/ipip-499-unixfs-2025 branch from b844fc9 to 6707376 Compare January 19, 2026 04:37
lidel added 2 commits January 20, 2026 00:03
- fix trailing newline in directory_test.go
- add #1088 PR references to changelog entries
- files: fix nil filter check in serialFile.Size()
- unixfs/io: document thread-safety for global vars and ApplyGlobals
- changelog: move DefaultBlockSize to Changed section with breaking marker
// Thread safety: this function modifies global variables and is not safe
// for concurrent use. Call it once during program initialization, before
// starting any imports. Do not call from multiple goroutines.
func (p UnixFSProfile) ApplyGlobals() {
Copy link
Member Author

@lidel lidel Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ boxo already had globals (DefaultBlockSize, DefaultLinksPerBlock etc)

this is a very surgical way of having predefined UnixFS profiles in boxo itself that users can apply programmatically at startup of their app.

i dont like it tbh, but others in golang that we already use (like certmagic, or even net.DefaultResolver) have similar way of managing global defaults, so maybe its just me not being a fan of globals.

this is a compromise which delivers ability to set-and-forget profile on startup, but implemented in smallest amount of code that avoids breaking every user of existing APIs

@lidel lidel marked this pull request as ready for review January 20, 2026 02:14
@lidel lidel requested a review from a team as a code owner January 20, 2026 02:14
Copy link
Contributor

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a circular symlink test to show that it works when not dereferencing symlinks, otherwise it fails.

@lidel
Copy link
Member Author

lidel commented Jan 27, 2026

Triage:

  • now that we agreed on SizeEstimationBlock we can optimize to avoid unnecessary calculations and also add tests for edge cases like mode/mtime changes

@lidel lidel marked this pull request as draft January 27, 2026 16:12
lidel added 3 commits January 27, 2026 18:24
IPIP-499 block-bytes estimation improvements:

- add fast path optimization in needsToSwitchByBlockSize to skip
  expensive exact calculation when clearly above threshold (+256 margin)
- clarify documentation for linkSerializedSize, calculateBlockSize,
  cachedBlockSize, and SetStat methods with IPIP-499 context
- extract saveAndRestoreGlobals as package-level test helper

tests for mode/mtime block size overhead:
- verify exact protobuf overhead: mode (3 bytes), mtime seconds (8 bytes),
  nanoseconds (5 bytes), combined (16 bytes)
- verify cachedBlockSize accuracy after add/remove/replace operations
- verify linkSerializedSize matches actual link contribution
- verify HAMT threshold accounts for metadata overhead
- test fast path and near-boundary exact calculation behavior
…utable

consolidate fragmented size tracking into a single method and field:
- merge `cachedBlockSize` into `estimatedSize` (single field for all modes)
- replace `addToEstimatedSize`, `removeFromEstimatedSize`, and
  `updateCachedBlockSize` with unified `updateEstimatedSize(name, oldLink, newLink)`
- remove `SetSizeEstimationMode` from Directory interface; mode is now
  set only at creation time via `WithSizeEstimationMode` option

this prevents mode changes after directory creation which could cause
size tracking inconsistencies, and simplifies the calling code from
two method calls per operation to one.

test coverage:
- TestHAMTToBasicDowngrade: new test for HAMT->Basic threshold boundaries
  covering both SizeEstimationLinks and SizeEstimationBlock modes
- TestEstimatedSizeAccuracy: verifies size tracking after add/remove/replace
- TestProfileHAMTThresholdBehavior: upgrade threshold boundaries
- TestDynamicDirectorySwitch: Basic<->HAMT conversions
removes the need for protobuf serialization when checking HAMT threshold.
the block size is now computed arithmetically from protobuf field definitions:
- dataFieldSerializedSize(): UnixFS Data field (Type + optional mode/mtime)
- linkSerializedSize(): PBLink fields (Hash, Name, Tsize) + wrapper

this replaces the previous approach that serialized a temporary node copy
when near the threshold boundary. the arithmetic calculation is exact and
verified against actual serialization in TestDataFieldSerializedSizeMatchesActual.

calculateBlockSize() moved to test-only code in profile_test.go.
@lidel lidel force-pushed the feat/ipip-499-unixfs-2025 branch from 19fb7a1 to 6141039 Compare January 27, 2026 20:29
@lidel lidel marked this pull request as ready for review January 28, 2026 00:18
@lidel
Copy link
Member Author

lidel commented Jan 28, 2026

@gammazero @aschmahmann i've applied feedback from today's colo

innerSize := 0

// Type field (field 1, varint): Directory = 1
// tag = (1 << 3) | 0 = 8 (1 byte), value = 1 (1 byte)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not immediately clear what the | 0 = 8 part of this comment means.

// Can be modified to change the default for all subsequent chunker operations.
// For CID-deterministic imports, prefer using UnixFSProfile presets from
// ipld/unixfs/io/profile.go which set this and other related globals.
var DefaultBlockSize int64 = 1024 * 256
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be best to restrict the block size to always being a factor of ChunkSize? If so, then this could be changed to BlocksPerChunk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants