refactor: move pattern mining to library layer for better integration#284
refactor: move pattern mining to library layer for better integration#284
Conversation
This refactor moves pattern mining capabilities from the CLI layer into the core Mutator library, enabling programmatic use and better separation of concerns. Key changes: - Add Mode field to Options struct (default, discover, both) - Implement concurrent execution of discover and default modes - Move domain validation to util.go with proper error handling - Add Options.Validate() for centralized validation logic - Maintain backward compatibility (empty Mode defaults to "default") - Fix edge case in getNValidateRootDomain for all-empty domains Benefits: - Library users can now use pattern mining programmatically - Better code organization with clear separation of concerns - Concurrent execution improves performance in "both" mode - More robust error handling for edge cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
WalkthroughCLI simplified to a single execution path: Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as cmd/alterx
participant Mut as alterx.Mutator
participant Miner as patternmining.Miner
participant Gen as Generator
CLI->>Mut: Build & Validate alterx.Options
Note right of Mut: Init Miner if Mode == "discover" or "both"
CLI->>Mut: ExecuteWithWriter(output)
par Input generation
Mut->>Gen: Generate from inputs
and Pattern mining (async)
Mut->>Miner: Start mining
Miner-->>Mut: miningResult
Mut->>Gen: Generate from mined patterns
end
Gen-->>Mut: candidates
Mut->>Mut: Merge & dedupe
Mut-->>CLI: Write output
alt SaveRules requested
CLI->>Mut: SaveRules(path)
Mut->>Miner: Export rules
Miner-->>CLI: Rules persisted
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (2)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cmd/alterx/main.go(2 hunks)mutator.go(5 hunks)util.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
mutator.go (4)
internal/runner/runner.go (1)
Options(17-40)internal/patternmining/patternmining.go (3)
Miner(82-85)Result(47-50)NewMiner(88-93)config.go (1)
DefaultConfig(18-18)replacer.go (1)
Replace(19-27)
cmd/alterx/main.go (2)
mutator.go (1)
Options(29-60)internal/runner/runner.go (1)
Options(17-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test Builds (macOS-latest, 1.21.x)
- GitHub Check: Analyze (go)
- GitHub Check: Test Builds (windows-latest, 1.21.x)
Fixes critical bug where SaveRules was called before pattern mining completed, causing miningResult to be nil. Changes: - Move SaveRules block after ExecuteWithWriter call - Ensures mining goroutine completes before saving rules - Skips SaveRules in estimate mode (no execution happens) - Add explanatory comment about execution order requirement Resolves CodeRabbit review comment. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
cmd/alterx/main.go (1)
67-72: SaveRules correctly placed after execution.The past review comment flagged that
SaveRuleswas called before mining, causing it to always fail. The refactor now correctly callsSaveRulesafterExecuteWithWriter(line 61), ensuring thatm.miningResultis populated before attempting to save. Well done!
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cmd/alterx/main.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/alterx/main.go (2)
mutator.go (1)
Options(29-60)internal/runner/runner.go (1)
Options(17-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test Builds (macOS-latest, 1.21.x)
- GitHub Check: Test Builds (windows-latest, 1.21.x)
🔇 Additional comments (2)
cmd/alterx/main.go (2)
33-33: Verify the hardcoded MaxLength value.
MaxLengthis hardcoded to1000without explanation. Confirm whether this should be exposed as a configurable CLI option or if this default is intentional for all use cases.
61-72: Verify SaveRules behavior when ExecuteWithWriter fails.If
ExecuteWithWriterreturns an error (line 61), mining may not have completed successfully, yetSaveRules(line 69) still executes. Confirm whether incomplete mining results should be saved or ifSaveRulesshould be skipped whenerr != nil.Consider applying this diff to skip saving rules on execution failure:
// Execute mutator (handles all modes internally) if err = m.ExecuteWithWriter(output); err != nil { gologger.Error().Msgf("failed to execute alterx: %v", err) + return } gologger.Info().Msgf("Generated %d total unique subdomains", m.PayloadCount()) // Save rules if requested (must be after Execute to ensure mining is complete) if cliOpts.SaveRules != "" { if err := m.SaveRules(cliOpts.SaveRules); err != nil { gologger.Error().Msgf("failed to save rules: %v", err) } }
… mining This commit addresses timing accuracy and prevents resource exhaustion on complex inputs. The timing measurement now captures the full execution duration across parallel goroutines, and input validation rejects problematic domains (>5 levels or >10 tokens) to avoid blocking issues. Also improves user experience by removing duplicate logging and clarifying output messages. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Fixes GitHub Actions Build Test failures caused by concurrent read/write access to timeTaken field. The cleanup goroutine writes to timeTaken while the main goroutine reads it via Time() method, creating a race condition detected by Go's race detector. Solution: Use sync/atomic operations (StoreInt64/LoadInt64) for thread-safe access to timeTaken field, ensuring proper synchronization across goroutines without blocking overhead. Tested locally with race detector - all tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore(1 hunks)CLAUDE.md(6 hunks)cmd/alterx/main.go(2 hunks)internal/patternmining/patternmining.go(1 hunks)mutator.go(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cmd/alterx/main.go (2)
mutator.go (1)
Options(29-60)internal/runner/runner.go (1)
Options(17-40)
mutator.go (3)
internal/patternmining/patternmining.go (4)
Options(27-44)Miner(82-85)Result(47-50)NewMiner(88-93)internal/runner/runner.go (1)
Options(17-40)config.go (1)
DefaultConfig(18-18)
🪛 GitHub Actions: 🔨 Build Test
cmd/alterx/main.go
[error] 61-61: Go data race detected; application exited with race detector enabled. See stack traces for details.
mutator.go
[error] 467-467: DATA RACE detected in Mutator.Time() (read) during race-enabled run.
[error] 263-263: DATA RACE detected in Mutator.Execute.func3() (write) during race-enabled run.
[error] 286-286: DATA RACE detected in Mutator.ExecuteWithWriter() (concurrent access) during race-enabled run.
[error] 280-280: DATA RACE detected in Mutator.ExecuteWithWriter() (concurrent access) during race-enabled run.
| sum := 0 | ||
| for _, token := range tokens[0] { | ||
| sum += len(token) | ||
| } | ||
| // to avoid expensive computation skip any subdomain that can be tokenized into more than 10 tokens | ||
| if sum > 10 { | ||
| // ex: api1dev-home-us1..... basically even if subdomain levels are less than 5 but have too many | ||
| // seperators the vector length would become too long | ||
| gologger.Verbose().Msgf("Rejecting input: %s since it can be tokenized into more than 10 tokens", host) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Rejecting valid inputs due to wrong “token” metric
This branch aims to skip hosts that explode into >10 tokens, but it sums the character lengths of each token. Any realistic subdomain like payments-api.internal.example.com (8+3+8 characters) now exceeds the threshold and gets rejected even though it only has three tokens. That breaks pattern mining for the majority of real-world inputs.
Replace the character-sum with an actual token-count check so we only skip when len(tokens[0]) itself exceeds the limit, and update the log accordingly.
- sum := 0
- for _, token := range tokens[0] {
- sum += len(token)
- }
- // to avoid expensive computation skip any subdomain that can be tokenized into more than 10 tokens
- if sum > 10 {
- // ex: api1dev-home-us1..... basically even if subdomain levels are less than 5 but have too many
- // seperators the vector length would become too long
- gologger.Verbose().Msgf("Rejecting input: %s since it can be tokenized into more than 10 tokens", host)
+ tokenCount := len(tokens[0])
+ // to avoid expensive computation skip any subdomain that can be tokenized into more than 10 tokens
+ if tokenCount > 10 {
+ // ex: api1dev-home-us1..... basically even if subdomain levels are less than 5 but have too many
+ // separators the vector length would become too long
+ gologger.Verbose().Msgf("Rejecting input: %s since it has more than 10 tokens", host)
continue
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sum := 0 | |
| for _, token := range tokens[0] { | |
| sum += len(token) | |
| } | |
| // to avoid expensive computation skip any subdomain that can be tokenized into more than 10 tokens | |
| if sum > 10 { | |
| // ex: api1dev-home-us1..... basically even if subdomain levels are less than 5 but have too many | |
| // seperators the vector length would become too long | |
| gologger.Verbose().Msgf("Rejecting input: %s since it can be tokenized into more than 10 tokens", host) | |
| continue | |
| } | |
| tokenCount := len(tokens[0]) | |
| // to avoid expensive computation skip any subdomain that can be tokenized into more than 10 tokens | |
| if tokenCount > 10 { | |
| // ex: api1dev-home-us1..... basically even if subdomain levels are less than 5 but have too many | |
| // separators the vector length would become too long | |
| gologger.Verbose().Msgf("Rejecting input: %s since it has more than 10 tokens", host) | |
| continue | |
| } |
🤖 Prompt for AI Agents
In internal/patternmining/patternmining.go around lines 289 to 299, the code
incorrectly sums character lengths of tokens and rejects hosts when that
character sum is >10; change this to check the actual token count instead.
Replace the sum calculation and if-check with a simple count check: if
len(tokens[0]) > 10 { ... } (i.e., compare number of tokens, not sum of token
lengths), and update the log message to indicate rejection due to too many
tokens rather than too many characters. Ensure the continue behavior and verbose
log remain the same.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
mutator.go (1)
63-91: Mode validation and defaults look good, but remove the no-op assignment.The validation logic correctly defaults to "default" mode for backward compatibility and validates the mode string. Default values for mining parameters are reasonable.
However, lines 85-86 are a no-op since
NgramsLimitis already 0 by default:- if v.NgramsLimit == 0 { - v.NgramsLimit = 0 - }Consider either removing this entirely or adding a non-zero default if one is intended. Based on the comment "0 = no limit" in the patternmining.Options struct, the current zero value is intentional, making this assignment unnecessary.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
mutator.go(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
mutator.go (4)
internal/patternmining/patternmining.go (4)
Options(27-44)Miner(82-85)Result(47-50)NewMiner(88-93)internal/runner/runner.go (1)
Options(17-40)inputs.go (1)
Input(14-22)replacer.go (1)
Replace(19-27)
🔇 Additional comments (12)
mutator.go (12)
10-11: LGTM: Necessary imports for concurrent execution and pattern mining.The addition of
sync,sync/atomic, and the internalpatternminingpackage appropriately supports the new concurrent mode execution and atomic time tracking.Also applies to: 14-14
46-61: LGTM: Well-structured Options additions.The new fields support the three operational modes (default, discover, both) with appropriate mining parameters. The fields are well-documented and maintain backward compatibility.
98-98: Excellent fix for the race condition!Changing
timeTakenfromtime.Durationtoint64with atomic operations (used at lines 264 and 468) properly resolves the data race flagged in the previous review. The nanoseconds are stored atomically and converted back to duration when read.
101-103: LGTM: New fields appropriately support pattern mining.The addition of
rootDomain,miner, andminingResultfields enables the library to maintain mining state and results for programmatic access.
107-175: Well-structured refactoring with clear mode-based initialization.The refactored
New()function cleanly separates concerns:
- Validates options upfront (lines 108-110)
- Initializes mining components for discover/both modes (lines 116-137)
- Initializes default generation components for default/both modes (lines 139-172)
The mode-specific initialization ensures that "both" mode properly sets up both pipelines while maintaining backward compatibility when mode is unspecified. Error handling is appropriate throughout.
177-190: LGTM: Clear error handling and appropriate guards.The
SaveRules()method properly validates that mining is enabled and results are available before attempting to save. The error messages are clear and the success logging is helpful.
194-204: LGTM: Concurrent execution properly synchronized with atomic time tracking.The refactored
Execute()uses a WaitGroup to synchronize parallel mining and default generation goroutines. The atomic storage oftimeTakenat line 264 correctly addresses the race condition from the previous review.Also applies to: 261-265
213-216: Verify the intended behavior when mining fails.When mining fails, the error is logged and the goroutine returns without sending any results. This means:
- In "discover" mode: users get zero results with only a log message indicating failure
- In "both" mode: users still get default generation results, potentially not realizing mining failed
Is this graceful degradation intentional, or should mining failures in "discover" mode cause the entire operation to fail more explicitly? Consider whether the current logging-only approach provides sufficient visibility to users who might not check logs.
220-234: LGTM: Proper filtering of input domains from mined results.The
seenmap correctly filters out input domains from the generated subdomains (lines 227-230), preventing the regulator algorithm from returning input subdomains as new discoveries. The map is local to the mining goroutine, avoiding any concurrency issues.
238-259: LGTM: Default generation properly checks context for cancellation.The default generation goroutine correctly checks
ctx.Done()at line 248 before processing each pattern, allowing timely cancellation of the operation.
349-351: LGTM: EstimateCount correctly includes mining results.The addition of mining result counts (when available) to the overall estimate is correct and maintains accuracy across all modes.
467-470: LGTM: Atomic time access completes the race condition fix.The
Time()method now usesatomic.LoadInt64to safely read the concurrently-writtentimeTakenfield, completing the fix for the data race flagged in the previous review.
| go func() { | ||
| defer wg.Done() | ||
|
|
||
| // Run pattern mining | ||
| gologger.Info().Msgf("Running pattern mining on %d domains...", len(m.Options.Domains)) | ||
| result, err := m.miner.Mine() | ||
| if err != nil { | ||
| gologger.Error().Msgf("pattern mining failed: %v", err) | ||
| return | ||
| } | ||
| m.miningResult = result | ||
| gologger.Info().Msgf("Discovered %d patterns from input domains", len(result.Patterns)) | ||
|
|
||
| var seen = make(map[string]bool) | ||
| for _, sub := range m.Options.Domains { | ||
| seen[sub] = true | ||
| } | ||
| // In discover mode, only use mined patterns | ||
| generated := m.miner.GenerateFromPatterns(m.miningResult.Patterns) | ||
| for _, subdomain := range generated { | ||
| if seen[subdomain] { | ||
| // skip the input subdomains | ||
| // regulator algo has tendency to generate input subdomains as patterns | ||
| continue | ||
| } | ||
| seen[subdomain] = true | ||
| results <- subdomain | ||
| } | ||
| }() |
There was a problem hiding this comment.
🧩 Analysis chain
Add context cancellation support to the mining goroutine.
The mining goroutine runs a potentially long operation (m.miner.Mine() at line 212) but never checks if the context has been cancelled. In contrast, the default generation goroutine properly checks ctx.Done() at line 248.
This inconsistency means that if a user cancels the operation (via context cancellation), the default generation will stop, but mining will continue to consume resources until completion.
If m.miner.Mine() accepts a context parameter, pass it through. Otherwise, consider periodically checking the context or document this limitation. For example:
go func() {
defer wg.Done()
select {
case <-ctx.Done():
return
default:
}
// Run pattern mining
gologger.Info().Msgf("Running pattern mining on %d domains...", len(m.Options.Domains))
result, err := m.miner.Mine() // or m.miner.MineWithContext(ctx) if available
// ... rest of the logic
}()Can you verify whether Mine() supports context cancellation, or add periodic context checks within the mining loop?
---
🏁 Script executed:
```shell
#!/bin/bash
# Find mutator.go and examine the context around lines 207-235 and 248
fd -t f 'mutator\.go' | head -5
Length of output: 10
🏁 Script executed:
#!/bin/bash
# Examine the mining goroutine and context handling
fd -t f 'mutator\.go' -x cat {} | head -300 | tail -100Length of output: 2624
🏁 Script executed:
#!/bin/bash
# Search for the miner interface definition and Mine() method
rg -n "func.*Mine\(" --type=go -B2 -A5Length of output: 617
🏁 Script executed:
#!/bin/bash
# Search for ctx.Done() usage in mutator.go to understand the pattern
rg -n "ctx\.Done\(\)" --type=go -B3 -A3Length of output: 321
🏁 Script executed:
#!/bin/bash
# Look for the miner interface to understand available methods
rg -n "type.*[Mm]iner.*interface" --type=go -A10Length of output: 0
---
🏁 Script executed:
```shell
#!/bin/bash
# Find Miner type definition and all its methods
rg -n "type Miner " --type=go -A2
Length of output: 218
🏁 Script executed:
#!/bin/bash
# List all methods on Miner type in patternmining.go
rg -n "func \(m \*Miner\)" internal/patternmining/patternmining.goLength of output: 926
🏁 Script executed:
#!/bin/bash
# Find the Execute function signature in mutator.go
rg -n "func.*Execute" mutator.go | head -5Length of output: 132
🏁 Script executed:
#!/bin/bash
# Check function signature and context parameter
rg -n "func \(m \*Mutator\) Execute" mutator.go -A5Length of output: 442
🏁 Script executed:
#!/bin/bash
# Search for MineWithContext or any context-aware mining methods
rg -n "MineWithContext|miner\.Mine" --type=goLength of output: 47
🏁 Script executed:
#!/bin/bash
# Examine the Mine() implementation to understand its operations
sed -n '96,260p' internal/patternmining/patternmining.goLength of output: 4797
Add context cancellation support to the mining goroutine.
The mining goroutine runs a potentially long-running operation (m.miner.Mine()) without checking if the context has been cancelled, while the default generation goroutine properly checks ctx.Done() at line 248. The Mine() method doesn't currently accept a context parameter.
At minimum, add an early context check before the mining operation:
go func() {
defer wg.Done()
select {
case <-ctx.Done():
return
default:
}
// Run pattern mining
gologger.Info().Msgf("Running pattern mining on %d domains...", len(m.Options.Domains))
result, err := m.miner.Mine()
// ... rest of the logic
}()For a more complete fix, modify Mine() to accept a context parameter and periodically check it during the clustering loops to allow early termination on cancellation.
🤖 Prompt for AI Agents
In mutator.go around lines 207 to 235, the mining goroutine calls m.miner.Mine()
without honoring context cancellation; add an early ctx cancellation check
before starting mining (return immediately if <-ctx.Done()) and propagate a
context through the miner by changing Mine() to Mine(ctx context.Context) and
update all callsites so Mine periodically checks ctx.Done() inside its
long-running loops and returns promptly on cancellation; ensure the goroutine
uses the new signature and handles returned errors the same way.
Fixes incorrect validation logic that was summing character lengths instead of counting tokens. The previous code rejected valid subdomains like `payments-api.internal.example.com` because it summed character lengths (8+3+8=19) rather than counting actual tokens (3). Now properly counts the number of tokens across all subdomain levels by summing `len(tokenGroup)` for each group, preventing legitimate subdomains from being incorrectly rejected. Addresses CodeRabbit review feedback. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary
This PR refactors the pattern mining implementation to move it from the CLI layer into the core library layer, enabling programmatic usage and better separation of concerns.
Key Changes
default,discover, andbothmodesgetNValidateRootDomainfunction inutil.gowith proper error handling for edge casesOptions.Validate()method consolidates all validation logicModefield defaults to "default" mode, preserving existing behaviorBenefits
Testing
Breaking Changes
None - this is a pure refactoring that maintains full backward compatibility.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Validation
Documentation