Status: Shipped. The bash parser is implemented (v0.1.x); v0.2.0 adds the
PowerShell parser and the shared multi-shell surface.
Audience: Whoever (human or agent) works on ShellSyntaxTree.
Read this end-to-end before writing any code.
PowerShell support is specified separately in SPEC.POWERSHELL.md (v0.2.0);
this document is the canonical home of the public API, AST, sanitization
workflow, and consumer contract that PowerShell reuses.
This document specifies the public API, AST, grammar, verb tables, resolver semantics, and corpus contract for ShellSyntaxTree v0.1. The library is a focused bash command parser designed for security gate evaluators — tools that inspect agent-emitted shell commands to decide whether to allow, prompt for, or deny execution.
It is not a general-purpose shell interpreter. It does not execute, expand, or evaluate commands. It returns a structured AST that consumers walk to make decisions.
The original consumer is Netclaw's approval policy. The library is designed to be reusable beyond Netclaw — any tool that needs to reason about the shape of an agent-emitted bash command can consume it.
Goals:

- Parse bash commands into a structured AST with per-clause verbs, args, redirects, and compound operators.
- Extract paths a command operates on with per-verb knowledge of which positional args are paths vs flags vs literal values (`chmod 755 file` knows `755` is a mode).
- Honor `cd <dir> && cmd` propagation within a compound: `<dir>` counts as a path each subsequent command operates on.
- Recurse into `bash -c "<inner>"` so the inner command is parsed and its clauses surface to the consumer.
- Mark dynamic-content tokens (unresolved `$VAR`, unexpanded globs) so consumers don't misextract literal `$VAR/foo` as a path.
- Multi-shell-ready via the `IShellParser` interface: bash is the only v0.1 implementation; PowerShell and cmd are deferred to later versions without breaking the seam.
Non-goals:

- PowerShell parsing (deferred; interface seam is present).
- Windows cmd parsing (deferred).
- Command execution. The library never runs anything.
- Variable expansion. We mark dynamic tokens, never resolve them.
- Function definitions, heredoc body extraction, complex parameter expansion (`${var//pattern/replacement}`), arithmetic expansion.
- Command-substitution evaluation. `$(cmd)` and backtick `` `cmd` `` are recognized at the lex level and collapsed into a single `Kind=DynamicSkip, IsPath=false` arg per locked interpretation #2 (see `openspec/changes/archive/.../v0.1-locked-interpretations`). The surrounding clause stays parseable so hard-deny rules still fire on visible parts.
- Performance tuning beyond "fast enough to invoke per shell call without noticeable latency" (~1ms per typical input).
The package exposes a small surface from a single namespace, `ShellSyntaxTree`. Public types only:

```csharp
namespace ShellSyntaxTree;

/// <summary>
/// Parses shell command strings into structured ASTs.
/// </summary>
public interface IShellParser
{
    /// <summary>
    /// Parse the command. Always returns a ParsedCommand; sets
    /// <see cref="ParsedCommand.IsUnparseable"/> when the input cannot
    /// be tokenized (unbalanced quotes, etc.). Never throws on
    /// well-formed strings; throws ArgumentNullException on null input.
    /// </summary>
    ParsedCommand Parse(string command);
}

/// <summary>Bash implementation of IShellParser.</summary>
public sealed class BashParser : IShellParser
{
    public BashParser();
    public BashParser(BashParserOptions options);
    public ParsedCommand Parse(string command);
}

/// <summary>PowerShell implementation of IShellParser (v0.2.0). The
/// PowerShell grammar, tables, and resolver are specified in
/// SPEC.POWERSHELL.md.</summary>
public sealed class PwshParser : IShellParser
{
    public PwshParser();
    public PwshParser(PwshParserOptions options);
    public ParsedCommand Parse(string command);
}

/// <summary>Shell-neutral resolver configuration shared by every parser
/// (added v0.2.0). HomeDirectory / WorkingDirectory live here.</summary>
public abstract record ShellParserOptions { ... }

/// <summary>Configuration knobs for BashParser. As of v0.2.0 a sealed
/// record deriving from ShellParserOptions; the v0.1 object-initializer
/// shape is unchanged.</summary>
public sealed record BashParserOptions : ShellParserOptions;

/// <summary>Configuration knobs for PwshParser (v0.2.0). Empty: the
/// resolver knobs live on ShellParserOptions.</summary>
public sealed record PwshParserOptions : ShellParserOptions;

// The pre-v0.2.0 BashParserOptions body, now hoisted onto ShellParserOptions:
public abstract record ShellParserOptions
{
    /// <summary>
    /// User home directory used to expand `~` and `$HOME` tokens during
    /// resolution. Defaults to <see cref="Environment.SpecialFolder.UserProfile"/>.
    /// </summary>
    public string? HomeDirectory { get; init; }

    /// <summary>
    /// Working directory used to resolve relative path tokens during
    /// resolution. Defaults to the daemon-process cwd.
    /// </summary>
    public string? WorkingDirectory { get; init; }
}

// AST records: see §3.
public sealed record ParsedCommand { ... }
public sealed record Clause { ... }
public sealed record VerbChain { ... }
public sealed record Arg { ... }
public sealed record Redirect { ... }
public enum ArgKind { Literal, EnvVar, Glob, Tilde, DynamicSkip }
public enum RedirectDirection { In, Out, Append, ErrOut, ErrAppend }
public enum CompoundOperator { None, AndIf, OrIf, Sequence, Pipe }
```

That's the entire public API. Everything else is internal: the lexer, parser internals, verb tables, and resolver are all implementation detail.
The top-level result of parsing. Always returned (never null).

```csharp
public sealed record ParsedCommand
{
    /// <summary>The original input string, verbatim.</summary>
    public string Source { get; init; } = "";

    /// <summary>
    /// Top-level clauses, split on compound operators (&&, ||, ;, |).
    /// For a simple command, exactly one clause with Operator=None.
    /// </summary>
    public IReadOnlyList<Clause> Clauses { get; init; } = [];

    /// <summary>
    /// True when the parser could not produce a clean AST (unbalanced
    /// quotes, unparseable construct). When true, Clauses MAY be partial
    /// or empty. Consumers should route to safe-fail.
    /// </summary>
    public bool IsUnparseable { get; init; }

    /// <summary>
    /// Human-readable diagnostic when IsUnparseable=true; null otherwise.
    /// </summary>
    public string? UnparseableReason { get; init; }
}
```

One logical command within a compound. Each clause has its own verb chain, args, redirects, and the operator that joined it to the previous clause.
```csharp
public sealed record Clause
{
    /// <summary>
    /// The operator joining this clause to the previous one. The first
    /// clause in a ParsedCommand has Operator=None. Subsequent clauses
    /// carry the operator that preceded them in the source
    /// (e.g. `a && b` produces clauses [{None,a}, {AndIf,b}]).
    /// </summary>
    public CompoundOperator Operator { get; init; }

    /// <summary>The verb chain (see §3.3 and §6).</summary>
    public VerbChain Verb { get; init; } = new();

    /// <summary>
    /// All argument tokens after the verb chain, in source order. Includes
    /// flags and positional args. See <see cref="Arg.Kind"/> for token kind.
    /// </summary>
    public IReadOnlyList<Arg> Args { get; init; } = [];

    /// <summary>
    /// Redirect operators on this clause (>, >>, <, 2>, 2>>). Each entry
    /// includes direction and target path.
    /// </summary>
    public IReadOnlyList<Redirect> Redirects { get; init; } = [];

    /// <summary>
    /// True when this clause is wrapped in a subshell (parens). Subshells
    /// isolate cd state; see §9.
    /// </summary>
    public bool IsSubshell { get; init; }

    /// <summary>
    /// True when this clause is the result of recursing into a
    /// command-string wrapper: `bash -c "..."` / `sh -c "..."`, or (v0.2.0)
    /// PowerShell `pwsh -Command "..."` / `pwsh -EncodedCommand ...`. Useful
    /// for consumers that want to surface "this came from a wrapped
    /// invocation" in UI.
    /// </summary>
    /// <remarks>Renamed in v0.2.0; see RELEASE_NOTES.md and
    /// SPEC.POWERSHELL.md §3 for the old→new mapping.</remarks>
    public bool IsCommandStringWrapped { get; init; }
}
```

The verb of a clause. Multi-token to handle commands like `git push`, `docker compose up`, `dotnet ef migrations add`. Length is determined by the greedy verb-chain heuristic in §6.1: consecutive verb-like Word tokens from the start of the clause, transparently consuming flag-with-value pairs, with a 1-token carveout for FILE verbs.
```csharp
public sealed record VerbChain
{
    /// <summary>
    /// Verb tokens in source order. Empty when the clause has no verb
    /// (e.g. clause is just a redirect or an empty fragment).
    /// </summary>
    public IReadOnlyList<string> Tokens { get; init; } = [];

    /// <summary>
    /// The canonical, alias-resolved verb identity (added v0.2.0). Non-null
    /// only when the parser rewrote a built-in alias, e.g. `ls` → `Get-ChildItem`.
    /// Null for every bash clause. See SPEC.POWERSHELL.md §3.
    /// </summary>
    public string? CanonicalVerb { get; init; }

    /// <summary>
    /// True when the clause's command name is a dynamic token the parser
    /// cannot statically identify, e.g. `& $exe`, `& { ... }` (added v0.2.0).
    /// Always false for bash clauses. See SPEC.POWERSHELL.md §3.
    /// </summary>
    public bool IsDynamic { get; init; }

    /// <summary>Convenience: tokens joined with spaces.</summary>
    public string Joined => string.Join(" ", Tokens);
}
```

Note: The single-space string form `string.Join(" ", …)` is used (not the `char` overload `string.Join(' ', …)`) so the implementation compiles on both `netstandard2.0` and `net8.0`. The `char` overload is not available on `netstandard2.0`.
One argument token after the verb chain. Includes resolution state.
```csharp
public sealed record Arg
{
    /// <summary>Verbatim token from the source.</summary>
    public string Raw { get; init; } = "";

    /// <summary>
    /// Resolved value for path tokens: tilde expanded, env vars
    /// substituted, normalized to absolute path against
    /// BashParserOptions.WorkingDirectory. Null when Kind is not a path
    /// (Literal non-path / Glob / DynamicSkip).
    /// </summary>
    public string? Resolved { get; init; }

    /// <summary>Token kind. See <see cref="ArgKind"/>.</summary>
    public ArgKind Kind { get; init; }

    /// <summary>
    /// True when this token starts with '-' or '--' (a flag, not a
    /// positional arg).
    /// </summary>
    // String overload: the char overload of StartsWith is not available
    // on netstandard2.0.
    public bool IsFlag => Raw.StartsWith("-", StringComparison.Ordinal);

    /// <summary>
    /// True when this token is a path the clause operates on (per the
    /// per-verb pathArgs table; see §7). Set during parsing so consumers
    /// don't reapply per-verb rules.
    /// </summary>
    public bool IsPath { get; init; }

    /// <summary>
    /// True when this Arg is a synthetic attribution arg representing
    /// the working directory inherited from a preceding `cd`/`chdir`
    /// clause in the same compound. Default false. See §9 for
    /// propagation semantics.
    /// </summary>
    public bool IsCwdAttribution { get; init; }
}

public enum ArgKind
{
    /// <summary>Literal value (string, number, flag).</summary>
    Literal,

    /// <summary>Token containing an unresolved env var reference.</summary>
    EnvVar,

    /// <summary>Token containing glob metachars (* ? [).</summary>
    Glob,

    /// <summary>Token starting with ~ (tilde).</summary>
    Tilde,

    /// <summary>
    /// Token whose value cannot be safely resolved (unresolved env var,
    /// unexpandable glob). Consumers SHALL treat as "no value extracted"
    /// rather than using Raw as a literal path.
    /// </summary>
    DynamicSkip
}
```

```csharp
public sealed record Redirect
{
    public RedirectDirection Direction { get; init; }

    /// <summary>
    /// Redirect target. Normally a path resolved per Arg conventions
    /// (§8); for fd-dup / fd-close shorthand (`&N`, `&N-`, `&-`)
    /// the raw token is carried verbatim and IsDynamicSkip is true.
    /// </summary>
    public string Target { get; init; } = "";

    /// <summary>
    /// True when the target is opaque to path resolution: a dynamic
    /// token (env var, command substitution) or an fd-dup / fd-close
    /// form. Consumers MUST NOT treat Target as a path when this is true.
    /// </summary>
    public bool IsDynamicSkip { get; init; }
}

public enum RedirectDirection
{
    In,        // <
    Out,       // >
    Append,    // >>
    ErrOut,    // 2>
    ErrAppend  // 2>>
}
```

```csharp
public enum CompoundOperator
{
    None,      // first clause; no prior operator
    AndIf,     // &&
    OrIf,      // ||
    Sequence,  // ;
    Pipe       // |
}
```

Approximate BNF for what the parser accepts. Anything outside this grammar is unparseable (`ParsedCommand.IsUnparseable = true`).
```
command        := clause (compound_op clause)*
compound_op    := "&&" | "||" | ";" | "|" | NEWLINE
clause         := subshell | bash_c_wrapper | simple_clause
subshell       := "(" command ")"
bash_c_wrapper := ("bash" | "sh") "-c" QUOTED_STRING
simple_clause  := verb_chain arg* redirect*
verb_chain     := verb_like_word (FW_pair? verb_like_word)*
                  // greedy walk per §6.1; FW_pair is a
                  // flag-with-value pair owned by word_0
                  // (transparent to the walk); stops at
                  // the first non-verb-like token. For
                  // word_0 ∈ FileVerbs, exactly 1 token.
arg            := word | flag | quoted_string
flag           := "-" letter+ | "--" word
redirect       := redirect_op target
redirect_op    := ">" | ">>" | "<" | "2>" | "2>>"
target         := word | quoted_string
word           := non-whitespace, non-operator characters
quoted_string  := single-quoted | double-quoted
```
Notes:

- Whitespace between tokens is one or more spaces or tabs.
- A bare newline outside quotes, heredoc bodies, line continuations, and `$(...)` / backtick substitutions is a statement separator, semantically equivalent to `;`, producing `CompoundOperator.Sequence`. Consecutive newlines, leading and trailing newlines, and a newline immediately following a compound operator all collapse: they never yield an empty clause. The newline after a heredoc terminator likewise separates the heredoc's clause from what follows.
- `\` followed by a newline is a line continuation (treat as whitespace).
- Bash line comments (`#` at a word boundary through end-of-line) are whitespace-equivalent at the lexer level: they emit a Comment token for source fidelity but are filtered alongside Whitespace by the parser, so they do not appear in the grammar. See §5 "Comment handling" for boundary rules.
- `\` before a metachar inside a double-quoted string escapes the metachar.
- Single-quoted strings preserve all bytes literally, with no escape processing.
- Heredocs (`<<EOF ... EOF`) are recognized as a redirect operator but the body is skipped (not extracted). The clause containing the heredoc parses normally with the heredoc body removed.
- Redirect targets matching the POSIX fd-dup / fd-close shorthand (`&N`, `&N-`, or `&-`, where `N` is one or more decimal digits) are NOT path-resolved. The parser carries the raw token (e.g. `&1`) on `Redirect.Target` and sets `Redirect.IsDynamicSkip = true`. This prevents `2>&1` from being incorrectly resolved to `<cwd>/&1`.
- Function definitions, `for`/`while`/`do`/`done`/`then`/`fi`/`case`/`esac` control-flow keywords, and arithmetic expansion `$(( ... ))` cause `IsUnparseable = true`. We don't support these in v0.1.
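The fd-dup / fd-close rule in the notes above can be sketched as a single predicate. This is a minimal illustration, not the library's implementation: the class and method names (`RedirectTargetSketch`, `IsFdShorthand`) are hypothetical, and the only thing taken from the spec is the accepted shape (`&N`, `&N-`, `&-`).

```csharp
using System.Text.RegularExpressions;

public static class RedirectTargetSketch
{
    // &N, &N-, or &- where N is one or more decimal digits.
    // \z (not $) so a trailing newline does not sneak through.
    private static readonly Regex FdShorthand =
        new Regex(@"^&(?:[0-9]+-?|-)\z", RegexOptions.CultureInvariant);

    /// <summary>
    /// True when a redirect target is fd-dup / fd-close shorthand and
    /// therefore must be carried raw instead of being path-resolved.
    /// </summary>
    public static bool IsFdShorthand(string target) =>
        FdShorthand.IsMatch(target);
}
```

A parser following the note would check this predicate before resolving a redirect target, and on a match set `Redirect.IsDynamicSkip = true` and leave `Target` verbatim.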
The lexer produces tokens consumed by the parser. Token kinds:

- WORD: sequence of non-whitespace, non-operator, non-quote chars. Example: `git`, `/etc/foo`, `--force`, `~/path`, `$VAR`. Simple parameter expansion `${VAR}` (no `//` slash) is absorbed into a Word token; the resolver in §8 decides `Kind`.
- QUOTED_STRING: single- or double-quoted string. The lexer strips the quote delimiters from the token value. Example: `"hello world"` becomes the token value `hello world`.
- OPERATOR: `&&`, `||`, `;`, `|`, `>`, `>>`, `<`, `2>`, `2>>`, `(`, `)`, `<<`, `<<-`.
- WHITESPACE: one or more spaces, tabs, or newlines (newlines inside a skipped heredoc body are not tokenized). A whitespace run that contains a newline, including the newline after a heredoc terminator, is flagged as a statement separator; the parser retains those tokens past `FilterSignificant` and splits clauses on them per §4. A pure space/tab run carries no flag and is discarded after splitting.
- CONTINUATION: `\` + `\n`. Treated as whitespace.
- OPAQUE_SUBSTITUTION: `$(cmd)` or backtick `` `cmd` ``. The full substitution slice (including delimiters) becomes a single token. Boundary tracking handles nested same-kind regions, nested quotes, and `\X` escapes via a shared opaque-region scanner. The parser consumes this token as `Arg{ Kind=DynamicSkip, IsPath=false, Resolved=null }` per locked interpretation #2.
- UNPARSEABLE_SENTINEL: `$((expr))` arithmetic expansion or `${var//pat/repl}` complex parameter expansion. The lexer skips past the matching close (`))` or `}` respectively) and emits a sentinel whose reason names the rejected construct. The parser consumes this token by setting outer `ParsedCommand.IsUnparseable = true` (see §11).
- COMMENT: `#` at a word boundary (start of input, or preceded by whitespace, a newline, an operator, or any other lexer-recognized boundary) starts a line comment running to (but not including) the next newline. The lexer emits a single Comment token covering the `#` and the comment text, for source fidelity. The parser drops Comment tokens in `FilterSignificant` alongside Whitespace and Continuation; comments produce no clauses, args, redirects, or flags. See "Comment handling" below for boundary rules.
Quote handling:

- Single quotes `'...'` preserve bytes literally. No escape processing, no variable expansion. Anything inside is one token.
- Double quotes `"..."` preserve whitespace but allow:
  - `\"` escapes the closing quote.
  - `\\` escapes a backslash.
  - `\$` escapes a dollar sign.
  - `$VAR` and `${VAR}` are recognized as env var references but not expanded; the token is marked `ArgKind.EnvVar` (or `DynamicSkip` if resolution would be required for path classification).
- Unbalanced quotes → `IsUnparseable = true` with reason `"unbalanced quote at position N"`.

Escape handling:

- `\X` outside quotes: removes the backslash, takes X literally. Example: `echo \$HOME` produces token `$HOME` with `ArgKind.Literal`.
- `\X` inside double quotes: only `\"`, `\\`, `\$`, and `\`+newline are recognized escape sequences. Other backslashes are preserved literally.

Operators terminate the current token. cd /tmp&&ls lexes as
[cd, /tmp, &&, ls] — no whitespace required around operators. The lexer
must handle this.
Comment handling:

- An unquoted `#` that appears at a word boundary starts a comment that runs to (but does not include) the next newline. A word boundary is: start of input, or the position immediately after a whitespace run, a newline, an operator (`&&`, `||`, `;`, `|`, `>`, `>>`, `<`, `2>`, `2>>`, `(`, `)`, `<<`, `<<-`), a quoted string, or an opaque substitution. Equivalently: `#` is comment-start everywhere the outer lexer dispatch loop sits, because every other lexer rule has already consumed its territory before `#` is considered.
- `#` inside single or double quotes is a literal character (no comment).
- `#` in the interior of an unquoted word (e.g. `abc#def`) is a literal character. `ReadWord` consumes the whole word before the outer loop can see the embedded `#`; there is no re-scanning.
- `\#` (backslash-escaped `#` outside quotes) is consumed by the normal escape rule: the backslash is dropped and `#` becomes a regular word character. Example: in `cmd \#abc`, the second word is the single Word token `#abc`.
- The terminating newline is not consumed by the Comment token. It survives as a Whitespace token, preserving statement-boundary semantics for the parser (see §4).
- A Comment token's `Value` is empty (matching `Whitespace` / `Continuation`); `SourceStart` / `SourceLength` identify the slice including the leading `#` so callers that need the literal text can recover it from the original input span.
- Effect on parsing: comment-only input parses to `Clauses = []`, `IsUnparseable = false`, mirroring empty-input behavior. A comment leading, trailing, or interleaved with a clause contributes no tokens to the verb chain, args, or redirects of any clause.
These are data, not logic. Implement as static readonly collections.
Per issue #27 (locked in v0.1.4-alpha), the parser does not consult a
static arity table. Instead, it walks consecutive verb-like Word tokens
from the start of the clause and stops at the first token that doesn't
look like a subcommand. This naturally scales to unknown CLIs
(freshdesk ticket list, kubectl get pods, dotnet ef migrations add)
without curated table entries.
A token is "verb-like" when all of these hold:

- `Kind == BashTokenKind.Word` (quoted strings are values, never verbs at index ≥ 1).
- Length is in `[1, 64]` characters.
- First character is an ASCII lowercase letter `[a-z]`.
- Remaining characters are drawn from `[a-z0-9._-]` only.

The predicate is implemented in BashVerbs.IsVerbLikeToken. The leading
lowercase requirement mirrors real CLI subcommand convention; the
character allow-list naturally excludes flags (-x starts with -),
paths (/, \, ~), env-var refs ($VAR), URLs (://), globs
(* ? [), and user-named identifiers (uppercase first char like
InitialCreate).
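
The character rules above translate almost directly into code. The sketch below is an illustration under stated assumptions rather than the actual `BashVerbs.IsVerbLikeToken`: it takes the token value as a plain string and assumes the caller has already checked `Kind == BashTokenKind.Word`; the name `VerbLikeSketch` is hypothetical.

```csharp
public static class VerbLikeSketch
{
    /// <summary>
    /// Shape check only: length 1..64, first char [a-z],
    /// remaining chars from [a-z0-9._-].
    /// </summary>
    public static bool IsVerbLike(string value)
    {
        if (string.IsNullOrEmpty(value) || value.Length > 64) return false;

        // Leading char must be an ASCII lowercase letter.
        if (value[0] < 'a' || value[0] > 'z') return false;

        for (int i = 1; i < value.Length; i++)
        {
            char c = value[i];
            bool allowed = (c >= 'a' && c <= 'z')
                        || (c >= '0' && c <= '9')
                        || c == '.' || c == '_' || c == '-';
            if (!allowed) return false;
        }
        return true;
    }
}
```

With this predicate `push`, `s3`, and `my-pod` are verb-like, while `InitialCreate`, `-x`, `/etc`, `$VAR`, and `755` are not.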
For a clause whose first token is a Word `firstVerb`:

1. Append `firstVerb` to the verb chain (it does not need to satisfy `IsVerbLikeToken`; bare commands like `Curl` or `_init` are still commands).
2. Iterate the remaining tokens in order. For each token `t`:
   - If `t.Kind != Word`: stop.
   - If `t` is a flag (`IsFlagWord`):
     - If `firstVerb` has a `FlagsWithValue` entry containing `StripEqualsValue(t.Value)` AND the next token is `Word` or `QuotedString` AND `t.Value` has no inline `=`: consume both as a flag-value pair, mark their indices for `consumedFlagValueIndices`, and continue walking.
     - Otherwise: stop.
   - If `firstVerb ∈ FileVerbs`: stop (1-token carveout; see below).
   - If `!IsVerbLikeToken(t)`: stop.
   - Otherwise: append `t.Value` to the verb chain and continue.

If the first token is a QuotedString (e.g. "git" push origin main),
emit a 1-token verb chain [firstVerb] and skip the walk entirely. Bash
treats the quoted form as a verb-identity carrier; remaining tokens are
arg-list material.
For verbs in §6.3 FileVerbs (file-mutation, file-read, editors,
compression, shell loaders, etc.), the verb chain stops at exactly one
token. The flag-with-value consumption still runs so the value of
curl -o file, tar -C /path, git -C /repo style flags picks up
IsPath=true via the FlagValueIsPath mechanism.
The carveout exists because FileVerbs use SPEC §7 per-verb positional
rules to classify args as paths. Without it, a bare-name target like
cat README would over-extract — README is shape-wise verb-like —
and lose the IsPath=true classification downstream consumers depend
on for zone-gate evaluation.
| Input | Verb chain | Args |
|---|---|---|
| `git push origin main` | `[git, push, origin, main]` | `[]` (over-extracts; see §6.1.1) |
| `git -C /repo worktree list --porcelain` | `[git, worktree, list]` | `[-C, /repo, --porcelain]` |
| `freshdesk ticket list --status open` | `[freshdesk, ticket, list]` | `[--status, open]` |
| `kubectl get pods my-pod` | `[kubectl, get, pods, my-pod]` | `[]` |
| `aws s3 cp src dst` | `[aws, s3, cp, src, dst]` | `[]` (bare-word path args over-extract) |
| `dotnet ef migrations add InitialCreate` | `[dotnet, ef, migrations, add]` | `[InitialCreate]` (stops at uppercase) |
| `cat /etc/passwd` | `[cat]` | `[/etc/passwd]` (FileVerb carveout) |
| `cat README` | `[cat]` | `[README]` (FileVerb carveout preserves IsPath) |
| `ls -la /tmp` | `[ls]` | `[-la, /tmp]` (FileVerb carveout) |
| `chmod 755 file` | `[chmod]` | `[755, file]` (digit-start kills walk; FileVerb anyway) |
| `echo hello` | `[echo, hello]` | `[]` (echo is not a FileVerb; over-extracts) |

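The walk and several rows of the table above can be reproduced with a small sketch. It is a simplification under loud assumptions: tokens are plain strings that are all treated as Word tokens (quoted strings, operators, and redirects are not modeled), `FileVerbs` and `FlagsWithValue` are tiny illustrative subsets of the spec's tables, and the names `VerbChainSketch` and `Walk` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class VerbChainSketch
{
    // Illustrative subsets only, not the full tables from §6.3 / §7.
    private static readonly HashSet<string> FileVerbs =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "cat", "ls", "chmod", "curl", "tar" };

    private static readonly Dictionary<string, HashSet<string>> FlagsWithValue =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
        {
            ["git"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "-C", "--git-dir", "--work-tree" },
        };

    private static bool IsVerbLike(string v)
    {
        if (v.Length == 0 || v.Length > 64 || v[0] < 'a' || v[0] > 'z') return false;
        for (int i = 1; i < v.Length; i++)
        {
            char c = v[i];
            if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                  || c == '.' || c == '_' || c == '-')) return false;
        }
        return true;
    }

    /// <summary>Returns the verb chain; all other tokens belong to the arg list.</summary>
    public static List<string> Walk(IReadOnlyList<string> words)
    {
        var chain = new List<string>();
        if (words.Count == 0) return chain;

        string first = words[0];
        chain.Add(first); // the first word is always the verb

        for (int i = 1; i < words.Count; i++)
        {
            string t = words[i];
            if (t.StartsWith("-", StringComparison.Ordinal))
            {
                // A known flag-with-value pair is transparent to the walk.
                bool takesValue =
                    FlagsWithValue.TryGetValue(first, out var flags)
                    && flags.Contains(t)
                    && !t.Contains("=")
                    && i + 1 < words.Count;
                if (!takesValue) break;
                i++; // skip the flag's value
                continue;
            }
            if (FileVerbs.Contains(first)) break; // 1-token carveout
            if (!IsVerbLike(t)) break;
            chain.Add(t);
        }
        return chain;
    }
}
```

Under these assumptions, `Walk` on `git -C /repo worktree list --porcelain` yields `[git, worktree, list]`, and `Walk` on `cat README` yields `[cat]`, matching the corresponding table rows.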
Clause.Verb is a convenience hint, not a security contract.
The parser deliberately over-extracts on bare-word args because no
syntactic rule disambiguates origin (a branch name) from worktree
(a subcommand verb) without per-CLI semantic knowledge — and we will
not bake per-CLI knowledge into the parser.
Consumers needing security-grade verb identification should pattern-prefix match against the raw token stream:

> A command matches an approval pattern `P` if and only if the first `len(P.verb_prefix)` tokens of the command equal `P.verb_prefix`.

This punts depth choice to the consumer (via the pattern they author) and accommodates the parser's over-extraction transparently:

- Pattern `git push *` (verb-prefix length 2) matches `git push origin main` because the first two command tokens are `[git, push]`.
- Pattern `kubectl get pods *` (verb-prefix length 3) matches `kubectl get pods my-pod` because the first three tokens are `[kubectl, get, pods]`.
- Auto-proposed patterns for unknown commands should default to the full extracted verb chain (greedy match), which is the security-correct default: a subsequent variation re-prompts rather than silently auto-grants. Operators wanting broader grants opt in explicitly.

False-negative (re-prompt) is recoverable. False-positive (silent destructive grant) is not. Narrow-by-default favors the recoverable failure mode.
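
A consumer-side prefix matcher for the rule above can be very small. This is a hedged sketch, not part of the library: `ApprovalPatternSketch` and `MatchesPrefix` are hypothetical names, the comparison is assumed to be ordinal and case-sensitive, and rejecting an empty prefix is this sketch's own conservative choice (the rule as stated would match vacuously).

```csharp
using System;
using System.Collections.Generic;

public static class ApprovalPatternSketch
{
    /// <summary>
    /// True when the first verbPrefix.Count command tokens equal verbPrefix.
    /// </summary>
    public static bool MatchesPrefix(
        IReadOnlyList<string> commandTokens,
        IReadOnlyList<string> verbPrefix)
    {
        // Assumption: an empty prefix never grants anything.
        if (verbPrefix.Count == 0) return false;
        if (commandTokens.Count < verbPrefix.Count) return false;

        for (int i = 0; i < verbPrefix.Count; i++)
        {
            if (!string.Equals(commandTokens[i], verbPrefix[i], StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

Because the match runs over the raw token stream, it gives the same answer whether the parser extracted `[git, push]` or `[git, push, origin, main]` as the verb chain.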
Verbs whose first non-flag positional arg becomes the cwd for subsequent clauses in the same compound (see §9).
```csharp
internal static readonly HashSet<string> CwdVerbs =
    new(StringComparer.OrdinalIgnoreCase)
    {
        "cd", "chdir", "popd", "pushd",
        "push-location", "set-location" // PowerShell idioms (forward-compat)
    };
```

Verbs whose positional args are paths. The default extraction rule is "all non-flag positional args after the verb chain are paths." Per-verb overrides in §7.
```csharp
internal static readonly HashSet<string> FileVerbs =
    new(StringComparer.OrdinalIgnoreCase)
    {
        // CWD verbs are also FILE verbs (their target is a path)
        "cd", "chdir", "popd", "pushd", "push-location", "set-location",
        // File mutation
        "rm", "cp", "mv", "mkdir", "rmdir", "touch", "ln",
        "chmod", "chown", "chgrp", "stat", "test",
        // Read
        "cat", "less", "more", "head", "tail", "grep", "rg",
        "find", "fd", "locate", "wc", "file",
        // Editors / text tools
        "sed", "awk", "vi", "vim", "nano", "emacs", "ed",
        // Compression
        "tar", "zip", "unzip", "gzip", "gunzip", "bzip2", "xz",
        // Network with file targets
        "curl", "wget", "scp", "rsync", "sftp",
        // Shell / interpreter loaders
        "bash", "sh", "zsh", "fish",
        "python", "python3", "node", "ruby", "perl", "php",
        // Diff / patch
        "diff", "patch", "cmp",
        // Listing
        "ls", "dir", "tree",
    };
```

The Windows native file utilities. As of v0.2.0 the PowerShell parser's
PwshVerbs.FileVerbs table consumes this reserved set
(type, copy, move, del, xcopy, robocopy, findstr) so a
native Windows file tool in a PowerShell command still gets path
classification. PowerShell cmdlet file verbs (Get-Content,
Remove-Item, Copy-Item, ...) are owned by SPEC.POWERSHELL.md §6.4 —
they are recognized by cmdlet shape and alias resolution, not by this
table. A Windows cmd parser remains deferred (§18).
```csharp
internal static readonly HashSet<string> CmdFileVerbs =
    new(StringComparer.OrdinalIgnoreCase)
    {
        "type", "copy", "move", "del", "erase", "ren",
        "xcopy", "robocopy", "findstr",
    };
```

The default rule for FILE verbs: every non-flag positional arg after the verb chain is a path. Per-verb overrides:

| Verb | Rule |
|---|---|
| `chmod` | First non-flag positional is mode (e.g. `755`, `+x`); rest are paths. |
| `chown` | First non-flag positional is `user[:group]`; rest are paths. |
| `chgrp` | First non-flag positional is group; rest are paths. |
| `ln` | All positionals are paths (source then target). |
| `find` | First positional is a path; rest are predicate args (skip). |
| `grep` | First positional is pattern; rest are paths. |
| `rg` | First positional is pattern; rest are paths. |
| `sed` | First positional is script; rest are paths. |
| `awk` | First positional is program; rest are paths. |
| `tar` | Action flag determines path roles; default to extracting all non-flag positionals as paths. |
| `curl`, `wget` | First positional is URL, not a path. `-o file` flag arg is a path. |
| `scp`, `rsync`, `sftp` | All positionals are paths (some remote). |
| `cd`, `chdir`, `pushd`, `popd` | First non-flag positional is the cwd target (a path). |
| Others (in FileVerbs, no override) | All non-flag positionals are paths. |

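Part of the override table above can be sketched as a positional filter. The code is illustrative only and makes several assumptions: the names `PathArgSketch` and `PathPositionals` are hypothetical, flag-with-value pairs are assumed to have been removed from `args` beforehand, and only the "first positional is not a path" rows and the `find` row are modeled (`curl`, `wget`, and `tar` are deliberately left out).

```csharp
using System;
using System.Collections.Generic;

public static class PathArgSketch
{
    // Verbs whose first non-flag positional is a mode, owner, group,
    // pattern, script, or program rather than a path.
    private static readonly HashSet<string> SkipFirstPositional =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "chmod", "chown", "chgrp", "grep", "rg", "sed", "awk" };

    /// <summary>
    /// Returns the non-flag positionals the table classifies as paths.
    /// </summary>
    public static List<string> PathPositionals(string verb, IReadOnlyList<string> args)
    {
        var paths = new List<string>();
        bool isFind = string.Equals(verb, "find", StringComparison.OrdinalIgnoreCase);
        int positionalIndex = 0;

        foreach (string arg in args)
        {
            if (arg.StartsWith("-", StringComparison.Ordinal)) continue; // flag

            int index = positionalIndex++;
            if (index == 0 && SkipFirstPositional.Contains(verb)) continue;
            if (index > 0 && isFind) continue; // predicate args, not paths

            paths.Add(arg); // default rule: positional is a path
        }
        return paths;
    }
}
```

For example, `PathPositionals("chmod", ["755", "file"])` keeps only `file`, and `PathPositionals("find", ["/var", "-name", "*.log"])` keeps only `/var`.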
Some flags take values (-o file, -C /repo, --output=file). The parser
must know which flags consume the next token as a value. Curated table:
```csharp
internal static readonly IReadOnlyDictionary<string, HashSet<string>>
    FlagsWithValue = new Dictionary<string, HashSet<string>>(
        StringComparer.OrdinalIgnoreCase)
    {
        ["git"]    = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "-C", "--git-dir", "--work-tree" },
        ["curl"]   = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "-o", "--output", "-d", "--data" },
        ["wget"]   = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "-O", "--output-document" },
        ["docker"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "-v", "--volume", "-f", "--file" },
        ["tar"]    = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "-f", "--file", "-C", "--directory" },
        // Add as corpus surfaces real cases.
    };
```

Note: the value type is `HashSet<string>` (not `IReadOnlySet<string>`) because `IReadOnlySet<string>` is .NET 5+ only and the library multi-targets `netstandard2.0`. Internal-only; no public-API impact.

Note: the verb-chain walk consumes flag-with-value pairs transparently. For `git -C /repo log`, the walk consumes `-C /repo` before evaluating the next token; `log` is then verb-like and extends the chain, producing `Verb.Tokens = ["git", "log"]` per §12's example. The same mechanic lets `git -C /repo worktree list` extract the full 3-token chain per §6.1.

When a flag-with-value consumes the next token, the consumed token's
IsPath flag is set if the value is path-shaped (per the resolver in §8).
For git -C /repo log: the -C flag consumes /repo, marks it as a
path, then the verb chain continues with log.
--output=file (equals form) is parsed as one token; the path value after
= is extracted into a synthetic Arg with IsPath=true.
For each `Arg` with potential path content, the resolver attempts to produce a normalized absolute path. Resolution order:

0. Single-quoted bypass. If the source token came from a single-quoted string (per §5: bytes are preserved literally, with no escape processing and no variable expansion), the resolver skips steps 1–5 entirely. Kind is `Literal`; `IsPath` is `true` and `Resolved` is set only when the slot is a path AND `TryResolveAbsolutePath` on the raw bytes succeeds. So `cat '/etc/passwd'` still produces a resolved path, but `echo '$HOME'` stays literal: `$HOME` is not expanded inside single quotes.
1. Tilde expansion. `~` → `BashParserOptions.HomeDirectory`. `~/foo` → `<home>/foo`. `~user` is not supported → `DynamicSkip`.
2. Env-var substitution. `$VAR` and `${VAR}` are not expanded even if the value is in `Environment`. We treat any env var reference as `DynamicSkip` because the env var available at parse time may differ from what's available when the agent's command actually runs. `$HOME` is the only exception: we treat it as equivalent to `~` and expand it from `BashParserOptions.HomeDirectory`.
3. `filesystem::/path` prefix stripping. Some tools emit `filesystem::/path/to/file`; the prefix is stripped, leaving `/path/to/file`.
4. Glob detection. Tokens containing `*`, `?`, or `[` are marked `ArgKind.Glob`. The resolver does not expand globs. The token stays as-is in `Raw`; `Resolved` is null.

   In a path-arg slot: `IsPath = true`. Consumers can apply the "covering directory" heuristic (`Path.GetDirectoryName(Raw)`) to reason about the directory the glob resolves under (e.g. `/tmp/*.bak` → `/tmp`).

   In a non-path slot: `IsPath = false`.

   Per locked interpretation #3, glob and DynamicSkip carry distinct signals: globs preserve a useful covering-dir hint that DynamicSkip tokens lack.
5. Relative path resolution. Tokens not starting with `/` (or `\\` on Windows, or a Windows drive letter `X:`) are joined to `BashParserOptions.WorkingDirectory` (lazy fallback to `Environment.CurrentDirectory` when null). On `IOException` / path-format exceptions during resolution, fall through to `Kind = DynamicSkip, IsPath = false, Resolved = null`.
6. DynamicSkip predicates. A token is `Kind = DynamicSkip, IsPath = false, Resolved = null` when:
   - It contains an unresolved env-var reference (other than `$HOME`) in a slot the verb's rule classifies as a path.
   - Resolution throws an `IOException` or path-format exception.

   Globs do NOT downgrade to DynamicSkip; they carry their own Kind so consumers can still apply the covering-dir heuristic. Consumers must not use `Raw` as a literal path for `DynamicSkip` tokens.

When deciding whether a token "looks like a path" (used to decide whether to apply the resolver):
```
LooksLikePath(token) =
    token starts with '/'                          (Unix absolute)
 || token starts with '\\' or '<letter>:'          (Windows absolute)
 || token starts with './' or '../'                (Unix relative)
 || token starts with '~'                          (Tilde)
 || token contains '/' anywhere
 || token contains '\\' at a NON-TRAILING position
 || token ends with a known file extension (.json, .md, .txt, .conf, ...)
 || token is in the args of a FileVerb at a position the per-verb rule
    marks as a path
```

A lone trailing \\ is excluded because it commonly appears as a
double-quote escape-collapse artifact ("foo\\" lexes to Value foo\\)
and is not a meaningful path signal on its own.
The per-verb rule wins when present; the heuristic is the fallback.
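The predicate above might translate to C# roughly as follows. This is a sketch under assumptions: the extension list contains only the four extensions the spec names (the real list is longer), '\\' is read as a single backslash character, and the per-verb rule is modeled as a hypothetical nullable parameter rather than a real table lookup.

```csharp
using System;

public static class PathHeuristic
{
    // Only the extensions listed explicitly in the spec; the full list is elided there.
    private static readonly string[] KnownExtensions = { ".json", ".md", ".txt", ".conf" };

    // perVerbRule: hypothetical stand-in for the per-verb lookup.
    // null = no rule for this slot, so the heuristic decides.
    public static bool LooksLikePath(string token, bool? perVerbRule = null)
    {
        if (perVerbRule.HasValue) return perVerbRule.Value;   // the per-verb rule wins
        if (token.Length == 0) return false;

        char first = token[0];
        bool isLetter = (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');

        if (first == '/' || first == '\\') return true;                  // Unix / Windows absolute
        if (isLetter && token.Length >= 2 && token[1] == ':') return true; // drive letter
        if (token.StartsWith("./", StringComparison.Ordinal)) return true;
        if (token.StartsWith("../", StringComparison.Ordinal)) return true;
        if (first == '~') return true;                                   // tilde
        if (token.Contains('/')) return true;

        // A backslash counts only at a non-trailing position.
        int backslash = token.IndexOf('\\');
        if (backslash >= 0 && backslash < token.Length - 1) return true;

        foreach (var ext in KnownExtensions)
        {
            if (token.EndsWith(ext, StringComparison.OrdinalIgnoreCase)) return true;
        }
        return false;
    }
}
```

Checking the first backslash is enough for the non-trailing test: if any backslash sits before the last character, the first one does too.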
The agent's natural idiom is cd /target && cmd1 && cmd2. Bash semantics:
cmd1 and cmd2 execute with cwd /target. The parser honors this for
path attribution within the same compound.
1. First clause is a cd or chdir verb: the cd target becomes the attributed cwd for subsequent clauses in the same compound. Only cd and chdir propagate attribution per locked interpretation #5. pushd, popd, push-location, and set-location are still listed in CwdVerbs so their first non-flag positional is path-classified (the target shows up as IsPath=true), but they do not add a synthetic attribution arg to subsequent clauses. A future v0.1.x or v0.2 with PowerShell support may model pushd/popd as a proper directory stack.
2. Subsequent clauses inherit the attributed cwd as if it were prepended with -C semantics. Specifically: a synthetic Arg with IsPath=true, Resolved=<cd target>, and Kind=Literal is added to each subsequent clause's Args list at the end, marked with a flag IsCwdAttribution=true so consumers can distinguish it from user-emitted args. (Add IsCwdAttribution: bool to the Arg record. Default false.)
3. A subsequent cd in the same compound replaces the attributed cwd for clauses after it. (cd /a && cmd1 && cd /b && cmd2 → cmd1 inherits /a, cmd2 inherits /b.) The replacing cd /b itself still receives /a as a synthetic attribution arg (rule 2) before becoming the new source: additive semantics per rule 5.
4. Subshell boundaries reset attribution. cd /a && (cd /b && cmd1) && cmd2: cmd1 (inside subshell) inherits /b; cmd2 (outside subshell) inherits /a (the subshell's cd /b does not leak out). A subshell inherits outer attribution on entry (so cd /a && (cmd) still attributes cmd to /a) but its own cd changes stay isolated.
5. Attribution does not change the clause's verb or original args. The attribution is purely additive: the cd clause itself is still parsed normally, and subsequent clauses retain everything the user typed, plus the synthetic Arg.
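To make the additive semantics concrete, here is a minimal sketch of the propagation over a flat compound. It assumes no subshells and literal absolute cd targets, and it leaves a bare cd (no target) out of scope. SketchArg, SketchClause, and CdAttribution are hypothetical names for illustration only; the real records are defined in §2.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified clause model.
public sealed record SketchArg(string Raw, bool IsPath = false,
                               string? Resolved = null, bool IsCwdAttribution = false);
public sealed record SketchClause(string Verb, List<SketchArg> Args);

public static class CdAttribution
{
    // Appends a synthetic attribution arg to every clause that follows a cd/chdir.
    public static void Propagate(IReadOnlyList<SketchClause> clauses)
    {
        string? cwd = null;
        foreach (var clause in clauses)
        {
            // Read the new target before mutating Args so the synthetic arg
            // is never mistaken for the user's cd target.
            string? newTarget = null;
            if (clause.Verb is "cd" or "chdir")
            {
                newTarget = clause.Args.Find(a => a.IsPath)?.Resolved;
            }

            // Every clause after a cd receives the current attribution at the
            // end of its Args, including a later, replacing cd.
            if (cwd is not null)
            {
                clause.Args.Add(new SketchArg(cwd, IsPath: true, Resolved: cwd,
                                              IsCwdAttribution: true));
            }

            // A cd becomes the source for the clauses after it.
            if (newTarget is not null)
            {
                cwd = newTarget;
            }
        }
    }
}
```

Running this over cd /a && cmd1 && cd /b && cmd2 leaves the first cd untouched, gives cmd1 and the second cd an /a attribution, and gives cmd2 a /b attribution.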
When the cd target itself is Kind=DynamicSkip (e.g. cd $REPO), we
statically don't know the resolved cwd. To preserve the cwd-uncertainty
signal for subsequent clauses:
- A synthetic Arg { Raw="<dynamic-cwd>", Resolved=null, Kind=DynamicSkip, IsPath=false, IsCwdAttribution=true } is appended to each subsequent clause (instead of the literal-cd flavor).
- Relative path args in subsequent clauses are not re-resolved against a fall-back cwd; they surface as Kind=DynamicSkip, IsPath=false, Resolved=null so consumers route to safe-fail rather than trust a guessed working directory.
Consumers that iterate IsPath=true args won't see the synthetic
attribution arg; consumers that specifically check IsCwdAttribution
can detect "this clause's cwd context is unknown" and elevate to
user-prompt instead of treating it like a default-cwd command.
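A consumer could distinguish the three cwd situations with a check along these lines. This is a sketch only: Arg and ArgKind are hypothetical stand-ins for the §2 records, and CwdContext is an invented consumer-side enum.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the library's records.
public enum ArgKind { Literal, Glob, DynamicSkip }
public sealed record Arg(string Raw, ArgKind Kind, bool IsPath, bool IsCwdAttribution = false);

// Invented for this sketch: how much the consumer knows about a clause's cwd.
public enum CwdContext { Default, Attributed, Unknown }

public static class CwdInspection
{
    public static CwdContext Classify(IEnumerable<Arg> args)
    {
        foreach (var arg in args)
        {
            if (!arg.IsCwdAttribution) continue;
            // The <dynamic-cwd> flavor carries Kind=DynamicSkip and IsPath=false.
            return arg.Kind == ArgKind.DynamicSkip ? CwdContext.Unknown : CwdContext.Attributed;
        }
        return CwdContext.Default; // no cd earlier in the compound
    }
}
```

A gate would typically elevate CwdContext.Unknown to a user prompt rather than evaluate the clause against a default working directory.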
Input: cd /target && git -C /other log && cat file.txt
Parsed clauses:
Clause 0: Operator=None, Verb=[cd], Args=[/target]
Clause 1: Operator=AndIf, Verb=[git, log],
Args=[
Arg{Raw="-C",IsFlag=true},
Arg{Raw="/other",IsPath=true,Resolved="/other"},
Arg{Raw="/target",IsPath=true,Resolved="/target",IsCwdAttribution=true}
]
Clause 2: Operator=AndIf, Verb=[cat],
Args=[
Arg{Raw="file.txt",IsPath=true,Resolved="/target/file.txt"},
Arg{Raw="/target",IsPath=true,Resolved="/target",IsCwdAttribution=true}
]
Note: file.txt in clause 2 resolves against the attributed cwd
/target to produce /target/file.txt. The attributed-cwd Arg is also
appended for completeness, even though the resolver already used it.
Consumers can choose to ignore IsCwdAttribution=true args if they
already see the resolved path in another arg.
Subshells are clauses wrapped in parens: (cd /a && cmd). The parser
recognizes the parens and flattens the subshell's inner clauses into
the parent's Clauses list, marking each with IsSubshell=true so
consumers can distinguish them from outer-compound clauses. A subshell
inherits the outer compound's cd attribution on entry but its own cd
changes stay isolated to the subshell (rule 4 above).
Specifically: (cd /b && cmd) && cmd2 produces three clauses:
Clause 0: Op=None, Verb=cd, Args=[/b], IsSubshell=true
Clause 1: Op=AndIf, Verb=cmd, Args=[/b attribution], IsSubshell=true
Clause 2: Op=AndIf, Verb=cmd2, Args=[] // no /b attribution — subshell isolated
bash -c "inner command" and sh -c "inner command" are common wrappers
the agent emits. The parser:
- Recognizes the bash -c or sh -c prefix.
- Parses the quoted argument as a fresh ParsedCommand.
- Surfaces the inner command's clauses inline in the outer's Clauses list, each with IsCommandStringWrapped=true.
Example: bash -c "cd /a && cmd" produces:
Clause 0: Op=None, Verb=cd, Args=[/a], IsCommandStringWrapped=true
Clause 1: Op=AndIf, Verb=cmd, Args=[/a attribution], IsCommandStringWrapped=true
The outer bash -c itself does not appear as a clause — it's "consumed"
by the recursion. Consumers that care that this came from a wrapper can
inspect IsCommandStringWrapped on the surfaced clauses.
Recursion limit: parse bash -c "bash -c ..." chains up to depth 5.
Deeper nesting → set the outer ParsedCommand.IsUnparseable = true with
reason "bash -c recursion depth exceeded (>5)" per locked interpretation
#4. (Clause has no IsUnparseable field; we surface the overflow on the
top-level ParsedCommand so consumers safe-fail per §11.)
When the parser cannot produce a clean AST:
- Set ParsedCommand.IsUnparseable = true.
- Set UnparseableReason to a human-readable diagnostic.
- Return whatever clauses were successfully parsed in Clauses. May be empty.
- Never throw on well-formed input strings (only throw on null).
Conditions that produce IsUnparseable = true:
- Unbalanced quotes ("foo with no closing ").
- Unbalanced parens ((cmd && cmd2).
- Unrecognized control-flow keywords (for, while, do, done, then, fi, case, esac).
- Function definitions (name() { ... }).
- Process substitution (<(cmd), >(cmd)).
- Arithmetic expansion $((expr)) (per §1 non-goal; lexer emits an UNPARSEABLE_SENTINEL token; parser sets the outer flag).
- Complex parameter expansion ${var//pat/repl} (per §1 non-goal; same mechanism).
- Recursion depth exceeded on bash -c chains (>5 levels).
Diagnostic precedence. When multiple conditions could fire on a
single input (e.g. case x in a) ;; esac is both a control-flow
keyword AND has unbalanced parens), the parser checks them in this
order so the most informative reason wins:
1. Lexer-emitted UnparseableSentinel tokens (unbalanced quote / unterminated heredoc / arithmetic / complex parameter expansion).
2. Control-flow keyword at verb position (start of input, or immediately after a clause separator: &&, ||, ;, |, or an opening paren). Catches case x in a) ;; esac before the ) triggers a paren-balance error.
3. Function definition pattern (Word immediately followed by (, )).
4. Process substitution (<( or >( adjacent).
5. Segment-split errors (unbalanced parens, unexpected operator).
6. bash -c recursion depth cap.
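One way to picture the precedence is as a fixed ordering over detected conditions. The sketch below is illustrative only: the Anomaly enum and DiagnosticPrecedence class are hypothetical, and the real parser may short-circuit on the first failing check instead of collecting flags.

```csharp
using System;

// Hypothetical flags, one per condition in the precedence list.
[Flags]
public enum Anomaly
{
    None = 0,
    LexerSentinel = 1,
    ControlFlowKeyword = 2,
    FunctionDefinition = 4,
    ProcessSubstitution = 8,
    SegmentSplitError = 16,
    RecursionDepthExceeded = 32
}

public static class DiagnosticPrecedence
{
    // Same order as the list: earlier entries win.
    private static readonly Anomaly[] Order =
    {
        Anomaly.LexerSentinel,
        Anomaly.ControlFlowKeyword,
        Anomaly.FunctionDefinition,
        Anomaly.ProcessSubstitution,
        Anomaly.SegmentSplitError,
        Anomaly.RecursionDepthExceeded
    };

    // Picks the single condition whose reason should be reported.
    public static Anomaly MostInformative(Anomaly detected)
    {
        foreach (var candidate in Order)
        {
            if ((detected & candidate) != 0) return candidate;
        }
        return Anomaly.None;
    }
}
```

For case x in a) ;; esac, both ControlFlowKeyword and SegmentSplitError would be set, and the keyword diagnostic is the one reported.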
Consumers (e.g. Netclaw's gate evaluator) route unparseable commands to a safe-fail path (prompt the user; offer only Once and Deny — no persistent grants on shapes the parser can't model).
A handful of input/expected-AST pairs to anchor understanding. These belong in the corpus (§13) verbatim.
Input: ls -la /tmp
ParsedCommand {
Source = "ls -la /tmp",
IsUnparseable = false,
Clauses = [
Clause {
Operator = None,
Verb = VerbChain { Tokens = ["ls"] },
Args = [
Arg { Raw = "-la", IsFlag = true, Kind = Literal },
Arg { Raw = "/tmp", IsPath = true, Resolved = "/tmp", Kind = Literal }
],
Redirects = [],
IsSubshell = false,
IsCommandStringWrapped = false
}
]
}
Input: git push origin main
Clauses = [
Clause {
Verb = VerbChain { Tokens = ["git", "push", "origin", "main"] },
Args = []
}
]
The greedy heuristic absorbs origin and main because they're
syntactically indistinguishable from subcommand verbs (lowercase
identifiers, no path-shape). Consumers gating on git push * use
pattern-prefix length 2 — see §6.1.1.
Input: freshdesk ticket list --status open
Clauses = [
Clause {
Verb = VerbChain { Tokens = ["freshdesk", "ticket", "list"] },
Args = [
Arg { Raw = "--status", Kind = Literal, IsFlag = true },
Arg { Raw = "open", Kind = Literal, IsPath = false }
]
}
]
The walk stops at --status (a flag with no FlagsWithValue entry for
freshdesk). The full subcommand stack is captured without requiring a
curated table entry — the canonical benefit motivating the change.
Input: cd /target && cmd1 && cmd2 file.txt
Clauses = [
Clause { Verb = [cd], Args = [/target attributed-as-path], Op = None },
Clause {
Verb = [cmd1], Op = AndIf,
Args = [Arg { Raw = "/target", Resolved = "/target",
IsPath = true, IsCwdAttribution = true }]
},
Clause {
Verb = [cmd2], Op = AndIf,
Args = [
Arg { Raw = "file.txt", Resolved = "/target/file.txt", IsPath = true },
Arg { Raw = "/target", Resolved = "/target",
IsPath = true, IsCwdAttribution = true }
]
}
]
Input: git -C /repo log
Clauses = [
Clause {
Verb = VerbChain { Tokens = ["git", "log"] },
Args = [
Arg { Raw = "-C", IsFlag = true },
Arg { Raw = "/repo", IsPath = true, Resolved = "/repo" }
]
}
]
Input: cmd > /tmp/out.txt
Clauses = [
Clause {
Verb = [cmd],
Args = [],
Redirects = [Redirect { Direction = Out, Target = "/tmp/out.txt" }]
}
]
Input: cd /a && (cd /b && cmd1) && cmd2
Clauses = [
Clause { Verb = [cd], Args = [/a], Op = None },
Clause { Verb = [cd], Op = AndIf, IsSubshell = true,
  Args = [/b, /a attribution from outer compound] },
Clause { Verb = [cmd1], Op = AndIf, IsSubshell = true,
Args = [/b attribution — local to subshell] },
Clause { Verb = [cmd2], Op = AndIf,
Args = [/a attribution — inherited from outer cd, NOT /b] }
]
Input: rm $UNRESOLVED/foo
Clauses = [
Clause {
Verb = [rm],
Args = [
Arg { Raw = "$UNRESOLVED/foo", Kind = DynamicSkip, IsPath = false,
Resolved = null }
]
}
]
Consumer impact: zone-gate sees zero paths to evaluate; routes to the fallback "treat as one untrusted path = the raw token" prompt.
Input: for i in 1 2; do echo $i; done
ParsedCommand {
Source = "for i in 1 2; do echo $i; done",
IsUnparseable = true,
UnparseableReason = "control-flow keyword 'for' is not supported in v0.1",
Clauses = [] // or partial; consumer should not rely on contents
}
The corpus is the acceptance contract for the parser. Implementation is "done" when every corpus entry parses to its expected AST.
tests/ShellSyntaxTree.Tests/Corpus/bash/*.json — one file per corpus
entry. File name pattern: NN_descriptive_slug.json where NN is a
zero-padded sequence number.
Each file:
{
"name": "Multi-token verb: git push",
"input": "git push origin main",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["git", "push", "origin", "main"],
"args": [],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Optional explanation of edge case being captured."
}

Author at least:
- 10 simple-verb cases (ls, pwd, echo, cat, grep, etc.)
- 10 multi-token-verb cases (git push, dotnet test, docker compose up, etc.)
- 15 compound cases (&&, ||, ;, | combinations)
- 10 cd-in-compound propagation cases (single, sequential, with subshell)
- 10 quote-handling cases (single, double, escaped, mixed)
- 10 redirect cases (>, >>, <, 2>, 2>>, multiple redirects)
- 10 subshell cases (with and without isolation effects)
- 10 bash -c recursion cases (depth 1, 2, with inner compounds)
- 10 dynamic-skip cases ($VAR, ${VAR}, ~user, glob args)
- 10 per-verb path-rule cases (chmod, chown, find, grep, curl, git -C, etc.)
- 10 unparseable cases (unbalanced quotes, control-flow keywords, function definitions)
Total minimum: 105 entries. Strive for 150+ once seeded from sanitized real-world commands (see §14).
A single xunit test method enumerates tests/ShellSyntaxTree.Tests/Corpus/bash/*.json, parses
each input, and asserts the result matches expected field-by-field.
The runner emits a per-corpus-entry test name so failures point at the
specific case.
[Theory]
[MemberData(nameof(CorpusEntries))]
public void Corpus_entry_parses_to_expected_ast(CorpusEntry entry)
{
var parser = new BashParser();
var actual = parser.Parse(entry.Input);
AstAssert.Equal(entry.Expected, actual); // structural equality
}
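// Corpus keys are camelCase and enums are strings ("operator": "None").
// Assumes CorpusEntry has PascalCase properties and no [JsonPropertyName] attributes.
private static readonly JsonSerializerOptions CorpusJson = new()
{
    PropertyNameCaseInsensitive = true,
    Converters = { new JsonStringEnumConverter() } // System.Text.Json.Serialization
};

public static IEnumerable<object[]> CorpusEntries()
{
    var dir = Path.Combine(AppContext.BaseDirectory, "Corpus", "bash");
    foreach (var file in Directory.GetFiles(dir, "*.json"))
    {
        var entry = JsonSerializer.Deserialize<CorpusEntry>(File.ReadAllText(file), CorpusJson)
            ?? throw new InvalidDataException($"{file} deserialized to null");
        yield return [entry];
    }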
}

AstAssert.Equal is a helper that does structural equality with helpful
diff messages on mismatch. Implement to taste.
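One possible shape for such a helper, offered as a sketch rather than the intended implementation: serialize both sides with System.Text.Json and walk the two trees, collecting a path-qualified message per mismatch. It compares every serialized property, so a real version would need to skip fields the corpus JSON omits, and a test project would likely throw an xunit exception instead of InvalidOperationException.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Nodes;

public static class AstAssert
{
    public static void Equal(object expected, object actual)
    {
        var diffs = Diff(expected, actual);
        if (diffs.Count > 0)
        {
            throw new InvalidOperationException(string.Join(Environment.NewLine, diffs));
        }
    }

    // Returns one message per differing leaf, prefixed with its JSON-style path.
    public static List<string> Diff(object expected, object actual)
    {
        var diffs = new List<string>();
        Walk("$",
             JsonSerializer.SerializeToNode(expected, expected.GetType()),
             JsonSerializer.SerializeToNode(actual, actual.GetType()),
             diffs);
        return diffs;
    }

    private static void Walk(string path, JsonNode? exp, JsonNode? act, List<string> diffs)
    {
        if (exp is JsonObject eo && act is JsonObject ao)
        {
            // Visit the union of keys so missing properties are reported too.
            var keys = new SortedSet<string>(StringComparer.Ordinal);
            foreach (var kv in eo) keys.Add(kv.Key);
            foreach (var kv in ao) keys.Add(kv.Key);
            foreach (var key in keys) Walk($"{path}.{key}", eo[key], ao[key], diffs);
        }
        else if (exp is JsonArray ea && act is JsonArray aa)
        {
            if (ea.Count != aa.Count)
            {
                diffs.Add($"{path}: expected {ea.Count} items, got {aa.Count}");
            }
            for (int i = 0; i < Math.Min(ea.Count, aa.Count); i++)
            {
                Walk($"{path}[{i}]", ea[i], aa[i], diffs);
            }
        }
        else
        {
            var e = exp?.ToJsonString() ?? "null";
            var a = act?.ToJsonString() ?? "null";
            if (e != a) diffs.Add($"{path}: expected {e}, got {a}");
        }
    }
}
```

Reporting every mismatch with its path, rather than stopping at the first, tends to make a failing corpus entry quicker to diagnose.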
A portion of the corpus seeds from real shell commands captured from
agent dogfood logs. The seed source is a daemon log file at
~/.netclaw/logs/daemon-2026-05-09.log (and similar). These logs contain
PII (usernames, repo paths, channel/thread IDs) that must not appear
in the public corpus.
Apply these transformations to every seeded entry before committing:
| Pattern | Replacement |
|---|---|
| /home/<username>/ (any specific username) | /home/user/ |
| /Users/<username>/ (macOS) | /Users/user/ |
| ~/<username>/ | ~/ |
| Specific repo paths like /home/user/repositories/stannardlabs/<repo> | /home/user/repos/sample-repo |
| Specific repo names (not in the org's public list) | sample-repo or project |
| Slack channel IDs (D[A-Z0-9]{10}) | <channel> (only if appears in command) |
| Slack thread IDs (\d{10}\.\d{6}) | <thread> |
| Internal hostnames | internal-host.example |
| Email addresses | user@example.com |
| API keys, tokens, secrets (any [A-Za-z0-9]{20,} that looks key-shaped) | <redacted> (but prefer to drop the entry entirely) |
- Pull candidate commands from logs:
    grep -ohP "command \K\{[^}]+\}" ~/.netclaw/logs/daemon-*.log \
      | jq -r .Command | sort -u > /tmp/raw-corpus.txt
- Apply sanitization (script TBD): for each line, walk the table above.
- Manual review of each sanitized entry before committing. The script can miss patterns; a human (or careful agent) reviews for residual PII.
- Drop any entry that can't be cleanly sanitized (too many specific identifiers; rewrite as a fully-synthetic entry instead).
- Commit with a clear message: chore(corpus): seed from sanitized agent logs (NN entries).
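Since the sanitization script is still TBD, the following is only a sketch of how the mechanical rows of the table might be applied. CorpusSanitizer is a hypothetical name, the email pattern is an assumption (the table gives no regex for it), and rows that need judgment (repo names, internal hostnames, key-shaped secrets) are deliberately left to manual review.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CorpusSanitizer
{
    // Mechanical rows of the sanitization table, applied in order.
    private static readonly (Regex Pattern, string Replacement)[] Rules =
    {
        (new Regex(@"/home/[^/\s]+/"), "/home/user/"),
        (new Regex(@"/Users/[^/\s]+/"), "/Users/user/"),
        (new Regex(@"\bD[A-Z0-9]{10}\b"), "<channel>"),
        (new Regex(@"\b\d{10}\.\d{6}\b"), "<thread>"),
        // Assumed email shape; not specified by the table.
        (new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "user@example.com"),
    };

    public static string Sanitize(string command)
    {
        foreach (var (pattern, replacement) in Rules)
        {
            command = pattern.Replace(command, replacement);
        }
        return command;
    }
}
```

The rules are idempotent, so re-running the sketch over already-sanitized entries leaves them unchanged.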
Before any corpus PR merges, CI runs a regex check against the corpus
files for residual PII patterns. The check fails the build if any
sanitization-rule pattern appears in any committed corpus file. Implement
as a small dotnet test that scans tests/ShellSyntaxTree.Tests/Corpus/bash/*.json for the
forbidden patterns.
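The audit could be built around a small scanner like the one below, which a test method would call on the contents of each corpus file and assert an empty result. This is a sketch: PiiAudit is a hypothetical name, only the regex-expressible rows are covered, and the negative lookahead that permits the sanitized /home/user/ form is my own reading of the table.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PiiAudit
{
    private static readonly (string Name, Regex Pattern)[] Forbidden =
    {
        // Any home directory other than the sanitized placeholder.
        ("home directory", new Regex(@"/home/(?!user/)[^/\s""]+/")),
        ("macOS home directory", new Regex(@"/Users/(?!user/)[^/\s""]+/")),
        ("Slack channel ID", new Regex(@"\bD[A-Z0-9]{10}\b")),
        ("Slack thread ID", new Regex(@"\b\d{10}\.\d{6}\b")),
    };

    // Returns one entry per residual match; an empty list means the content is clean.
    public static List<string> FindHits(string content)
    {
        var hits = new List<string>();
        foreach (var (name, pattern) in Forbidden)
        {
            foreach (Match match in pattern.Matches(content))
            {
                hits.Add($"{name}: {match.Value}");
            }
        }
        return hits;
    }
}
```

Including the matched text in each hit lets the failing build point at the exact residue that needs sanitizing.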
The repo template already has:
- .github/workflows/pr_validation.yml: runs dotnet test on PR.
- .github/workflows/publish_nuget.yml: publishes to NuGet on release tag.
Adapt for ShellSyntaxTree:
- Trigger NuGet publish on tag pattern v*.*.* (e.g. v0.1.0-alpha).
- Test job runs the corpus runner plus all unit tests.
- PII audit job runs the sanitization-pattern scan over tests/ShellSyntaxTree.Tests/Corpus/.
- v0.1.x-alpha: pre-release alpha cycle. Public API surface per §2 is locked; internal data and behavior are subject to course-correction while real-world feedback lands (e.g. v0.1.4-alpha replaces the BashArity static table with the greedy verb-chain heuristic per issue #27).
- v0.1.0: first publishable non-alpha cut. Bash-only.
- v0.1.x (post-0.1.0): additive changes and SPEC-conformance fixes (more verb table entries, more corpus, bug fixes). A fix may shift the parsed-AST shape when the prior shape violated this SPEC, e.g. v0.1.5 makes a bare newline a statement separator per §4. The §2 public API surface stays locked.
- v0.2.0: first PowerShell parser implementation (PwshParser). Adds the shared ShellParserOptions base, the additive VerbChain.CanonicalVerb / VerbChain.IsDynamic fields, and the breaking Clause.IsBashCWrapped → IsCommandStringWrapped rename. A breaking AST change on a 0.x minor is permitted by Appendix A when RELEASE_NOTES.md carries the old→new mapping and Netclaw is updated in lockstep. See SPEC.POWERSHELL.md.
- v1.0.0: ready when at least one external consumer beyond Netclaw ships against it without finding API gaps.
Update RELEASE_NOTES.md for each tagged release. Format:
0.1.0-alpha YYYY-MM-DD
* First publishable cut.
* Bash parser per SPEC.md v0.1.
* Corpus: N entries.
* Public API: IShellParser, BashParser, ParsedCommand, Clause, VerbChain,
Arg, Redirect, ArgKind, RedirectDirection, CompoundOperator.
A natural order for the implementer:
1. Bootstrap projects. Create src/ShellSyntaxTree/ShellSyntaxTree.csproj (library) and tests/ShellSyntaxTree.Tests/ShellSyntaxTree.Tests.csproj (xunit). Update SampleSln.slnx (rename to ShellSyntaxTree.slnx) and delete the Akka.Console sample.
2. Update template defaults.
   - Directory.Build.props: replace Akka metadata with ShellSyntaxTree.
   - README.md: real intro.
   - LICENSE: keep Apache-2.0 (already correct).
   - Directory.Packages.props: add xunit, drop Akka.Hosting.
   - Tags: bash, shell, parser, ast.
3. Write public API skeleton (§2): interface + record stubs that compile but throw NotImplementedException on Parse(). Lock the surface first.
4. Implement BashLexer (§5). Heavy unit tests on tokenization.
5. Implement FILE / CWD verb tables and IsVerbLikeToken predicate (§6) as static data + helper.
6. Implement BashParser (§4). One production at a time; unit-test each.
7. Implement Resolver (§8). Unit-test each resolution rule.
8. Implement per-verb path-arg rules (§7). Unit-test per verb.
9. Implement cd-in-compound propagation (§9). Unit-test.
10. Implement subshell + bash -c recursion (§10). Unit-test.
11. Implement parser anomaly safe-fail (§11). Unit-test.
12. Author corpus (§13): start with 105 hand-authored entries covering each section. Iterate parser to make all pass.
13. Sanitize and seed from real logs (§14): script + manual review. Add 50-100 more corpus entries.
14. Wire CI (§15). Tag v0.1.0-alpha when corpus is green and PII audit passes.
Estimated implementation effort: 600-800 LOC of source + 400-600 LOC of test infrastructure + 100-150 corpus entries (~50 KB JSON).
Post-v0.1.0 increments (e.g. v0.1.5 newline-as-statement-separator) are
sequenced through IMPLEMENTATION_PLAN.md — §16 records the one-time
v0.1.0 build order, not the ongoing changelog.
v0.1.0-alpha ships when all of the following hold:
- ✅ Public API matches §2 exactly. dotnet pack produces a ShellSyntaxTree.0.1.0-alpha.nupkg.
- ✅ Every corpus entry in tests/ShellSyntaxTree.Tests/Corpus/bash/*.json parses to its expected AST. dotnet test runs them all and passes.
- ✅ Corpus has at least 105 entries spanning the categories in §13.
- ✅ PII audit scan over tests/ShellSyntaxTree.Tests/Corpus/bash/*.json finds zero hits.
- ✅ dotnet test runs on PR via GitHub Actions and passes.
- ✅ Tagging v0.1.0-alpha triggers publish_nuget.yml and the package appears on nuget.org.
- ✅ Netclaw can consume the package via <PackageReference> and the IShellParser resolves at runtime in Netclaw's DI container.
- ✅ At least one Netclaw integration test exercises a real corpus entry through the live Netclaw matcher and gets the expected gate decision.
- PowerShell and cmd parsers.
- Variable expansion (any kind).
- Heredoc body extraction.
- Process substitution <(cmd), >(cmd).
- Function definitions.
- Arithmetic expansion $((...)).
- for / while / case control flow.
- Performance optimization beyond "fast enough" (~1ms typical).
- Source-mapping (line/column for AST nodes; useful for IDEs, irrelevant for security gates).
- Extensible verb table loading from config (v0.1 ships static tables; consumers can layer their own knowledge on top via BashParserOptions in a future version).
What Netclaw expects from this library:
- IShellParser is registered in DI and Parse(string) returns a ParsedCommand that Netclaw walks.
- For each Clause, Netclaw extracts:
  - Verb.Tokens for the verb-pattern gate evaluation.
  - All Args where IsPath = true (excluding IsCwdAttribution = true when the resolved path already appears in another arg) for the zone gate evaluation.
  - All Redirects where the target is path-shaped; the target is a path the clause "operates on" for zone-gate purposes.
- When IsUnparseable = true, Netclaw routes to safe-fail (prompt user; offer Once / Deny only).
- When any Arg.Kind = DynamicSkip, Netclaw treats that token as "path unknown" and falls back to prompting on the raw command for the zone gate.
- Hard-deny rules in Netclaw evaluate against parsed Clause records, not raw text (except the rawText escape-hatch rules; those operate on the rendered clause string, recoverable via Clause.ToCommandString() if we add it, or string.Join(" ", verb + args + redirects) if we don't).
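The zone-gate portion of this contract could look roughly like the sketch below on the consumer side. The record shapes are hypothetical simplifications of §2, ZoneGateInput is an invented name, and "path-shaped" for redirects is approximated by excluding file-descriptor targets such as &1, which is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for the §2 records.
public enum ArgKind { Literal, Glob, DynamicSkip }
public sealed record Arg(string Raw, ArgKind Kind, bool IsPath,
                         string? Resolved = null, bool IsCwdAttribution = false);
public sealed record Redirect(string Target);
public sealed record Clause(IReadOnlyList<string> Verb, IReadOnlyList<Arg> Args,
                            IReadOnlyList<Redirect> Redirects);

public static class ZoneGateInput
{
    // Returns the paths to evaluate, or null when any arg is DynamicSkip
    // ("path unknown": the caller prompts on the raw command instead).
    public static IReadOnlyList<string>? PathsFor(Clause clause)
    {
        if (clause.Args.Any(a => a.Kind == ArgKind.DynamicSkip)) return null;

        var paths = new List<string>();

        // User-emitted args first (OrderBy is stable, false sorts before true),
        // so a cwd-attribution arg is dropped when its path is already present.
        foreach (var arg in clause.Args.Where(a => a.IsPath).OrderBy(a => a.IsCwdAttribution))
        {
            var path = arg.Resolved ?? arg.Raw; // globs have no Resolved; keep the pattern
            if (!paths.Contains(path)) paths.Add(path);
        }

        foreach (var redirect in clause.Redirects)
        {
            if (redirect.Target.StartsWith('&')) continue; // fd duplication, not a path
            if (!paths.Contains(redirect.Target)) paths.Add(redirect.Target);
        }
        return paths;
    }
}
```

Returning null rather than an empty list keeps "no paths" and "paths unknown" distinguishable, which matters because only the latter should force a prompt.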
The contract is stable — additive changes to AST records (new fields with
default values) are compatible; renaming or removing fields is breaking.
Before v1.0.0, while the library is in its 0.x line, a breaking AST change
MAY ship in a minor bump (e.g. the Clause.IsBashCWrapped →
IsCommandStringWrapped rename in v0.2.0) provided RELEASE_NOTES.md
documents the old→new mapping and the consumer (Netclaw) is updated in
lockstep. From v1.0.0 onward, renaming or removing a field requires a major
version bump.
OpenCode (Node) uses tree-sitter-bash. We considered porting that approach to .NET. The packaging cost is real:
- No first-class .NET tree-sitter binding. Community bindings exist but vary in maintenance.
- Native dependency: ship libtree-sitter + libtree-sitter-bash per platform (Linux x64, Linux arm64, macOS x64, macOS arm64, Windows x64). Five binaries to ship and maintain, plus PowerShell would need a separate native lib.
- AOT-trimming compatibility is uncertain.
- We don't need IDE-grade fidelity. Fork bombs and function definitions legitimately confuse our parser; we want them to mark IsUnparseable so the consumer routes to safe-fail. tree-sitter would parse them and we'd have to teach the consumer to ignore the result anyway.
The hand-rolled approach trades a higher ceiling for control over scope,
zero native deps, and a clean upgrade path to PowerShell via the same
IShellParser seam. For our use case, that trade is correct.