303 changes: 303 additions & 0 deletions docs/plans/2026-03-06-wp-css-token-processor-design.md
@@ -0,0 +1,303 @@
# Design: WP_CSS_Token_Processor

**Date:** 2026-03-06
**Status:** Approved
**Related:** https://github.com/WordPress/wordpress-develop/pull/11104, https://core.trac.wordpress.org/ticket/64771

---

## Background

When a user without `unfiltered_html` (e.g. Author role, or site admins on some multisite configurations) saves a post containing block-level custom CSS (`attrs.style.css`) with `&` or `>` characters, the `filter_block_content()` pipeline corrupts the CSS through a three-step mangling chain:

1. `parse_blocks()` / `json_decode()`: `\u0026` becomes `&`
2. `filter_block_kses_value()` / `wp_kses()`: `&` becomes `&amp;`, `>` becomes `&gt;` (KSES treats CSS as HTML)
3. `serialize_block_attributes()` / `json_encode()`: `&amp;` becomes `\u0026amp;`

Each subsequent save compounds the corruption. The root cause is that `wp_kses()` is an HTML sanitizer being applied to CSS, which is the wrong tool for the job. This class is the right tool.
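The chain above can be approximated outside WordPress with a rough sketch. Everything here is an assumption made for illustration: `htmlspecialchars()` is only a stand-in for the entity encoding that `wp_kses()` applies, `JSON_HEX_AMP | JSON_HEX_TAG` approximates the escaping done by `serialize_block_attributes()`, and the helper name `sketch_save_round_trip()` is hypothetical.

```php
<?php
/**
 * Hypothetical stand-in for one save through the filter pipeline.
 * It is not the real WordPress code path.
 */
function sketch_save_round_trip( string $serialized_attrs ): string {
	// Step 1: decoding turns \u0026 back into a literal "&".
	$attrs = json_decode( $serialized_attrs, true );

	// Step 2: an HTML-oriented encoder rewrites "&" and ">" as entities.
	// htmlspecialchars() is used here as a stand-in for wp_kses().
	$attrs['style']['css'] = htmlspecialchars( $attrs['style']['css'], ENT_NOQUOTES );

	// Step 3: re-serializing escapes the "&" of each entity as \u0026.
	return json_encode( $attrs, JSON_HEX_AMP | JSON_HEX_TAG );
}

$before    = '{"style":{"css":"\u0026 \u003E p { margin: 0; }"}}';
$after     = sketch_save_round_trip( $before );
$css_after = json_decode( $after, true )['style']['css'];

echo $css_after, "\n"; // &amp; &gt; p { margin: 0; }
```

After a single pass the stored CSS no longer contains a nesting selector or a child combinator, only their HTML entity forms.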

---

## Scope

### In scope (this session)

- `WP_CSS_Token_Processor` class: streaming CSS tokenizer
- `sanitize()` instance method: strips unsafe tokens/rules, returns safe CSS string
- `validate()` instance method: returns `true|WP_Error`
- `get_updated_css()` instance method: reconstructs CSS after manual token modifications
- `get_removed_tokens()` instance method: inspection after `sanitize()`
- Low-level navigation and modification methods
- Full inline PHPDoc
- `README.md` in `src/wp-includes/css-api/`
- Full test suite

### Out of scope (follow-on sessions)

- Integration with `filter_block_kses_value()` in `blocks.php`
- `WP_CSS_Processor`: rule/declaration-aware layer (v2)
- Replacing `process_blocks_custom_css()` in `WP_Theme_JSON`
- CSS selector query engine (TODO in `class-wp-block.php:385`)
- Customizer CSS and Global Styles CSS pipeline adoption

---

## Architecture

### Directory structure

```
src/wp-includes/
└── css-api/
    ├── class-wp-css-token-processor.php
    └── README.md

tests/phpunit/tests/
└── css-api/
    ├── WpCssTokenProcessorTest.php
    ├── WpCssTokenSanitizeTest.php
    └── WpCssTokenValidateTest.php
```

### Component map

```
WP_CSS_Token_Processor             - tokenizes a CSS string into a typed token stream
  |
  |  sanitize(): string            - strips unsafe tokens/rules, returns safe CSS
  |  validate(): true|WP_Error     - returns true, or WP_Error with reason code
  |  get_updated_css(): string     - reconstructs CSS after manual token modifications
```

The integration point (`filter_block_kses_value()` dispatching to `sanitize()` for `['style','css']` paths) is a follow-on PR and is not part of this session.

---

## `WP_CSS_Token_Processor`

### Design principles

- **Spec-inspired, safety-first:** follows the CSS Syntax Level 3 token vocabulary and structure, but prioritises correctness on security-relevant tokens over completeness. Gaps cause rejection/stripping rather than silent pass-through.
- **Forward-only streaming:** like `WP_HTML_Tag_Processor`, the processor advances a cursor through the input. No backtracking except via bookmarks (v2).
- **Non-destructive modification:** operates on the original string buffer and applies edits on output via `get_updated_css()`.
- **Instance-based API:** consistent with `WP_HTML_Tag_Processor`. Create an instance, call methods, retrieve output.
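The non-destructive modification principle can be illustrated with a small sketch. It assumes edits are recorded as non-overlapping byte ranges against an untouched source buffer and only applied when output is requested. The class name `Sketch_Deferred_Edits` and its methods are hypothetical and are not part of the proposed API.

```php
<?php
/**
 * Hypothetical illustration of deferred edits; not the proposed class.
 * Assumes recorded edits never overlap.
 */
final class Sketch_Deferred_Edits {
	private string $source;
	private array $edits = array();

	public function __construct( string $source ) {
		$this->source = $source;
	}

	// Record a replacement for a byte range. The source buffer is not modified.
	public function replace( int $start, int $length, string $text ): void {
		$this->edits[] = array(
			'start'  => $start,
			'length' => $length,
			'text'   => $text,
		);
	}

	// Apply the recorded edits in source order and return the new string.
	public function get_updated(): string {
		$edits = $this->edits;
		usort( $edits, static fn( $a, $b ) => $a['start'] <=> $b['start'] );

		$out    = '';
		$cursor = 0;
		foreach ( $edits as $edit ) {
			// Copy the untouched text before this edit, then the replacement.
			$out   .= substr( $this->source, $cursor, $edit['start'] - $cursor );
			$out   .= $edit['text'];
			$cursor = $edit['start'] + $edit['length'];
		}
		return $out . substr( $this->source, $cursor );
	}
}

$css    = 'a { color: red; }';
$buffer = new Sketch_Deferred_Edits( $css );
$buffer->replace( 11, 3, 'blue' ); // Bytes 11..13 hold "red".
$result = $buffer->get_updated();

echo $result, "\n"; // a { color: blue; }
```

Keeping the original buffer intact means token offsets stay valid while edits accumulate, which is one plausible reason to defer the rewrite to output time.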

### Token types

#### Security-critical (must be correct)

| Constant | Examples | Notes |
|---|---|---|
| `WP_CSS_Token_Processor::URL_TOKEN` | `url(foo.png)` | Protocol-filtered against `wp_allowed_protocols()` |
| `WP_CSS_Token_Processor::BAD_URL_TOKEN` | `url(foo bar)` | Malformed URL; stripped |
| `WP_CSS_Token_Processor::STRING_TOKEN` | `"hello"`, `'world'` | Quoted strings |
| `WP_CSS_Token_Processor::BAD_STRING_TOKEN` | Unterminated string | Stripped |
| `WP_CSS_Token_Processor::AT_KEYWORD_TOKEN` | `@media`, `@import` | At-rule allowlist enforced in `sanitize()` |
| `WP_CSS_Token_Processor::OPEN_CURLY_TOKEN` | `{` | Block depth tracking |
| `WP_CSS_Token_Processor::CLOSE_CURLY_TOKEN` | `}` | Block depth tracking |

#### Structurally important

| Constant | Examples |
|---|---|
| `WP_CSS_Token_Processor::IDENT_TOKEN` | `color`, `red`, `sans-serif` |
| `WP_CSS_Token_Processor::FUNCTION_TOKEN` | `calc(`, `var(`, `rgb(` |
| `WP_CSS_Token_Processor::DELIM_TOKEN` | `&`, `>`, `+`, `~`, `*` |
| `WP_CSS_Token_Processor::DIMENSION_TOKEN` | `16px`, `1.5rem`, `100vh` |
| `WP_CSS_Token_Processor::PERCENTAGE_TOKEN` | `50%` |
| `WP_CSS_Token_Processor::NUMBER_TOKEN` | `42`, `1.5` |
| `WP_CSS_Token_Processor::HASH_TOKEN` | `#ff0000`, `#my-id` |
| `WP_CSS_Token_Processor::WHITESPACE_TOKEN` | Preserved in output |
| `WP_CSS_Token_Processor::SEMICOLON_TOKEN` | `;` |
| `WP_CSS_Token_Processor::COLON_TOKEN` | `:` |
| `WP_CSS_Token_Processor::COMMA_TOKEN` | `,` |

#### Stripped unconditionally

| Constant | Reason |
|---|---|
| `WP_CSS_Token_Processor::CDO_TOKEN` | `<!--`: HTML comments have no place in CSS |
| `WP_CSS_Token_Processor::CDC_TOKEN` | `-->`: HTML comments have no place in CSS |
| Null bytes | Stripped in preprocessing, before tokenization |
| `</style` sequence | Injection guard: `sanitize()` returns `''`, `validate()` returns `WP_Error` |

#### Out of scope for v1 (documented gaps, treated as unknown and stripped)

- Unicode range tokens (`U+`)
- Surrogate pair edge cases beyond basic UTF-8

### API surface

#### Construction

```php
$processor = new WP_CSS_Token_Processor( string $css );
```

#### Low-level navigation

```php
$processor->next_token(): bool // Advance cursor. Returns false at EOF.
$processor->get_token_type(): string // Token type constant for current token.
$processor->get_token_value(): string // Raw value of current token.
$processor->get_block_depth(): int // Current { } nesting depth.
```
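A forward-only cursor with this shape of interface might be driven as in the sketch below. `Sketch_CSS_Cursor` is a hypothetical, heavily simplified stand-in: it handles ASCII input only, recognises just whitespace, identifiers, and a few single-character tokens, treats every other byte as a delimiter, and uses made-up type names rather than the real class constants. Incrementing the depth when `{` is read is also an assumption.

```php
<?php
/**
 * Hypothetical, heavily simplified cursor. Numbers, strings, URLs and
 * comments are not recognised and would be emitted byte by byte as delims.
 */
final class Sketch_CSS_Cursor {
	private const SINGLE_CHAR_TYPES = array(
		'{' => 'open-curly',
		'}' => 'close-curly',
		':' => 'colon',
		';' => 'semicolon',
		',' => 'comma',
	);

	private string $css;
	private int $pos      = 0;
	private int $depth    = 0;
	private string $type  = '';
	private string $value = '';

	public function __construct( string $css ) {
		$this->css = $css;
	}

	// Advance the cursor by one token. Returns false at end of input.
	public function next_token(): bool {
		if ( $this->pos >= strlen( $this->css ) ) {
			return false;
		}
		$rest = substr( $this->css, $this->pos );
		if ( preg_match( '/^[ \t\r\n\f]+/', $rest, $m ) ) {
			$this->type  = 'whitespace';
			$this->value = $m[0];
		} elseif ( preg_match( '/^-?[A-Za-z_][A-Za-z0-9_-]*/', $rest, $m ) ) {
			$this->type  = 'ident';
			$this->value = $m[0];
		} else {
			$this->type  = self::SINGLE_CHAR_TYPES[ $rest[0] ] ?? 'delim';
			$this->value = $rest[0];
		}
		// Track { } nesting; never let the depth go negative.
		if ( '{' === $this->value ) {
			$this->depth++;
		} elseif ( '}' === $this->value && $this->depth > 0 ) {
			$this->depth--;
		}
		$this->pos += strlen( $this->value );
		return true;
	}

	public function get_token_type(): string {
		return $this->type;
	}

	public function get_token_value(): string {
		return $this->value;
	}

	public function get_block_depth(): int {
		return $this->depth;
	}
}

$cursor    = new Sketch_CSS_Cursor( '& > p { color: red; }' );
$delims    = array();
$max_depth = 0;
while ( $cursor->next_token() ) {
	if ( 'delim' === $cursor->get_token_type() ) {
		$delims[] = $cursor->get_token_value();
	}
	$max_depth = max( $max_depth, $cursor->get_block_depth() );
}
// $delims now holds "&" and ">", untouched by any HTML encoding.
```

The point of the loop is that `&` and `>` arrive as ordinary delimiter tokens, so a consumer can pass them through verbatim.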

#### Low-level modification

```php
$processor->set_token_value( string $value ): bool // Replace current token's value.
$processor->remove_token(): bool // Remove current token from output.
```

#### High-level consumers (primary public API)

```php
$processor->sanitize(): string // Strip unsafe tokens/rules. Returns safe CSS string.
$processor->validate(): true|WP_Error // true if safe, WP_Error with code if not.
$processor->get_updated_css(): string // Reconstruct CSS after manual token modifications.
$processor->get_removed_tokens(): array // Log of what was stripped and why, after sanitize().
```

---

## Security Policy

### `sanitize()`: token-level rules

Applied during tokenization, before structural analysis:

| Condition | Action |
|---|---|
| `</style` anywhere in input | Return `''` immediately β€” do not continue |
| Null bytes | Strip in preprocessing |
| `bad-url-token`, `bad-string-token` | Strip token |
| `CDO-token`, `CDC-token` | Strip token |
| `url-token` with `javascript:` or `data:` | Strip token entirely |
| `url-token` with other disallowed protocol | Replace URL value with `''`, preserve `url()` wrapper |
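The two `url-token` rows in the table above could be expressed as a small decision function. This is a sketch under stated assumptions: the allowlist is passed in as a plain array instead of being read from `wp_allowed_protocols()`, URLs without a scheme are treated as relative and kept (the table does not say), and the helper name `sketch_url_token_action()` and its return values are hypothetical. It does not handle CSS escapes or control characters embedded in the scheme, which a real implementation would need to consider.

```php
<?php
/**
 * Hypothetical helper: decide what to do with the value of a url-token.
 *
 * Returns 'keep', 'strip' (remove the whole token) or 'empty'
 * (keep the url() wrapper but blank the value).
 */
function sketch_url_token_action( string $url, array $allowed_protocols ): string {
	$url = trim( $url );

	// No scheme found: assumed to be a relative URL and kept.
	if ( ! preg_match( '/^([a-z][a-z0-9+.\-]*):/i', $url, $m ) ) {
		return 'keep';
	}

	$scheme = strtolower( $m[1] );

	// javascript: and data: URLs cause the token to be stripped entirely.
	if ( 'javascript' === $scheme || 'data' === $scheme ) {
		return 'strip';
	}

	// Any other scheme must be on the allowlist, otherwise the value is blanked.
	return in_array( $scheme, $allowed_protocols, true ) ? 'keep' : 'empty';
}

$allowed = array( 'http', 'https' ); // Illustrative allowlist only.

$relative   = sketch_url_token_action( 'foo.png', $allowed );
$https      = sketch_url_token_action( 'https://example.com/a.png', $allowed );
$javascript = sketch_url_token_action( 'JavaScript:alert(1)', $allowed );
$data       = sketch_url_token_action( 'data:image/png;base64,AAAA', $allowed );
$ftp        = sketch_url_token_action( 'ftp://example.com/x', $allowed );
```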

### `sanitize()`: rule-level rules

Applied during structural traversal, after tokenization:

**At-rule allowlist:**

```
Allowed: @media, @supports, @keyframes, @layer, @container, @font-face
Blocked: @import, @charset, @namespace
Unknown: stripped (safety-first: gaps reject, not pass-through)
```

Strip granularity: declaration fails → drop declaration; rule fails → drop rule; rest of CSS preserved.
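The allowlist above might be checked with a lookup like the following sketch. The function name `sketch_classify_at_rule()` and its return values are hypothetical, and matching the keyword case-insensitively is an assumption based on at-keywords being ASCII case-insensitive in CSS.

```php
<?php
/**
 * Hypothetical helper: classify an at-keyword against the lists above.
 * Both 'blocked' and 'unknown' results would lead to the rule being stripped.
 */
function sketch_classify_at_rule( string $at_keyword ): string {
	$allowed = array( 'media', 'supports', 'keyframes', 'layer', 'container', 'font-face' );
	$blocked = array( 'import', 'charset', 'namespace' );

	// Normalise "@Media" to "media" before comparing.
	$name = strtolower( ltrim( $at_keyword, '@' ) );

	if ( in_array( $name, $allowed, true ) ) {
		return 'allowed';
	}
	if ( in_array( $name, $blocked, true ) ) {
		return 'blocked';
	}
	return 'unknown';
}

$media  = sketch_classify_at_rule( '@media' );
$import = sketch_classify_at_rule( '@IMPORT' );
$page   = sketch_classify_at_rule( '@page' );
$vendor = sketch_classify_at_rule( '@-webkit-keyframes' );
```

Note that a vendor-prefixed keyword such as `@-webkit-keyframes` is not on the allowlist, so under this reading it falls into the unknown bucket and is stripped.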

### `validate()` rules

Returns `WP_Error` if any of the following are present:

| Condition | Error code |
|---|---|
| `</style` sequence | `css_injection` |
| `bad-url-token` or `bad-string-token` | `css_malformed_token` |
| Disallowed `url()` protocol | `css_unsafe_url` |
| Blocked or unknown at-rule | `css_disallowed_at_rule` |
| Null bytes | `css_null_byte` |
| `CDO-token` / `CDC-token` | `css_html_comment` |

`validate()` passing is a guarantee that `sanitize()` is a no-op on the same input.

### What the security policy explicitly does NOT do

- Does not validate property names or values: that is authoring intent, not a security concern
- Does not restrict CSS nesting depth
- Does not filter `var()` or custom properties: they cannot execute code
- Does not block `expression()`: IE-era only, not worth the complexity

### Idempotency guarantee

`sanitize()` must be idempotent:

```
sanitize( sanitize( $css ) ) === sanitize( $css )
```

This is a hard requirement enforced by the test suite. It directly addresses the compounding corruption bug in PR #11104.
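A test helper for this property could look like the sketch below. It takes any sanitizer as a callable, so it does not depend on the proposed class. The helper name `sketch_find_non_idempotent()` is hypothetical, and both sanitizers are stand-ins chosen for illustration: one only strips null bytes, the other uses `htmlspecialchars()` to mimic HTML entity encoding.

```php
<?php
/**
 * Hypothetical helper: return the fixtures for which a second pass of
 * $sanitize changes the result of the first pass.
 */
function sketch_find_non_idempotent( callable $sanitize, array $fixtures ): array {
	$failures = array();
	foreach ( $fixtures as $css ) {
		$once = $sanitize( $css );
		if ( $sanitize( $once ) !== $once ) {
			$failures[] = $css;
		}
	}
	return $failures;
}

$fixtures = array(
	'& > p { margin: 0; }',
	'color: blue; & p { color: red; }',
);

// Stand-in sanitizer: only strips null bytes, so a second pass changes nothing.
$strip_nulls = static fn( string $css ): string => str_replace( "\0", '', $css );

// Entity encoding re-encodes its own output ("&amp;" becomes "&amp;amp;"),
// so it is not idempotent on input containing "&".
$entity_encode = static fn( string $css ): string => htmlspecialchars( $css, ENT_NOQUOTES );

$ok  = sketch_find_non_idempotent( $strip_nulls, $fixtures );
$bad = sketch_find_non_idempotent( $entity_encode, $fixtures );
```

Here `$ok` is empty, while `$bad` contains both fixtures because each includes an `&` that gets re-encoded on every pass.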

---

## Documentation

### Inline PHPDoc

- Every public method: `@since`, `@param`, `@return`, usage example
- Class docblock: purpose, what it is not, spec reference, usage examples, known gaps
- Security decisions commented with *why*, not just *what*

### README.md

Located at `src/wp-includes/css-api/README.md`. Covers:

- Purpose and scope
- Quick usage examples for `sanitize()` and `validate()`
- Token type reference
- Security policy summary
- Known gaps and future work

---

## Testing

### Test files

```
tests/phpunit/tests/css-api/
├── WpCssTokenProcessorTest.php   - tokenizer unit tests
├── WpCssTokenSanitizeTest.php    - sanitize() tests
└── WpCssTokenValidateTest.php    - validate() tests
```

### Test categories

#### Tokenizer unit tests (`WpCssTokenProcessorTest.php`)

- Each token type in isolation: correct `get_token_type()` and `get_token_value()`
- Token sequences: declaration, qualified rule, nested rule
- Block depth tracking via `get_block_depth()`
- Edge cases: empty input, whitespace-only, single character
- Manual modification: `set_token_value()`, `remove_token()`, `get_updated_css()`

#### Sanitize tests (`WpCssTokenSanitizeTest.php`)

- CSS nesting selectors (`&`, `& > p`, `& + span`) survive unchanged
- Child combinator (`>`) survives unchanged
- Valid at-rules (`@media`, `@supports`, `@keyframes`) survive unchanged
- Blocked at-rule (`@import`) is stripped entirely
- Unknown at-rule is stripped
- `url()` with allowed protocol survives
- `url()` with `javascript:` is stripped entirely
- `url()` with `data:` is stripped entirely
- `bad-url-token` is stripped
- `bad-string-token` is stripped
- `</style` input returns `''`
- Null bytes are stripped
- `CDO` / `CDC` tokens are stripped
- `get_removed_tokens()` is populated after stripping
- `get_removed_tokens()` is empty when nothing is stripped
- **Idempotency**: `sanitize(sanitize($css)) === sanitize($css)` over a broad fixture set
- **Regression fixtures from PR #11104**:
  - `color: blue; & p { color: red; }` survives unchanged
  - `& > p { margin: 0; }` survives unchanged
  - Repeated saves do not compound corruption

#### Validate tests (`WpCssTokenValidateTest.php`)

- Valid CSS returns `true`
- Each blocked condition returns `WP_Error` with the correct error code
- `validate()` passing guarantees `sanitize()` is a no-op (tested over fixture set)

---

## Open questions (deferred)

- Should `get_removed_tokens()` be structured (array of `['token' => ..., 'reason' => ...]`) or flat? TBD during implementation.
- Should the at-rule allowlist be filterable via a WordPress filter hook (like `safe_style_css`)? Likely yes, deferred to implementation.
- Exact `@since` version tag: placeholder `X.X.0` during development.