CSS API: Add WP_CSS_Token_Processor — streaming CSS tokenizer with sanitize() and validate()#11208
ramonjd wants to merge 1 commit into WordPress:trunk
Introduces `WP_CSS_Token_Processor`, a new class in `src/wp-includes/css-api/` modelled after `WP_HTML_Tag_Processor`. It tokenizes a CSS string into a typed token stream and exposes two high-level consumers:

- `sanitize(): string`: strips unsafe tokens/rules (injection guard, CDO/CDC, bad tokens, disallowed URL schemes, non-allowlisted at-rules) and returns a safe CSS string. Idempotent: `sanitize(sanitize($css)) === sanitize($css)`.
- `validate(): true|WP_Error`: returns true if the CSS is safe, or a `WP_Error` with a specific error code (`css_injection`, `css_html_comment`, `css_malformed_token`, `css_unsafe_url`, `css_disallowed_at_rule`) on the first violation found.

The primary motivation is fixing the compounding corruption bug (PR WordPress#11104) where `wp_kses()`, an HTML sanitizer, was applied to CSS, mangling the `&` and `>` characters used in CSS nesting selectors on each save for users without `unfiltered_html`.

Security policy:

- `</style` anywhere → `sanitize()` returns `''`; `validate()` returns a `css_injection` error
- `url()` with `javascript:`, `data:`, or a scheme not in `wp_allowed_protocols()` → stripped
- `@import`, `@charset`, `@namespace`, unknown at-rules → stripped (safety-first)
- bad-url-token, bad-string-token → stripped
- CDO/CDC (`<!--` / `-->`) → stripped
- Null bytes → stripped in constructor

Allowed at-rules: `@media`, `@supports`, `@keyframes`, `@-webkit-keyframes`, `@layer`, `@container`, `@font-face`.

Also adds low-level navigation (`next_token`, `get_token_type`, `get_token_value`, `get_block_depth`) and non-destructive modification (`remove_token`, `set_token_value`, `get_updated_css`) APIs, plus `get_removed_tokens()` for `sanitize()` introspection.

Integration with `filter_block_kses_value()` in blocks.php is a follow-on PR.

Includes:

- src/wp-includes/css-api/class-wp-css-token-processor.php (~1,250 lines)
- src/wp-includes/css-api/README.md
- tests/phpunit/tests/css-api/WpCssTokenProcessorTest.php (67 tests)
- tests/phpunit/tests/css-api/WpCssTokenSanitizeTest.php (40 tests)
- tests/phpunit/tests/css-api/WpCssTokenValidateTest.php (14 tests + data provider)
- docs/plans/2026-03-06-wp-css-token-processor-design.md
- docs/plans/2026-03-06-wp-css-token-processor.md

Fixes #64771

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
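To make the policy described above concrete, here is a minimal, language-neutral sketch in Python. It is not the PR's PHP implementation: `first_violation` and `AT_KEYWORD` are hypothetical names, the regex scan stands in for a real tokenizer, and only three of the listed error codes are modelled. Unlike the described `validate()`, it checks categories in a fixed order rather than by position in the input.

```python
import re
from typing import Optional

# At-rules the description lists as allowed (names without the leading "@").
ALLOWED_AT_RULES = {
    "media", "supports", "keyframes", "-webkit-keyframes",
    "layer", "container", "font-face",
}

# Hypothetical stand-in for tokenization. A regex cannot tell a real
# at-keyword from an "@" inside a string or comment; a tokenizer can.
AT_KEYWORD = re.compile(r"@(-?[a-zA-Z_][\w-]*)")


def first_violation(css: str) -> Optional[str]:
    """Return one of the description's error codes, or None if none apply."""
    # Injection guard: "</style" anywhere, compared case-insensitively.
    if "</style" in css.lower():
        return "css_injection"
    # CDO/CDC markers.
    if "<!--" in css or "-->" in css:
        return "css_html_comment"
    # Any at-keyword outside the allowlist is rejected.
    for match in AT_KEYWORD.finditer(css):
        if match.group(1).lower() not in ALLOWED_AT_RULES:
            return "css_disallowed_at_rule"
    return None
```

Under these assumptions, `first_violation("@import url(x.css);")` reports `css_disallowed_at_rule`, while a plain `@media` block passes. The hard-coded allowlist also shows how specific the policy is: every entry is a choice made for one application.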
Force-pushed from 0aaf313 to f07605f.
|
@ramonjd this is an interesting claim that I’m having trouble understanding
what definition are you using for
out of curiosity, what are you trying to communicate, or what are you expecting us to interpret by showing approximate line count and test counts, especially since these estimates are wrong and since the actual line counts are trivial to see? is this an aspirational goal? to remove 150 lines from the class?
What did you have in mind when you created this method? did you have example code you can share that motivated its design?
I think these would be good to discuss. We have tried very hard to avoid this confusing nomenclature because of how ambiguous it is; it gives the impression that there is some kind of universal algorithm for sanitizing or validating CSS, but as this code makes clear, those choices you have made are extremely specific and non-universal. Did you by any chance review the prior art in the PHP Toolkit? This appears to be an extremely ambitious design proposal all at once. I applaud your enthusiasm. I must admit, I am confused by many elements of the code, especially some parsing choices.
I’m going to stop here because I think we have some major concerns with this code. It would be really useful to hear some of the rationale for some of the biggest choices you’ve made: what went into the design motivation for the choice of methods and interface; how you expect to see people use this; etc… The parsing defects are just that, they can be solved, but the design choices we will live with forever. |
|
Hi @dmsnell Thanks for spending time digesting this, I apologize that you had to. I didn't intend for anyone to waste time digesting the slop, but I value your input and will reflect on your questions. I thought I had contained this PR to my fork, but I obviously made a mistake. For transparency, this is 99% an agent-generated experiment so that I could learn and ruminate about the issues involved. I understand you folks have already started thinking about this, and have a great deal of experience, so I should have flagged it as not for consumption, and made it clear that I'm not proposing anything. In the meantime, I'll update the description (and close) to make it plain that folks needn't treat it as a serious proposal. Without defending any of the code here, and given it represents a crash course for me, I can speak to one or two of your concerns as there were some human design choices, albeit naive ones:
Good point, the names imply universality. I was seduced by the comfort of the classname to categorize the functionality.
The design intent was that, after
Thanks for pointing to the PHP Toolkit work. I wasn't aware of it. 🙇🏻 So that's a pure tokenizer if I'm not mistaken with no security opinions baked in, and it occurs to me that the concerns should be distinct, i.e., any sanitization/allow-list policy is a separate, application-specific class. Given the prior art, what is your instinct - does it make more sense to you to pursue a path that builds on CSSProcessor from php-toolkit? (Asking purely from an academic/theoretical perspective). Thanks! |
It would be interesting to see calling code that does this. I’m all for the idea of communicating, but I also wonder how many developers would go through the extensive work to make this happen across the full UX. For instance, where do we report those notices if we’re calling this from inside Nothing in the HTML API family currently makes surprise edits, and of the functions built on top of them, I don’t think that level of granularity is being communicated.
That’s true, but it’s also how all of the legacy interfaces work. One of the projects I would love to do some time is build an LSP implementation for HTML using the HTML API, because it can report all the locations within a document where it may parse differently than it appears. A similar thing could happen with CSS, and that might be a neat tool while editing it. Editing seems to be the place where errors are most useful, because it gives knowledge to the editor that what they believe they are doing doesn’t match what will actually happen. One trick we have to resolve when thinking of this as a batch operation is the recursive nature of it: how far do we parse before giving up? We might remove a token which changes the rest of the document, so we can stop at the first token we don’t like, but if we were to run the document through again, it might be different. We could report all removed tokens, but the set of removed tokens might be distinct from the same set if run through a second time. For further background, it may be helpful to read up on “incremental parsers,” the domain of parsers used in IDEs which are meant to continue parsing in the presence of errors. Despite the claim of idempotency, unless running
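The second-pass concern can be illustrated with a deliberately naive sketch. This is a toy single-pass string removal in Python, not the PR's `sanitize()`; `strip_once` and `strip_to_fixed_point` are hypothetical names. It shows how removing one match can create a new one, so the second run removes something the first run never saw.

```python
def strip_once(css: str, needle: str = "</style") -> str:
    """Remove every non-overlapping occurrence of needle in a single pass."""
    return css.replace(needle, "")


def strip_to_fixed_point(css: str, needle: str = "</style") -> tuple:
    """Repeat the single pass until the output stops changing.

    Returns the stable string and the number of passes that removed something.
    """
    passes = 0
    while needle in css:
        css = strip_once(css, needle)
        passes += 1
    return css, passes


# The inner "</style" is removed, and the pieces around it join into a new one.
nested = "</sty</stylele>"
once = strip_once(nested)   # "</style>"
twice = strip_once(once)    # ">"
```

Here `once != twice`, so the single pass is not idempotent, and the removals reported by each run differ. Iterating to a fixed point restores idempotency for this toy case, but it does not answer how removals should be reported across passes.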
Based on all the lies, fabrications, security exploits, broken parsing code, self-contradictions, and grandstanding, I had a few suspicions; it stood out to me because it was far far far below the level of quality I have come to see you produce. Sorry for reviewing before it was ready.
Yes, this has been a point of design discussion with the HTML API too, but I suspect it will always be too hard to apply arbitrary and non-standard sanitization rules in a system that’s also trying to properly parse and understand the code it’s reading. A good example of this is the change that went into mapping JavaScript And even if it weren’t for any other reason, it becomes practically impossible to build a decent test suite because we can no longer test the spec-compliance of the code since we are changing how WordPress parses CSS; and we can’t fully test our ad-hoc rules because we don’t have a grounding standardized base against which to truly say, “this is the parsed CSS that we analyzed” (because what we think we are analyzing might not be what a browser would analyze, due to giving up spec-compliance). For what it’s worth, the HTML API already provides a truly safe way to include CSS in a
These things are best ripening in their proper season. The work in the PHPToolkit is good and usable and serving that project well. However, we had independent implementations of the earliest form of the HTML API before any code was in Core. That process of building multiple cleanroom implementations led us to catch things and produce a much better result than we could have had with a focus on a single idea or direction, and I confidently believe it has paid multiple dividends on the seeming duplicate work early on. So while I have personally tried to limit my exposure to the I think that a few of us have been paying attention to a lot of work in the 7.0 release, a lot of bugs in Core and other legacy code, and trying to figure out what WordPress’s needs are for CSS handling. What is code already doing, and where is the existing functionality failing those developers? The best place to start is always, in my opinion, to write out the code you wish you could write and then see how that fits with the facts that we know about the system. |
|
Thanks again for your thoughtful reply, Dennis.
I have a long-standing desire to audit and refactor some of the theme json global styles CSS handling. That job is not entirely related to pure CSS processing, yet you expressed the situation pretty well! Theme JSON is a bit of a piñata, and not in the fun sense. And it seems like every week we're adding CSS properties that have been stable and safe for decades to
Good question, and I don't quite have an answer. The origin is a half-baked idea that I need to flesh out, but it involves a WP ability that shares rules about what CSS is supported by the theme and the system. I expect a lot of agent-driven CSS is going to be shoved down the WP pie hole. Anyway, I'll be keen to follow the ongoing conversations about a CSS API and related issues in other fora. Once again, thanks for spending time helping. 🙇🏻 |
🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧
Please ignore this PR, it's designed for learning purposes.
🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧