Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 53 additions & 3 deletions src/wp-includes/kses.php
Original file line number Diff line number Diff line change
Expand Up @@ -963,6 +963,7 @@ function wp_kses_version() {
* It also matches stray `>` characters.
*
* @since 1.0.0
* @since 6.6.0 Recognize additional forms of invalid HTML which convert into comments.
*
* @global array[]|string $pass_allowed_html An array of allowed HTML elements and attributes,
* or a context name such as 'post'.
Expand All @@ -981,7 +982,18 @@ function wp_kses_split( $content, $allowed_html, $allowed_protocols ) {
$pass_allowed_html = $allowed_html;
$pass_allowed_protocols = $allowed_protocols;

return preg_replace_callback( '%(<!--.*?(-->|$))|(<[^>]*(>|$)|>)%', '_wp_kses_split_callback', $content );
$token_pattern = <<<REGEX
~
( # Detect comments of various flavors before attempting to find tags.
(<!--.*?(-->|$)) # - Normative HTML comments.
|
</[^a-zA-Z][^>]*> # - Closing tags with invalid tag names.
Comment on lines +989 to +990
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are essentially the only change to the regex pattern (an addition).

)
|
(<[^>]*(>|$)|>) # Tag-like spans of text.
~x
REGEX;
return preg_replace_callback( $token_pattern, '_wp_kses_split_callback', $content );
}

/**
Expand Down Expand Up @@ -1069,23 +1081,61 @@ function _wp_kses_split_callback( $matches ) {
* @access private
* @ignore
* @since 1.0.0
* @since 6.6.0 Recognize additional forms of invalid HTML which convert into comments.
*
* @param string $content Content to filter.
* @param array[]|string $allowed_html An array of allowed HTML elements and attributes,
* or a context name such as 'post'. See wp_kses_allowed_html()
* for the list of accepted context names.
* @param string[] $allowed_protocols Array of allowed URL protocols.
*
* @return string Fixed HTML element
*/
function wp_kses_split2( $content, $allowed_html, $allowed_protocols ) {
$content = wp_kses_stripslashes( $content );

// It matched a ">" character.
/*
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should wp_kses_split2 also get an updated @since?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great idea. Thanks for catching it!

* The regex pattern used to split HTML into chunks attempts
* to split on HTML token boundaries. This function should
* thus receive chunks that _either_ start with meaningful
* syntax tokens, like a tag `<div>` or a comment `<!-- ... -->`.
*
* If the first character of the `$content` chunk _isn't_ one
* of these syntax elements, which always starts with `<`, then
* the match had to be for the final alternation of `>`. In such
* case, it's probably standing on its own and could be encoded
* with a character reference to remove ambiguity.
*
* In other words, if this chunk isn't from a match of a syntax
* token, it's just a plaintext greater-than (`>`) sign.
*/
if ( ! str_starts_with( $content, '<' ) ) {
return '&gt;';
}

// Allow HTML comments.
/*
* When a closing tag appears with a name that isn't a valid tag name,
* it must be interpreted as an HTML comment. It extends until the
* first `>` character after the initial opening `</`.
*
* Preserve these comments and do not treat them like tags.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the > character ever not terminate the string? In other words, what would change if you removed the $ from the regexp?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I know now. Then we couldn’t assume bkw that the content starts at the at index two and ends at index minus one

if ( 1 === preg_match( '~^</[^a-zA-Z][^>]*>$~', $content ) ) {
Copy link
Copy Markdown
Contributor

@adamziel adamziel Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to check for any invalid comment closers? I’m on a bus and can't remember exactly, but there were cases like triple dash or cdata-lookalike closer. Or does none of that matter with these tag-closer-like comments?

$content = substr( $content, 2, -1 );
$transformed = null;

while ( $transformed !== $content ) {
$transformed = wp_kses( $content, $allowed_html, $allowed_protocols );
$content = $transformed;
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is copied from the existing comment behavior below

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use an inline comment?


return "</{$transformed}>";
}

/*
* Normative HTML comments should be handled separately as their
* parsing rules differ from those for tags and text nodes.
*/
if ( str_starts_with( $content, '<!--' ) ) {
$content = str_replace( array( '<!--', '-->' ), '', $content );

Expand Down
29 changes: 29 additions & 0 deletions tests/phpunit/tests/kses.php
Original file line number Diff line number Diff line change
Expand Up @@ -1931,6 +1931,35 @@ public function filter_wp_kses_object_added_in_html_filter( $tags, $context ) {
return $tags;
}

/**
* Ensures that `wp_kses()` preserves various kinds of HTML comments, both valid and invalid.
*
* @ticket 61009
*
* @param string $html_comment HTML containing a comment; must not be a valid comment
* but must be syntax which a browser interprets as a comment.
* @param string $expected_output How `wp_kses()` ought to transform the comment.
*/
public function wp_kses_preserves_html_comments( $html_comment, $expected_output ) {
$this->assertSame(
$expected_output,
wp_kses( $html_comment, array() ),
'Failed to properly preserve HTML comment.'
);
}

/**
* Data provider.
*
* @return array[].
*/
public static function data_html_containing_various_kinds_of_html_comments() {
return array(
'Normative HTML comment' => array( 'before<!-- this is a comment -->after', 'before<!-- this is a comment -->after' ),
'Closing tag with invalid tag name' => array( 'before<//not a tag>after', 'before<//not a tag>after' ),
);
}

/**
* Test that attributes with a list of allowed values are filtered correctly.
*
Expand Down