<?php
namespace MediaWiki\Parser;
use Wikimedia\RemexHtml\Tokenizer\Attributes;
use Wikimedia\RemexHtml\Tokenizer\PlainAttributes;
use Wikimedia\RemexHtml\Tokenizer\RelayTokenHandler;
use Wikimedia\RemexHtml\Tokenizer\TokenHandler;
/**
* Helper class for Sanitizer::removeSomeTags().
* @internal
*/
class RemexRemoveTagHandler extends RelayTokenHandler {
/**
* @var string The original HTML source string (used for fallback text
* when rejecting an HTML tag).
*/
private $source;
/**
* @var array<string,true> Set of HTML tags which can be self-closed.
*/
private $htmlsingle;
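/**
* @var array<string,true> Set of HTML tags which must be self-closed
* (they take no end tag). Self-closed tags which are in $htmlsingle but
* not in $htmlsingleonly are emitted as an empty element, that is, a
* start tag immediately followed by an end tag.
*/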
private $htmlsingleonly;
/**
* @var array<string,true> Set of allowed HTML open/close tags.
*/
private $htmlelements;
/**
* @var ?callable(Attributes,mixed...):Attributes Callback to mutate or
* sanitize attributes.
*/
private $attrCallback;
/**
* @var array Extra arguments passed to $attrCallback after the
* Attributes object. Never null: the constructor defaults it to [].
*/
private $callbackArgs;
/**
* @param TokenHandler $nextHandler Handler to relay accepted tokens.
* @param string $source Input source string.
* @param array $tagData Information about allowed/rejected tags.
* @param ?callable $attrCallback Attribute handler callback.
* The full signature is ?callable(Attributes,mixed...):Attributes
* @param ?array $callbackArgs Optional arguments to attribute handler.
*/
public function __construct(
TokenHandler $nextHandler,
string $source,
array $tagData,
?callable $attrCallback,
?array $callbackArgs
) {
parent::__construct( $nextHandler );
$this->source = $source;
$this->htmlsingle = $tagData['htmlsingle'];
$this->htmlsingleonly = $tagData['htmlsingleonly'];
$this->htmlelements = $tagData['htmlelements'];
$this->attrCallback = $attrCallback;
$this->callbackArgs = $callbackArgs ?? [];
}
/**
* @inheritDoc
*/
public function comment( $text, $sourceStart, $sourceLength ) {
// Comments are dropped: they are never relayed to the next handler.
}
/**
* Checks whether a tag may appear in content, given its name and its
* attributes.
* This does NOT validate the attributes themselves, nor does it check
* the tag against the set of allowed tags. It only handles the special
* case of tags which are allowed in content ONLY when specific
* attributes are set: <meta> requires itemprop and content, and <link>
* requires itemprop and href.
*
* @param string $element Lowercased tag name.
* @param Attributes $attrs Attributes of the tag.
* @return bool True if the tag may be relayed, false if it must be
* rejected.
*
* @see Sanitizer::validateTag()
*/
private static function validateTag( string $element, Attributes $attrs ): bool {
if ( $element === 'meta' || $element === 'link' ) {
$params = $attrs->getValues();
if ( !isset( $params['itemprop'] ) ) {
// <meta> and <link> must have an itemprop=""; otherwise they
// are neither valid nor safe in content.
return false;
}
if ( $element === 'meta' && !isset( $params['content'] ) ) {
// <meta> must have a content="" for the itemprop
return false;
}
if ( $element === 'link' && !isset( $params['href'] ) ) {
// <link> must have an associated href=""
return false;
}
}
return true;
}
/**
* @inheritDoc
*/
public function startTag( $name, Attributes $attrs, $selfClose, $sourceStart, $sourceLength ) {
// Handle a start tag from the tokenizer: either relay it to the
// next stage, or re-emit it as raw text.
$badtag = false;
$t = strtolower( $name );
if ( isset( $this->htmlelements[$t] ) ) {
if ( $this->attrCallback ) {
$attrs = ( $this->attrCallback )( $attrs, ...$this->callbackArgs );
}
if ( $selfClose && !( isset( $this->htmlsingle[$t] ) || isset( $this->htmlsingleonly[$t] ) ) ) {
// Remove the self-closing slash, to be consistent with
// HTML5 semantics. T134423
$selfClose = false;
}
if ( !self::validateTag( $t, $attrs ) ) {
$badtag = true;
}
$fixedAttrs = Sanitizer::validateTagAttributes( $attrs->getValues(), $t );
$attrs = new PlainAttributes( $fixedAttrs );
if ( !$badtag ) {
if ( $selfClose && !isset( $this->htmlsingleonly[$t] ) ) {
// Interpret self-closing tags as empty tags even when
// HTML5 would interpret them as start tags. Such input
// is commonly seen on Wikimedia wikis with this intention.
$this->nextHandler->startTag( $name, $attrs, false, $sourceStart, $sourceLength );
$this->nextHandler->endTag( $name, $sourceStart + $sourceLength, 0 );
} else {
$this->nextHandler->startTag( $name, $attrs, $selfClose, $sourceStart, $sourceLength );
}
return;
}
}
// Emit this as a text node instead.
$this->nextHandler->characters( $this->source, $sourceStart, $sourceLength, $sourceStart, $sourceLength );
}
/**
* @inheritDoc
*/
public function endTag( $name, $sourceStart, $sourceLength ) {
// Handle an end tag from the tokenizer: either relay it to the
// next stage, or re-emit it as raw text.
$t = strtolower( $name );
if ( isset( $this->htmlelements[$t] ) ) {
// This is a good tag, relay it.
$this->nextHandler->endTag( $name, $sourceStart, $sourceLength );
} else {
// Emit this as a text node instead.
$this->nextHandler->characters( $this->source, $sourceStart, $sourceLength, $sourceStart, $sourceLength );
}
}
}