Member

@sirreal sirreal commented Dec 15, 2025

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.

  • JSON and JavaScript script tags may be detected according to the HTML standard.
  • Script tag contents are escaped only when <script or </script (case-insensitive) is found.
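The detection step in the second bullet can be illustrated with a short JavaScript sketch. `mayNeedEscaping` is a hypothetical name used only for illustration; the PR itself is PHP and its code may be structured differently.

```javascript
// Hypothetical helper: reports whether script contents include the text
// "<script" or "</script" in any letter case.
function mayNeedEscaping(contents) {
  return /<\/?script/i.test(contents);
}

console.log(mayNeedEscaping('console.log("hello");'));        // false
console.log(mayNeedEscaping('document.write("</SCRIPT>");')); // true
console.log(mayNeedEscaping('if (a < b) { run(); }'));        // false
```

Contents that do not match are left untouched, so the common case pays only for the scan.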

In JSON

< is replaced with \u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.

This is proposed as a simple character replacement with strtr. This should be highly performant. A less invasive replacement could replace < only in <script or </script, where it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require a regular expression with case-insensitive matching (see the JavaScript example).
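As a rough illustration of the full replacement, here is a JavaScript sketch. `escapeJsonForScript` is a hypothetical name; the PR does this in PHP with strtr. It relies on the fact that < can only occur inside JSON strings, where \u003C is an equivalent spelling.

```javascript
// Hypothetical helper: replace every "<" in serialized JSON with its
// Unicode escape sequence.
function escapeJsonForScript(json) {
  return json.replace(/</g, '\\u003C');
}

const data = { html: '</script><script>alert(1)</script>', math: '1 < 2' };
const escaped = escapeJsonForScript(JSON.stringify(data));

console.log(escaped.includes('<'));                  // false
console.log(JSON.parse(escaped).html === data.html); // true
console.log(JSON.parse(escaped).math === data.math); // true
```

The parsed value is unchanged; only the serialized text differs.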

In JavaScript

In <script and </script (when followed by a tag-terminating character: a space, \t, \n, \r, \f, /, or >), the s is replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).
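A minimal sketch of this replacement in JavaScript follows. `escapeJavaScriptForScript` is a hypothetical name, and preserving the letter case of the replaced character (\u0073 for s, \u0053 for S) is an assumption of this sketch, not something confirmed by the PR.

```javascript
// Hypothetical helper: escape the "s" of "<script" / "</script" when the tag
// name is followed by a tag-terminating character.
function escapeJavaScriptForScript(source) {
  return source.replace(
    /(<\/?)(s)(?=cript[ \t\n\r\f\/>])/gi,
    // Assumption: keep the original letter case in the escape sequence.
    (match, prefix, s) => prefix + (s === 's' ? '\\u0073' : '\\u0053')
  );
}

console.log(escapeJavaScriptForScript('var a = "</script>";'));
// var a = "</\u0073cript>";
console.log(escapeJavaScriptForScript("var b = '<SCRIPT src=x>';"));
// var b = '<\u0053CRIPT src=x>';
console.log(escapeJavaScriptForScript('var c = "<scripty>";'));
// var c = "<scripty>";
```

The last call shows that text which merely starts with the tag name, without a terminating character after it, is left alone.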

From the ticket:

The HTML API prevents setting SCRIPT tag contents that could modify the tree, either by closing the SCRIPT element prematurely or by preventing the SCRIPT element from closing at the expected close tag.

This is handled by rejecting any script tag contents that are potentially dangerous, which is safe. There are some improvements that could be made.

If the contents are found to be unsafe and the type of the script tag is JSON or JavaScript (this is well specified in the HTML standard), it should be possible to apply a syntactic transformation to the contents in such a way that the script contents become safe without semantically altering the script.

If the HTML API can safely and automatically escape the majority of SCRIPT tag contents, it can then be used for SCRIPT tag creation and has the potential to eliminate the class of problem from #40737, #62797, and #63851. It also has the potential to address part of #51159 where SCRIPT tag escaping becomes less of an issue.

JSON

In JSON SCRIPT tags, the transformation is a simple replacement of < with its Unicode escape sequence \u003C. This can be applied to the entire contents of the script or specifically in case-insensitive matches for <script and </script.
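The narrower variant, which touches only the < that begins <script or </script, could look roughly like this in JavaScript. `escapeJsonScriptTagsOnly` is a hypothetical name, and the regular expression is one possible way to express the case-insensitive match.

```javascript
// Hypothetical helper: escape only a "<" that starts "<script" or "</script"
// (in any letter case); every other "<" is preserved.
function escapeJsonScriptTagsOnly(json) {
  return json.replace(/<(?=\/?script)/gi, '\\u003C');
}

const payload = { math: '1 < 2', html: '</Script>' };
const output = escapeJsonScriptTagsOnly(JSON.stringify(payload));

console.log(output);
// {"math":"1 < 2","html":"\u003C/Script>"}
console.log(JSON.parse(output).html === payload.html); // true
```

More of the original JSON text survives, at the cost of a regular expression match instead of a plain character replacement.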

JavaScript

JavaScript SCRIPT tags are more difficult because the language has vastly more syntax. Fortunately, there is prior art described in this 2022 blog post (external) from React team member Sophie Alpert. It's the same JavaScript SCRIPT tag contents escaping strategy that React continues to employ today. In summary, the problematic text <script and </script syntactically appears in places where Unicode escape sequences can be used in the script (Strings, Identifiers, and RegExp literals). React takes the approach of replacing the s character, resulting in <\u0073cript or </\u0073cript, which is completely safe in a SCRIPT tag.

There are a few notable exceptions where the transformed JavaScript has observably different runtime behavior. These are the only examples I'm aware of. They're more esoteric parts of the language, and the low likelihood of them being used in inline JavaScript together with the problematic text sequences makes this seem an acceptable tradeoff to me for enabling cheap, automatic JavaScript escaping.
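To make the three contexts concrete, the sketch below evaluates plain and escaped source text and compares the results. `evaluate` is a hypothetical helper built on `new Function`, used here only so the escaped forms can be written as source strings.

```javascript
// Hypothetical helper: evaluate a snippet of JavaScript source text.
function evaluate(body, input) {
  return new Function('input', body)(input);
}

// String literal: the escape is processed, so both values are '<script>'.
const strings =
  evaluate("return '<script>';") === evaluate("return '<\\u0073cript>';");

// Identifier: \u0073cript names the same binding as script.
const setup = 'const a = 3, script = 5, b = 0; ';
const identifiers =
  evaluate(setup + 'return a<script>b;') ===
  evaluate(setup + 'return a<\\u0073cript>b;');

// RegExp literal: both patterns match the same input.
const regexps =
  evaluate('return /<script>/.test(input);', '<script>') ===
  evaluate('return /<\\u0073cript>/.test(input);', '<script>');

console.log(strings, identifiers, regexps); // true true true
```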

String.raw does not process escape sequences.

'<script>' === '<\u0073cript>'; // true
String.raw`<script>` === String.raw`<\u0073cript>`; // false

Tagged templates can also access the raw strings, again a form without processing escape sequences.

function taggedCooked( strings ) {
    return strings[0];
}
taggedCooked`<script>` === taggedCooked`<\u0073cript>`; // true

function taggedRaw( strings ) {
    return strings.raw[0];
}
taggedRaw`<script>` === taggedRaw`<\u0073cript>`; // false

The source property of a RegExp contains a string representation of the pattern. JavaScript RegExp literals support Unicode escape sequences, but the escape sequence is preserved as written in the source.

const rPlain = /<script>/;
const rEscaped = /<\u0073cript>/;

rPlain.test('<script>'); // true
rEscaped.test('<script>'); // true

rPlain.source === rEscaped.source; // false
rPlain.source; // '<script>'
rEscaped.source; // '<\\u0073cript>'

Any better JavaScript escaping would likely require a complete JavaScript parser and much more invasive changes. It would be much more costly to perform. Even then, I'm not sure that the escaping could be done faithfully.

String.raw() could be split and joined:

String.raw`<script>` === String.raw`<s` + String.raw`cript>`; // true

Tagged template raw and RegExp source seem much more challenging.

Trac ticket: https://core.trac.wordpress.org/ticket/64419


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

* > then let the script block's type string for this script element be "text/javascript".
*/
$type_attr = $this->get_attribute( 'type' );
$language_attr = $this->get_attribute( 'language' );
Member

it seems like this has the potential to be clearer if we did one of two things:

  • early-abort when both the type and language attributes are missing.
  • null-coalesce to some value like '' which would be semantically the same in these checks as null but allow us to treat the values as strings.

or even something like this…

$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ? $type : ''; // combine `true` and `null` into ''.

Member Author

Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.

Member Author

I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.
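For readers following this thread, the type-string logic under discussion can be sketched in JavaScript. This is based on my reading of the "prepare the script element" steps in the HTML standard and the JavaScript MIME type essence list in the MIME Sniffing standard. It is not the PR's PHP implementation, and `isJavaScriptScript` is a hypothetical name. Attributes are modeled as null when absent; a valueless attribute is treated as ''.

```javascript
// JavaScript MIME type essences, as listed in the MIME Sniffing standard.
const JAVASCRIPT_MIME_ESSENCES = new Set([
  'application/ecmascript', 'application/javascript',
  'application/x-ecmascript', 'application/x-javascript',
  'text/ecmascript', 'text/javascript',
  'text/javascript1.0', 'text/javascript1.1', 'text/javascript1.2',
  'text/javascript1.3', 'text/javascript1.4', 'text/javascript1.5',
  'text/jscript', 'text/livescript',
  'text/x-ecmascript', 'text/x-javascript',
]);

// Lowercase only ASCII letters, for "ASCII case-insensitive" comparisons.
const asciiLower = (s) => s.replace(/[A-Z]/g, (c) => c.toLowerCase());

// Hypothetical function: `type` and `language` are null when absent.
function isJavaScriptScript(type, language) {
  let typeString;
  if (type === '' || (type === null && (language === null || language === ''))) {
    typeString = 'text/javascript';
  } else if (type !== null) {
    // Strip leading and trailing ASCII whitespace only.
    typeString = type.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, '');
  } else {
    typeString = 'text/' + language;
  }
  typeString = asciiLower(typeString);
  // A module is evaluated as JavaScript.
  return JAVASCRIPT_MIME_ESSENCES.has(typeString) || typeString === 'module';
}

console.log(isJavaScriptScript(null, null));               // true
console.log(isJavaScriptScript(' Text/JavaScript ', null)); // true
console.log(isJavaScriptScript('application/json', null)); // false
console.log(isJavaScriptScript(null, 'vbscript'));         // false
```

Normalizing absent attributes up front, as suggested above, keeps the branches to plain string comparisons.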

Comment on lines +4042 to +4049
/*
* > Otherwise, if the script block's type string is an ASCII case-insensitive match for
* > the string "module", then set el's type to "module".
*
* A module is evaluated as JavaScript.
*/
case 'module':
return true;
Member

PhpStorm complains about having a separate case here:

Image

It suggests moving the module case up above with the others.

This is purely stylistic, I acknowledge.

Member Author

I prefer this as-is, it allows quotes referencing different specifications to remain separate.

@sirreal
Member Author

sirreal commented Dec 19, 2025

The new is_javascript_script_tag() and is_json_script_tag() have been made private and have an ignore annotation (so they should not appear in documentation pages) as well as brief todo comments describing how a more general public method could be useful.

@sirreal sirreal marked this pull request as ready for review December 19, 2025 19:40
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @Copilot.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal
Member Author

sirreal commented Dec 19, 2025

I've polished this, addressed feedback, and added comments to explain how the escaping is being done.

This is ready for review.

* - \f
* - " " (U+0020 SPACE)
* - /
* - >
Member

we’re not checking for these state transitions. in the code following, we’re checking for the simplified transition, akin to the simplified diagram

I think it would be helpful to match the discussion of the approach with the comment. that is, I don’t know if it’s particularly helpful to spell out all of the characters when terser descriptions are warranted. e.g.

 * If the plaintext contains any sequences which could be interpreted as
 * SCRIPT opening or closing tags, then it is sufficient to escape these. This
 * prevents getting into the dangerous double-escape state. Technically, what
 * matters is not the presence of a full or actual SCRIPT tag, but the start of
 * a tag containing the "SCRIPT" tag name.
 *
 * @see URL
 */
if ( false !== stripos( ... ) || false !== stripos( ... ) ) {

}

Member Author

Those lines capture all of the relevant state transitions; it's really about these minimal transitions that either close the script or move to double-escaped:

---
config:
---
stateDiagram-v2
  state "script data" as ScriptData
  state "escaped" as Escaped
  state "double escaped" as DoubleEscaped
  state "Close script" as CloseScript

  Escaped --> DoubleEscaped : #60;script[ \t\f\r\n/>]
  ScriptData --> CloseScript : #60;/script[ \t\f\r\n/>]
  Escaped --> CloseScript : #60;/script[ \t\f\r\n/>]

The idea being that if those transitions are prevented then the script contents cannot break the HTML structure.


I agree the documentation here needs to be reviewed broadly. It's important to document the escaping well for posterity to understand why this is implemented as it is.

*
* This escaping strategy will make ALL JavaScript safe to embed in
* HTML in a way that is completely transparent in most cases.
*/
Member

are you happy with this comment? I think some things are very elaborate and technical, and worded in ways which could use some refinement.

Member Author

This comment was one of the last things I added and could certainly use revision.

*
* There are a few exceptions where the escaped form can be detected:
*
* - The escaped form would appear in any JavaScript comments.
Member

“There are a few exceptions where the escaped form can be detected:” is passive and feels indirect. Something punchier could go further with less.

 * This escaping cannot be done everywhere in JavaScript:
 *
 *  - Comments are not interpreted, meaning the escape sequences are visible, but
 *    only when reviewing the source code itself.
 *  - `String.raw()` and tagged template literal strings work on unescaped values.
 *  - The `source` property of a RegExp object returns unescaped strings.
 *
 * To avoid escaping in these situations it’s necessary to avoid presenting the
 * text which appears like a SCRIPT tag, for example, by splitting it into two
 * pieces and combining them.
 *
 * Example:
 *
 *     console.log( String.raw`</\u0073cript>` );           // </\u0073cript>
 *     console.log( String.raw`</scr` + String.raw`ipt>` ); // </script>

@sirreal sirreal requested a review from dmsnell December 22, 2025 18:28