-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
| * > then let the script block's type string for this script element be "text/javascript". | ||
| */ | ||
| $type_attr = $this->get_attribute( 'type' ); | ||
| $language_attr = $this->get_attribute( 'language' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this has the potential to be clearer if we did one of two things:
- early-abort when both the type and language attributes are missing.
- null-coalesce to some value like
''which would be semantically the same in these checks asnullbut allow us to treat the values as strings.
or even something like this…
$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: ''; // combine `true` and `null` into ''.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for | ||
| * > the string "module", then set el's type to "module". | ||
| * | ||
| * A module is evaluated as JavaScript. | ||
| */ | ||
| case 'module': | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this as-is, it allows quotes referencing different specifications to remain separate.
Co-authored-by: Weston Ruter <westonruter@gmail.com>
|
The new |
Explain caveats and hint at workarounds.
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Unlinked AccountsThe following contributors have not linked their GitHub and WordPress.org accounts: @Copilot. Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases. Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
I've polished this, addressed feedback, and added comments to explain how the escaping is being done. This is ready for review. |
| * - \f | ||
| * - " " (U+0020 SPACE) | ||
| * - / | ||
| * - > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we’re not checking for these state transitions. in the code following, we’re checking for the simplified transition, akin to the simplified diagram
I think it would be helpful to match the discussion of the approach with the comment. that is, I don’t know if it’s particularly helpful to spell out all of the characters when terser descriptions are warranted. e.g.
* If the plaintext contains any sequences which would could be interpreted as
* SCRIPT opening or closing tags, then it is sufficient to escape these. This
* prevents getting into the dangerous double-escape state. Technically, what
* matters is not the presence of a full or actual SCRIPT tag, but the start of
* a tag containing the "SCRIPT" tag name.
*
* @see URL
*/
if ( false !== stripos( ... ) || false !== stripos( ... ) ) {
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those lines capture all of the relevant state transitions, it's really about these minimal transitions that either close the script or move to double-escaped:
---
config:
---
stateDiagram-v2
state "script data" as ScriptData
state "escaped" as Escaped
state "double escaped" as DoubleEscaped
state "Close script" as CloseScript
Escaped --> DoubleEscaped : #60;script[ \t\f\r\n/>]
ScriptData --> CloseScript : #60;/script[ \t\f\r\n/>]
Escaped --> CloseScript : #60;/script[ \t\f\r\n/>]
The idea being that if those transitions are prevented then the script contents cannot break the HTML structure.
I agree the documentation here needs to be reviewed broadly. It's important to document the escaping well for posterity to understand why this is implemented as it is.
| * | ||
| * This escaping strategy strikes will make ALL JavaScript safe to embed in | ||
| * HTML in a way that is completely transparent in most cases. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you happy with this comment? I think some things are very elaborate and technical, and worded in ways which could use some refinement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was one of the last things I added and could certainly use revision.
| * | ||
| * There are a few exceptions where the escaped form can be detected: | ||
| * | ||
| * - The escaped form would appear in any JavaScript comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“There are a few exceptions where the escaped form can be detected:” is passive and feels indirect. Something more punching could go further with less.
* This escaping cannot be done everywhere in JavaScript:
*
* - Comments are not interpreted, meaning the escape sequences are visible, but
* only when reviewing the source code itself.
* - `String.raw()` and tagged template literal strings work on unescaped values.
* - The `source` property of a RegExp object returns unescaped strings.
*
* To avoid escaping in these situations it’s necessary to avoid presenting the
* text which appears like a SCRIPT tag, for example, by splitting it into two
* pieces and combining them.
*
* Example:
*
* console.log( String.raw`</\u0073cript>` ); // </\u0073cript>
* console.log( String.raw`</scr` + String.raw`ipt>` ); // </script>
The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.
<scriptor</script(case-insensitive) is found.In JSON
<is replaced with\u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.This is proposed as a simple character replacement with
strtr. This should be highly performant. A less invasive replacement could be done to only replace<in<scriptor</scriptwhere it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).In JavaScript
<scriptand</script(followed by a necessary tag termination character\t\n\r\f/>) thesis replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).From the ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/64419
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.