fix: decode U+10FFFF instead of replacing it with U+FFFD #103

Open
greymoth-jp wants to merge 1 commit into mdevils:main from greymoth-jp:fix-decode-max-codepoint

@greymoth-jp

decode replaces the numeric reference &#x10FFFF; (and &#1114111;) with U+FFFD, but U+10FFFF is a valid Unicode scalar value and encode emits it just fine:

import { encode, decode } from 'html-entities';

const s = String.fromCodePoint(0x10ffff);
encode(s, { mode: 'nonAscii' }); // "&#1114111;"
decode('&#x10FFFF;');            // "�"  (expected "\u{10FFFF}")

So a string containing the highest code point does not survive a round trip through encode/decode, even though String.fromCodePoint(0x10ffff) is valid in JS (only 0x110000 and above throw).
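For reference, the range claim can be checked with plain JS built-ins, independent of this library. This is a small sketch; the variable names are only for illustration:

```typescript
// U+10FFFF is the last code point String.fromCodePoint accepts.
const maxCodePoint = 0x10ffff;
const highest = String.fromCodePoint(maxCodePoint);

// It is an astral code point, so it occupies two UTF-16 code units.
const utf16Length = highest.length;
const readBack = highest.codePointAt(0);

// One past the maximum is rejected with a RangeError.
let threw = false;
try {
    String.fromCodePoint(maxCodePoint + 1);
} catch (err) {
    threw = err instanceof RangeError;
}

console.log(utf16Length, readBack, threw);
```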

The cause is an off-by-one in the bounds check inside getDecodedEntity:

decodeCode >= 0x10ffff ? outOfBoundsChar : ...

The WHATWG numeric character reference rules only substitute U+FFFD when the referenced value is greater than 0x10FFFF, so the comparison should be >. U+10FFFE and every code point below it already decode correctly; only the maximum was being caught by the >=.
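To make the off-by-one concrete, here is a standalone sketch of the two comparisons side by side. `checkBefore` and `checkAfter` are hypothetical names, not functions from the library, and they only model the upper-bound check (surrogates, zero, and the other special cases are left out):

```typescript
const OUT_OF_BOUNDS_CHAR = '\uFFFD';

// Old comparison: the maximum itself is treated as out of range.
const checkBefore = (code: number): string =>
    code >= 0x10ffff ? OUT_OF_BOUNDS_CHAR : String.fromCodePoint(code);

// New comparison: only values strictly above the maximum are replaced.
const checkAfter = (code: number): string =>
    code > 0x10ffff ? OUT_OF_BOUNDS_CHAR : String.fromCodePoint(code);

// The two differ at exactly one input, 0x10ffff.
for (const code of [0x10fffe, 0x10ffff, 0x110000]) {
    const same = checkBefore(code) === checkAfter(code);
    console.log(code.toString(16), same ? 'same' : 'differs');
}
```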

This changes the check to > and adds a test covering &#x10FFFF;, &#1114111;, the encode/decode round trip, and that &#x110000; (past the Unicode range) still maps to U+FFFD.
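The decode cases can be sketched without the library as a small table-driven check. `decodeNumericRefs` below is a hypothetical stand-in that handles only numeric references with the corrected bound; it is not the test added in this PR and does not cover the encode side of the round trip:

```typescript
// Hypothetical mini decoder: numeric references only, upper bound only.
function decodeNumericRefs(text: string): string {
    return text.replace(
        /&#(?:x([0-9a-f]+)|([0-9]+));/gi,
        (_match: string, hex?: string, dec?: string): string => {
            const code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec as string, 10);
            return code > 0x10ffff ? '\uFFFD' : String.fromCodePoint(code);
        }
    );
}

const highestChar = String.fromCodePoint(0x10ffff);
const cases: Array<[string, string]> = [
    ['&#x10FFFF;', highestChar], // hex form of the maximum
    ['&#1114111;', highestChar], // decimal form of the maximum
    ['&#x110000;', '\uFFFD'],    // one past the Unicode range
];

for (const [input, expected] of cases) {
    if (decodeNumericRefs(input) !== expected) {
        throw new Error(`unexpected result for ${input}`);
    }
}
```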
