Commit b9c4f77

gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEncodeError on Windows
The new REPL implementation (_pyrepl) crashes on Windows when the user enters Unicode characters outside the Basic Multilingual Plane (U+10000 and above), such as emoji (e.g. 🐍). The Windows input layer delivers these characters as surrogate pairs (two UTF-16 code units), which _pyrepl then processes and tokenizes as separate code points. A string that contains surrogate code points cannot be encoded with a strict codec, hence the UnicodeEncodeError.

This commit adds a module-level `normalize_surrogates()` helper to `Lib/_pyrepl/reader.py`. It joins surrogate pairs by encoding to UTF-16 with the 'surrogatepass' error handler and decoding back. The `get_unicode()` method now applies this normalization, so any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.

This resolves the UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.

Fixes #136595
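
The round trip described in the message can be checked outside the REPL. The following is a minimal sketch, not the code from the commit: `roundtrip_utf16` is a hypothetical standalone name, and the surrogate pair is written out by hand to imitate what the Windows input layer is described as delivering.

```python
# Hypothetical standalone helper mirroring the encode/decode round trip
# described in the commit message; it is not part of _pyrepl.
def roundtrip_utf16(s: str) -> str:
    # 'surrogatepass' lets surrogate code points through the encoder;
    # the plain UTF-16 decoder then joins each high/low pair into a
    # single code point.
    return s.encode("utf-16", "surrogatepass").decode("utf-16")


# U+1F40D (the snake emoji) written as its two UTF-16 surrogates.
split_snake = "\ud83d\udc0d"

# Without normalization, a strict codec rejects the surrogates.
try:
    split_snake.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

joined = roundtrip_utf16(split_snake)

print(strict_encode_failed)            # True
print(len(split_snake), len(joined))   # 2 1
print(joined == "\U0001F40D")          # True
```

Text that contains no surrogates passes through unchanged, so applying the round trip to every buffer read should be harmless apart from the extra encode and decode.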
1 parent b74fb8e commit b9c4f77

1 file changed

Lines changed: 8 additions & 1 deletion

Lib/_pyrepl/reader.py

@@ -40,6 +40,12 @@
 # syntax classes
 SYNTAX_WHITESPACE, SYNTAX_WORD, SYNTAX_SYMBOL = range(3)
 
+def normalize_surrogates(s):
+    # Encode with surrogatepass, decode to normalize surrogate pairs
+    try:
+        return s.encode('utf-16', 'surrogatepass').decode('utf-16')
+    except UnicodeError:
+        return s  # fallback if the round trip fails (e.g. a lone surrogate)
 
 def make_default_syntax_table() -> dict[str, int]:
     # XXX perhaps should use some unicodedata here?
@@ -759,4 +765,5 @@ def bind(self, spec: KeySpec, command: CommandName) -> None:
 
     def get_unicode(self) -> str:
         """Return the current buffer as a unicode string."""
-        return "".join(self.buffer)
+        text = "".join(self.buffer)
+        return normalize_surrogates(text)
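
One case worth checking around the fallback branch is a lone surrogate, for example when only the first half of a pair has reached the buffer. The sketch below rests on assumptions: `join_and_normalize` is a hypothetical stand-in for `get_unicode()`, and the buffer contents are invented for illustration. It catches `UnicodeError` because, for a lone surrogate, the failure comes from the decode step (`UnicodeDecodeError`) rather than the encode step.

```python
# Hypothetical stand-in for get_unicode(); the buffer is modelled as a
# list of one-character strings, as suggested by "".join(self.buffer).
def join_and_normalize(buffer: list[str]) -> str:
    text = "".join(buffer)
    try:
        return text.encode("utf-16", "surrogatepass").decode("utf-16")
    except UnicodeError:
        # A lone surrogate cannot be decoded; return the text unchanged.
        return text


# Invented sample buffers: a complete pair and a half-typed pair.
complete = ["h", "i", " ", "\ud83d", "\udc0d"]
partial = ["h", "i", " ", "\ud83d"]

print(join_and_normalize(complete) == "hi \U0001F40D")  # True
print(join_and_normalize(partial) == "hi \ud83d")       # True

# Show which exception the lone surrogate actually triggers.
try:
    "\ud83d".encode("utf-16", "surrogatepass").decode("utf-16")
    lone_error = None
except UnicodeError as exc:
    lone_error = type(exc).__name__

print(lone_error)  # UnicodeDecodeError
```

Since `UnicodeEncodeError` and `UnicodeDecodeError` both derive from `UnicodeError`, catching the base class covers a failure on either side of the round trip.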
