Commit 51751f0

Fix heap buffer overflow in UTF-16 encoder via error handler rewind

When encoding UTF-16 and a user-supplied error handler rewinds the position (newpos < pos), the resize calculation counted characters instead of UTF-16 code units. Supplementary characters (>= U+10000) each need 2 UTF-16 units but were counted as 1, causing an undersized buffer allocation. The subsequent re-encoding pass would overflow the buffer by 2 bytes per supplementary character in the rewind range.

Fix by counting actual UTF-16 code units needed: for UCS-4 kind strings, add an extra unit for each supplementary character in the rewind range.

https://claude.ai/code/session_01XLyeaYE4CLWT5QPZYR3KDr
1 parent 300de1e commit 51751f0

1 file changed

Lines changed: 14 additions & 1 deletion

Objects/unicodeobject.c

@@ -6387,7 +6387,20 @@ _PyUnicode_EncodeUTF16(PyObject *str,
                     goto error;
                 }
             }
-            moreunits += pos - newpos;
+            /* Count UTF-16 code units needed for the rewind range.
+               Supplementary characters (>= U+10000) need 2 units each. */
+            if (newpos < pos) {
+                Py_ssize_t rewindunits = pos - newpos;
+                if (kind == PyUnicode_4BYTE_KIND) {
+                    const Py_UCS4 *rewind_data = (const Py_UCS4 *)data;
+                    for (Py_ssize_t i = newpos; i < pos; i++) {
+                        if (rewind_data[i] >= 0x10000) {
+                            rewindunits++;
+                        }
+                    }
+                }
+                moreunits += rewindunits;
+            }
             pos = newpos;
 
             /* two bytes are reserved for each surrogate */