Fix heap buffer overflow in UTF-16 encoder via error handler rewind

claude · claude · commit 51751f0c4abe · 2026-03-05T02:20:31.000Z
When encoding UTF-16 and a user-supplied error handler rewinds the position (newpos < pos), the resize calculation counted characters instead of UTF-16 code units. Supplementary characters (>= U+10000) each need 2 UTF-16 units but were counted as 1, so the buffer was resized too small. The subsequent re-encoding pass would then overflow the buffer by 2 bytes per supplementary character in the rewind range.

Fix by counting the UTF-16 code units actually needed: for UCS-4 kind strings, add an extra unit for each supplementary character in the rewind range.

https://claude.ai/code/session_01XLyeaYE4CLWT5QPZYR3KDr
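
For illustration only, the counting rule can be sketched as a standalone function. The name utf16_units_in_range and the use of uint32_t in place of Py_UCS4 are assumptions made for this example; the patch itself inlines the loop and adds only the extra units on top of pos - newpos.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CPython: returns the number of UTF-16
   code units needed to encode the code points in data[start..end). */
static size_t
utf16_units_in_range(const uint32_t *data, size_t start, size_t end)
{
    size_t units = 0;
    for (size_t i = start; i < end; i++) {
        /* Code points at or above U+10000 encode as a surrogate pair
           (2 units); everything below takes a single unit. */
        units += (data[i] >= 0x10000) ? 2 : 1;
    }
    return units;
}
```

For a rewind range holding U+0041 followed by U+1F600, this gives 3 units, while pos - newpos gives 2. The missing unit is the 2-byte shortfall described above.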
diff --git a/Objects/unicodeobject.c b/Objects/unicodeobject.c
@@ -6387,7 +6387,20 @@ _PyUnicode_EncodeUTF16(PyObject *str,
                 goto error;
             }
         }
-        moreunits += pos - newpos;
+        /* Count UTF-16 code units needed for the rewind range.
+           Supplementary characters (>= U+10000) need 2 units each. */
+        if (newpos < pos) {
+            Py_ssize_t rewindunits = pos - newpos;
+            if (kind == PyUnicode_4BYTE_KIND) {
+                const Py_UCS4 *rewind_data = (const Py_UCS4 *)data;
+                for (Py_ssize_t i = newpos; i < pos; i++) {
+                    if (rewind_data[i] >= 0x10000) {
+                        rewindunits++;
+                    }
+                }
+            }
+            moreunits += rewindunits;
+        }
         pos = newpos;
 
         /* two bytes are reserved for each surrogate */