| 1 | +# Some Hypothetical Segmentation Principles |
| 2 | + |
| 3 | +Author: Skef Iterum |
| 4 | +Date: April 4, 2025 |
| 5 | + |
| 6 | +## Introduction |
| 7 | + |
| 8 | +These are some hypothetical principles for segmentation and the reasoning |
| 9 | +behind them. They are hypothetical in that at the time of writing they |
| 10 | +have not been validated empirically. |
| 11 | + |
| 12 | +One reason to put them on the table early is so that future performance tests |
| 13 | +can be designed with these ideas in mind and can attempt to validate or |
| 14 | +invalidate them, rather than ignoring them. |
| 15 | + |
| 16 | +## Background Principles |
| 17 | + |
| 18 | +1. The most common kind of document impression (a "viewing" of a document) is |
| 19 | + written in a single language, and when there are language variants that |
| 20 | + affect writing (such as Hong Kong Chinese vs Mainland Chinese), in a single |
| 21 | + language variant. Some of those documents will contain brief sections of |
| 22 | + different languages (phrases written in another language and then |
| 23 | + translated, names of people, places, or things). |
| 24 | + |
| 25 | + The second most common kind of document impression is written in two |
| 26 | + languages. And so forth. |
| 27 | + |
| 28 | +2. In practice, codepoint frequency is font-relative. |
| 29 | + |
| 30 | + Obviously, when a font does not map a given codepoint, that font is unlikely |
| 31 | + to be used to render that codepoint. Most clients will support some kind of |
| 32 | + codepoint rendering fallback, but fallbacks are poor aesthetically and will |
| 33 | + typically be avoided by someone choosing a specific font. |
| 34 | + |
| 35 | + Perhaps less obviously, a font is less likely to render some codepoint that |
| 36 | + is high-frequency if it lacks support for certain other codepoints that are |
| 37 | + high-frequency. For example, suppose a font supports some codepoints that |
| 38 | + are high-frequency in general because they are high-frequency in Japanese, |
| 39 | + but lacks support for other high-frequency Japanese codepoints. Following BP |
| 40 | + 1, that font is unlikely to be used for a document written in Japanese. |
| 41 | + Therefore, one would not normally expect the Japanese codepoints the font |
| 42 | + does support to be used with high frequency. |
| 43 | + |
| 44 | +3. Spatial locality for high (to medium?) frequency codepoints is |
| 45 | + script/language-specific. |
| 46 | + |
| 47 | + This follows from BP 1. Loading a codepoint that is high-frequency for language |
| 48 | + *X* is highly predictive of needing other codepoints that are high-frequency |
| 49 | + for *X*, but far less predictive of needing codepoints that are high-frequency |
| 50 | + for other languages. |
| 51 | + |
| 52 | +4. Spatial locality for (language-relative) low (to medium?) frequency |
| 53 | + codepoints is weak. |
| 54 | + |
| 55 | + A codepoint can of course be *in general* low-frequency while still being |
| 56 | + language-relative high-frequency. The language itself may just be used less |
| 57 | + often. In such cases loading one language-relative high-frequency codepoint |
| 58 | + is still predictive of loading another such codepoint. This isn't important |
| 59 | + to account for in the grand scheme of things, given the general low frequency, |
| 60 | + but it still helps in the cases where those codepoints are used. |
| 61 | + |
| 62 | + What doesn't help much are further attempts to exploit the locality of |
| 63 | + low-frequency codepoints. Loading one isn't very predictive of loading |
| 64 | + another. There can be exceptions, such as the box-building codepoints, or |
| 65 | + the circled numbers, but a) these are arguably quasi-scripts and b) such |
| 66 | + glyphs may be more commonly used as quasi-emojis than for their intended |
| 67 | + purposes. |
| 68 | + |
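BP 2 can be made concrete with a minimal Python sketch. It assumes a font counts as "supporting" a language when it maps nearly all of that language's high-frequency codepoints. The function name `supported_languages`, the `min_coverage` threshold, and the short character lists are hypothetical placeholders for illustration, not measured frequency data.

```python
def supported_languages(font_codepoints, high_freq_by_lang, min_coverage=0.95):
    """Return the languages whose high-frequency codepoints the font mostly maps.

    font_codepoints:   set of codepoints the font maps (e.g. taken from its cmap).
    high_freq_by_lang: dict of language tag -> set of high-frequency codepoints.
    min_coverage:      assumed cut-off; the right value is an open question.
    """
    supported = set()
    for lang, high_freq in high_freq_by_lang.items():
        if not high_freq:
            continue  # no frequency data for this language, so nothing to judge by
        coverage = len(high_freq & font_codepoints) / len(high_freq)
        if coverage >= min_coverage:
            supported.add(lang)
    return supported


# Placeholder data: a handful of characters standing in for real frequency tables.
HIGH_FREQ = {
    "ja": {ord(c) for c in "のにはをた"},
    "zh": {ord(c) for c in "的一是不了"},
}

# A font that maps all of the "ja" list but only two characters of the "zh" list.
FONT = {ord(c) for c in "のにはをた的一"}

print(supported_languages(FONT, HIGH_FREQ))
```

Under these assumptions the two Chinese codepoints the font does map would not be treated as high-frequency for this font, which is the sense in which codepoint frequency is font-relative.
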
| 69 | +## Segmentation Principles |
| 70 | + |
| 71 | +1. Glyphs should be segmented by script. |
| 72 | + |
| 73 | + This is the most basic principle that follows from BP 1. |
| 74 | + |
| 75 | +2. Glyphs should be sub-segmented by language-specific high frequency, relative to |
| 76 | + "supported" languages. |
| 77 | + |
| 78 | + This follows from BP 1-4. Loading a glyph that is high-frequency in a given |
| 79 | + language is highly predictive of loading another glyph that is high frequency |
| 80 | + in that language, *if the font has general support for that language*. This |
| 81 | + suggests a procedure like: identify the languages "supported" by the font, |
| 82 | + then segment the high-frequency glyphs for those languages together. |
| 83 | + |
| 84 | +3. Glyphs that are high-frequency in one supported language but medium frequency |
| 85 | + in another supported language should be sub-segmented separately. |
| 86 | + |
| 87 | + Even if Japanese is used more frequently than Chinese, a font that supports |
| 88 | + both may be used for either, so it's better not to bias Japanese over Chinese |
| 89 | + when that is avoidable. Therefore, you shouldn't mix a codepoint that is |
| 90 | + high-frequency in both Chinese and Japanese in with a codepoint that is |
| 91 | + high-frequency in Japanese but not in Chinese. The same applies to Mainland |
| 92 | + Chinese versus Hong Kong Chinese, and to Korean. Instead, enumerate the |
| 93 | + combinations of languages and make a separate segment for the high-frequency codepoints shared by each combination. |
| 94 | + |
| 95 | + The same can also be true of alphabetic languages, perhaps making segments for |
| 96 | + the "base" Latin glyphs and then accounting for precomposed accented codepoints |
| 97 | + for German, Vietnamese, French, and so forth separately. |
| 98 | + |
| 99 | +4. High-frequency segments will be "lumpier". |
| 100 | + |
| 101 | + This follows from SP 2-3. |
| 102 | + |
| 103 | + When the collection of glyphs that make sense to segment together by high-frequency |
| 104 | + codepoint is large enough, one can of course sub-divide it to meet total size or |
| 105 | + glyph-count targets. However, when such groupings are not large enough it is |
| 106 | + better to leave the segments as they are than to merge them with other segments |
| 107 | + just to match such targets. The handling of high-frequency codepoints is where |
| 108 | + most of the value for IFT comes from, so some lumpiness is fine. |
| 109 | + |
| 110 | +5. Segments containing glyphs that are not high-frequency relative to any language |
| 111 | + should be smaller. |
| 112 | + |
| 113 | + This follows from BP 4. High-frequency codepoints are loaded with a general |
| 114 | + expectation that you'll need a bunch of them (given BP 1 and the assumption |
| 115 | + that most documents aren't tiny). Therefore the size of such segments |
| 116 | + should generally lean larger to reduce overhead. Loading codepoints that aren't |
| 117 | + high-frequency relative to any language isn't very predictive, however, so the |
| 118 | + chance of needing other codepoints in the segment is low. |
| 119 | + |
| 120 | + The optimal size should be determined by other factors: One glyph per |
| 121 | + segment is bad from a privacy standpoint and will greatly increase the |
| 122 | + overhead of a "load the whole font" operation. So there will likely be some |
| 123 | + happy-medium target that is lower than that for high-frequency codepoints. |
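
SP 2 through SP 5 can be sketched together in Python. This is one possible reading, not a validated procedure: it assumes script segmentation (SP 1) has already happened, treats one codepoint as one glyph, and groups codepoints by the exact set of supported languages in which they are high-frequency. All function names and the `high_freq_max` and `low_freq_max` size limits are hypothetical.

```python
from collections import defaultdict


def group_by_language_signature(font_codepoints, high_freq_by_lang, supported):
    """Group codepoints by the set of supported languages where they are high-frequency.

    The empty signature collects codepoints that are not high-frequency in any
    supported language.
    """
    groups = defaultdict(set)
    for cp in font_codepoints:
        signature = frozenset(
            lang for lang in supported if cp in high_freq_by_lang.get(lang, ())
        )
        groups[signature].add(cp)
    return dict(groups)


def split_into_chunks(codepoints, max_size):
    """Split one group into chunks of at most max_size; never merges across groups."""
    ordered = sorted(codepoints)
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


def build_segments(font_codepoints, high_freq_by_lang, supported,
                   high_freq_max=4, low_freq_max=2):
    """Build segments: large groups are sub-divided, small groups are left "lumpy"."""
    segments = []
    groups = group_by_language_signature(font_codepoints, high_freq_by_lang, supported)
    for signature, cps in groups.items():
        # Per SP 5, codepoints with no high-frequency language get a smaller limit.
        limit = high_freq_max if signature else low_freq_max
        segments.extend(split_into_chunks(cps, limit))
    return segments


# Placeholder integers standing in for codepoints.
high_freq = {"ja": {1, 2, 3}, "zh": {2, 3, 4}}
font = {1, 2, 3, 4, 5, 6, 7}
groups = group_by_language_signature(font, high_freq, {"ja", "zh"})
segments = build_segments(font, high_freq, {"ja", "zh"})
print(sorted(segments))
```

With this toy input the codepoints shared by both languages land in their own segment, separate from the Japanese-only and Chinese-only ones, and the three leftover codepoints are split under the smaller limit. The single-codepoint segments only reflect the tiny input; real size limits would be chosen with the privacy and overhead concerns above in mind.
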