Commit 7e39a57

skef authored and garretrieger committed
Some segmentation principles
1 parent 005a5bd commit 7e39a57

1 file changed: 123 additions and 0 deletions
@@ -0,0 +1,123 @@
# Some Hypothetical Segmentation Principles

Author: Skef Iterum
Date: April 4, 2025

## Introduction

These are some hypothetical principles for segmentation and the reasoning
behind them. They are hypothetical in that, at the time of writing, they
have not been validated empirically.

One reason to put them on the table early is so that future performance tests
can be constructed with these ideas in mind, attempting to validate or
invalidate them rather than ignoring them.

## Background Principles

1. The most common kind of document impression (a "viewing" of a document) is
   written in a single language, and when there are language variants that
   affect writing (such as Hong Kong Chinese vs. Mainland Chinese), in a single
   language variant. Some of those documents will contain brief sections of
   different languages (phrases written in another language and then
   translated, names of people, places, or things).

   The second most common kind of document impression is written in two
   languages. And so forth.

2. In practice, codepoint frequency is font-relative.

   Obviously, when a font does not map a given codepoint, that font is unlikely
   to be used to render that codepoint. Most clients will support some kind of
   codepoint rendering fallback, but fallbacks are poor aesthetically and will
   typically be avoided by someone choosing a specific font.

   Perhaps less obviously, a font is less likely to render some codepoint that
   is high-frequency if it lacks support for certain other codepoints that are
   high-frequency. For example, suppose a font supports some codepoints that
   are high-frequency in general because they are high-frequency in Japanese,
   but lacks support for other high-frequency Japanese codepoints. Following BP
   1, that font is unlikely to be used for a document written in Japanese.
   Therefore, one would not normally expect the Japanese codepoints the font
   does support to be used with high frequency.

3. Spatial locality for high (to medium?) frequency codepoints is
   script/language-specific.

   This follows from BP 1. Loading a codepoint that is high-frequency for language
   *X* is highly predictive of needing other codepoints that are high-frequency
   for *X*, but far less predictive of needing codepoints that are high-frequency
   for other languages.

4. Spatial locality for (language-relative) low (to medium?) frequency
   codepoints is weak.

   A codepoint can of course be low-frequency *in general* while still being
   high-frequency relative to a language. The language itself may just be used
   less often. In such cases, loading one language-relative high-frequency
   codepoint is still predictive of loading another such codepoint. This isn't
   important to account for in the grand scheme of things, given the general low
   frequency, but it still helps in the cases where those codepoints are used.

   What doesn't help much are further attempts to exploit the locality of
   low-frequency codepoints. Loading one isn't very predictive of loading
   another. There can be exceptions, such as the box-drawing codepoints or
   the circled numbers, but a) these are arguably quasi-scripts and b) such
   glyphs may be more commonly used as quasi-emojis than for their intended
   purposes.

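One way BP 3 and BP 4 might be checked empirically is to estimate, over a
corpus, how often one codepoint appears in a document given that another does.
The sketch below is only an illustration: the function name and the tiny corpus
are made up, and a real test would need a representative sample of document
impressions.

```python
# Hypothetical sketch: the corpus and function name are illustrative only.

def cooccurrence_probability(documents, a, b):
    """Estimate P(b appears in a document | a appears in it).

    documents: iterable of sets of codepoints (here, single characters).
    Returns None when no document contains `a`.
    """
    with_a = [doc for doc in documents if a in doc]
    if not with_a:
        return None
    return sum(1 for doc in with_a if b in doc) / len(with_a)


# Toy corpus: each document is reduced to the set of characters it uses.
corpus = [
    set("こんにちは"),
    set("こんばんは"),
    set("こんにちは ①"),
    set("hello"),
    set("hello ☃"),
]

# Two codepoints that are high-frequency in the same language.
print(cooccurrence_probability(corpus, "こ", "ん"))
# Two codepoints that are low-frequency in every language.
print(cooccurrence_probability(corpus, "☃", "①"))
```

If the principles hold, the first kind of pair should score far higher than the
second on real data.
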
## Segmentation Principles

1. Glyphs should be segmented by script.

   This is the most basic principle that follows from BP 1.

2. Glyphs should be sub-segmented by language-specific high frequency, relative to
   "supported" languages.

   This follows from BP 1-4. Loading a glyph that is high-frequency in a given
   language is highly predictive of loading another glyph that is high-frequency
   in that language, *if the font has general support for that language*. This
   suggests a procedure like: identify the languages "supported" by the font,
   then segment the high-frequency glyphs for each of those languages together.

3. Glyphs that are high-frequency in one supported language but medium-frequency
   in another supported language should be sub-segmented separately.

   Even if Japanese is used more frequently than Chinese, a font that supports
   both may be used for either, so it's better not to bias Japanese over Chinese
   when that is avoidable. Therefore, you shouldn't mix a codepoint that is
   high-frequency in both Chinese and Japanese in with a codepoint that is
   high-frequency in Japanese but not in Chinese. The same applies to Mainland
   Chinese versus Hong Kong Chinese, and to Korean. Instead, permute the
   languages (that is, consider each combination of them) and make separate
   segments of shared high-frequency codepoints.

   The same can also be true of alphabetic languages, perhaps making segments for
   the "base" Latin glyphs and then accounting for precomposed accented codepoints
   for German, Vietnamese, French, and so forth separately.

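A minimal sketch of the grouping described in SP 3, assuming a set of
high-frequency codepoints is known for each supported language. Each codepoint
is keyed by the exact set of languages in which it is high-frequency, so
codepoints shared by Japanese and Chinese land in a different segment from
codepoints that are high-frequency in only one of them. The function name and
the sample data are hypothetical.

```python
# Hypothetical sketch: names and sample data are illustrative only.
from collections import defaultdict


def segments_by_language_set(lang_high_freq):
    """Group codepoints by the set of languages where they are high-frequency.

    lang_high_freq: {language: set of high-frequency codepoints}
    Returns {frozenset of languages: set of codepoints}.
    """
    langs_for_cp = defaultdict(set)
    for lang, cps in lang_high_freq.items():
        for cp in cps:
            langs_for_cp[cp].add(lang)

    segments = defaultdict(set)
    for cp, langs in langs_for_cp.items():
        segments[frozenset(langs)].add(cp)
    return dict(segments)


lang_high_freq = {
    "ja": {"日", "本", "の"},
    "zh-Hans": {"日", "本", "的"},
}

segments = segments_by_language_set(lang_high_freq)
for langs, cps in segments.items():
    print(sorted(langs), sorted(cps))
```

Only the language combinations that actually occur produce a segment, so the
number of segments stays bounded by the number of codepoints rather than by the
number of possible combinations.
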
4. High-frequency segments will be "lumpier".

   This follows from SP 2-3.

   When the collection of glyphs that make sense to segment together by
   high-frequency codepoint is large enough, one can of course sub-divide them to
   meet total size or glyph-count targets. However, when such groupings are not
   large enough, it is better to leave the segments as they are than to merge them
   with other segments just to match such targets. The handling of high-frequency
   codepoints is where most of the value for IFT comes from, so some lumpiness is
   fine.

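The sizing rule in SP 4 might look something like the sketch below: groups
larger than a target are split into roughly even parts, while smaller groups are
left alone rather than merged. The function name, the use of a glyph count (not
a byte size) as the target, and the even-split strategy are all assumptions made
for illustration.

```python
# Hypothetical sketch: the target and the splitting strategy are assumptions.

def size_segments(groups, max_glyphs):
    """Split oversized groups; never merge undersized ones.

    groups: list of lists of glyph ids, each list already ordered as desired.
    """
    segments = []
    for group in groups:
        if len(group) <= max_glyphs:
            # Small groups stay as they are, even if well below the target.
            segments.append(list(group))
            continue
        # Ceiling division gives the number of parts needed.
        n_parts = -(-len(group) // max_glyphs)
        base, extra = divmod(len(group), n_parts)
        start = 0
        for i in range(n_parts):
            # The first `extra` parts take one additional glyph.
            end = start + base + (1 if i < extra else 0)
            segments.append(list(group[start:end]))
            start = end
    return segments


groups = [list(range(10)), [100, 101]]
segments = size_segments(groups, max_glyphs=4)
print([len(s) for s in segments])
```

The two-glyph group comes out unchanged, which is the "lumpiness" the principle
accepts.
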
5. Segments containing glyphs that are not high-frequency relative to any language
   should be smaller.

   This follows from BP 4. High-frequency codepoints are loaded with a general
   expectation that you'll need a bunch of them (given BP 1 and the assumption
   that most documents aren't tiny). Therefore the size of such segments
   should generally lean larger to reduce overhead. Loading codepoints that aren't
   high-frequency relative to any language isn't very predictive, however, so the
   chance of needing other codepoints in the segment is low.

   The optimal size should be determined by other factors: one glyph per
   segment is bad from a privacy standpoint and will greatly increase the
   overhead of a "load the whole font" operation. So there will likely be some
   happy-medium target that is lower than that for high-frequency codepoints.

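The trade-off in SP 5 can be illustrated with a toy cost model. Everything in it
is an assumption made for the sake of the example: the glyph count, the average
glyph size, the fixed per-segment overhead, and the simplification that a load
of a low-frequency segment needs exactly one of its glyphs. Privacy is not
modelled at all.

```python
# Hypothetical cost model: every number and name here is an assumption.

def single_load_waste(segment_size, glyph_bytes):
    """Bytes fetched but unused when only one glyph in the segment is needed."""
    return (segment_size - 1) * glyph_bytes


def whole_font_overhead(total_glyphs, segment_size, per_segment_overhead):
    """Fixed overhead paid when every segment has to be fetched."""
    n_segments = -(-total_glyphs // segment_size)  # ceiling division
    return n_segments * per_segment_overhead


TOTAL_GLYPHS = 20_000         # low-frequency glyphs in the font (assumed)
GLYPH_BYTES = 200             # average glyph size in bytes (assumed)
PER_SEGMENT_OVERHEAD = 300    # fixed cost per segment in bytes (assumed)

for size in (1, 8, 64):
    waste = single_load_waste(size, GLYPH_BYTES)
    overhead = whole_font_overhead(TOTAL_GLYPHS, size, PER_SEGMENT_OVERHEAD)
    print(f"segment size {size:>2}: waste per load {waste:>6} B, "
          f"whole-font overhead {overhead:>9} B")
```

Under these assumed numbers, very small segments minimise wasted bytes per load
but inflate the whole-font overhead, and large segments do the reverse, which is
consistent with a middle target below the one used for high-frequency segments.
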
