
Commit 3d9602a

committed
Add the auto segmenter config to font2ift.
This allows font2ift to perform the full IFT encoding process: 1. Auto generate the segmenter config. 2. Run the segmenter. 3. Compile the font. If a segmentation plan is not supplied to font2ift, it will use the segmenter auto config and the closure segmenter to generate one.
1 parent ce31bc8 commit 3d9602a

12 files changed

+327
-128
lines changed

README.md

Lines changed: 89 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ script:
5858
## Documentation
5959

6060
The documents under [docs/experimental](docs/experimental) provide some more detailed designs of various aspects of the IFT encoder. Of note:
61-
* [compiler.md](docs/experimental)
61+
* [compiler.md](docs/experimental/compiler.md)
6262
* [closure_glyph_segmentation.md](docs/experimental/closure_glyph_segmentation.md)
6363
* [closure_glyph_segmentation_merging.md](docs/experimental/closure_glyph_segmentation_merging.md)
6464
* [closure_glyph_segmentation_complex_conditions.md](docs/experimental/closure_glyph_segmentation_complex_conditions.md)
@@ -77,13 +77,81 @@ bazel run @hedron_compile_commands//:refresh_all
7777

7878
Will generate a compile_commands.json file.
7979

80-
## Producing IFT Encoded Fonts
80+
## Producing IFT Encoded Fonts (with Auto Config)
8181

82-
IFT encoded fonts are produced in two steps:
83-
1. A segmentation plan is generated which specifies how the font file should be split up in the IFT encoding.
84-
2. The IFT encoded font and patches are compiled by the Compiler sub module using the segmentation plan.
82+
The simplest way to create IFT fonts is to use the `font2ift` utility in auto configuration mode.
83+
This is done by running the utility and not providing a segmentation plan. Example invocation:
8584

86-
### Generating Segmentation Plan
85+
```bash
86+
bazel run -c opt @ift_encoder//util:font2ift -- \
87+
--input_font="$HOME/fonts/myfont/MyFont.ttf" \
88+
--output_path=$HOME/fonts/myfont/ift/ \
89+
--output_font="MyFont-IFT.woff2"
90+
```
91+
92+
This will analyze the input font, decide how to segment it, and then produce the final IFT encoded font
93+
and patches.
94+
95+
When utilizing auto config there are two optional flags which can be used to adjust the behaviour:
96+
* `--auto_config_primary_script`: this tells the config generator which language/script the font is intended
97+
to be used with. It has two effects: first the codepoints of the primary script are eligible to be moved
98+
into the initial font. Second for scripts with large overlaps, such as CJK, primary script selects which
99+
of the overlapping scripts to use frequency data from. Values refer to frequency data files in
100+
[ift-encoder-data](https://github.com/w3c/ift-encoder-data/tree/main/data). Example values: "Script_bengali",
101+
"Language_fr"
102+
103+
* `--auto_config_quality`: This is analogous to a quality level in a compression library. It controls how much
104+
effort is spent to improve the efficiency of the final IFT font. Values range from 1 to 8, where higher
105+
values increase encoding times but typically result in a more efficient end IFT font (i.e. fewer bytes
106+
transferred by clients using it).
107+
108+
Example command line with optional flags:
109+
110+
```bash
111+
bazel run -c opt @ift_encoder//util:font2ift -- \
112+
--input_font="$HOME/fonts/NotoSansJP-Regular.otf" \
113+
--output_path=$HOME/fonts/ift/ \
114+
--output_font="NotoSansJP-Regular-IFT.woff2" \
115+
--auto_config_primary_script=Script_japanese \
116+
--auto_config_quality=3
117+
```
118+
119+
*Note: the auto configuration mode is still under development, in particular the auto selection of quality level
120+
is currently quite simplistic. It's expected to continue to evolve from its current state.*
121+
122+
## Producing IFT Encoded Fonts (Advanced)
123+
124+
Under the hood IFT font encoding happens in three stages:
125+
126+
1. Generate or write a segmenter config for the font.
127+
2. Generate a segmentation plan, which describes how the font is split into patches. Takes the segmenter config as an input.
128+
3. Compile the final IFT encoded font following the segmentation plan.
129+
130+
For more advanced use cases these steps can be performed individually. This allows the segmenter config
131+
and segmentation plans to be fine tuned beyond what auto configuration is capable of.
132+
133+
### Step 1: Generating a Segmenter Config
134+
135+
There are two main options for generating a segmenter config:
136+
137+
1. Write the config by hand. The segmenter is configured via an input configuration file using the
138+
[segmenter_config.proto](util/segmenter_config.proto) schema, see the comments there for more details.
139+
This option is useful when maximum control over segmentation parameters is needed, or custom frequency
140+
data is being supplied.
141+
142+
2. Auto generate the segmenter config using `util:generate_segmenter_config`.
143+
144+
```
145+
CC=clang bazel run //util:generate_segmenter_config -- \
146+
--quality=5 \
147+
--input_font=$HOME/MyFont.ttf > config.txtpb
148+
```
149+
150+
This analyzes the input font and tries to pick appropriate config values automatically. As discussed in
151+
the "Producing IFT Encoded Fonts (with Auto Config)" section above, there is a configurable quality level. If needed
152+
the auto generated config can be hand tweaked after generation.
153+
154+
### Step 2: Generating Segmentation Plan
87155

88156
Segmentation plans are in a [textproto format](https://protobuf.dev/reference/protobuf/textformat-spec/) using the
89157
[segmentation_plan.proto](util/segmentation_plan.proto) schema. See the comments in the schema file for more information.
@@ -93,17 +161,9 @@ possible to write plans by hand, or develop new utilities to generate plans.
93161

94162
In this repo 3 options are currently provided:
95163

96-
1. `util/generate_table_keyed_config`: this utility generates the table keyed (extension segments that augment non
97-
glyph data in the font) portion of a plan. Example execution:
98-
99-
```sh
100-
bazel run -c opt util:generate_table_keyed_config -- \
101-
--font=$(pwd)/myfont.ttf \
102-
latin.txt cyrillic.txt greek.txt > table_keyed.txtpb
103-
```
104-
105-
2. `util/closure_glyph_keyed_segmenter_util`: this utility uses a subsetting closure based approach to generate a glyph
106-
keyed segmentation plan (extension segments that augment glyph data). Example execution:
164+
1. [Recommended] `util/closure_glyph_keyed_segmenter_util`: this utility uses a subsetting closure based approach
165+
to generate a glyph keyed segmentation plan (extension segments that augment glyph data). It can optionally
166+
generate the table keyed portion of the config as well. Example execution:
107167

108168
```sh
109169
bazel run -c opt util:closure_glyph_keyed_segmenter_util -- \
@@ -119,6 +179,15 @@ In this repo 3 options are currently provided:
119179
Note: this utility is under active development and still very experimental. See
120180
[the status section](docs/experimental/closure_glyph_segmentation.md#status) for more details.
121181

182+
2. `util/generate_table_keyed_config`: this utility generates the table keyed (extension segments that augment non
183+
glyph data in the font) portion of a plan. Example execution:
184+
185+
```sh
186+
bazel run -c opt util:generate_table_keyed_config -- \
187+
--font=$(pwd)/myfont.ttf \
188+
latin.txt cyrillic.txt greek.txt > table_keyed.txtpb
189+
```
190+
122191
3. `util/iftb2config`: this utility converts a segmentation obtained from the
123192
[binned incremental font transfer prototype](https://github.com/adobe/binned-ift-reference)
124193
into an equivalent segmentation plan. Example execution:
@@ -128,23 +197,20 @@ In this repo 3 options are currently provided:
128197
bazel run util:iftb2config > segmentation_plan.txtpb
129198
```
130199

131-
If seperate glyph keyed and table keyed configs were generated using #1 and #2 they can then be combined into one
200+
If separate glyph keyed and table keyed configs were generated using #1 and #2 they can then be combined into one
132201
complete plan by concatenating them:
133202

134203
```sh
135204
cat glyph_keyed.txtpb table_keyed.txtpb > segmentation_plan.txtpb
136205
```
137206

138-
Additional tools for generating encoder configs are planned to be added in the future.
139-
140207
For concrete examples of how to generate IFT fonts, see the [IFT Demo](https://github.com/garretrieger/ift-demo).
141208
In particular the [Makefile](https://github.com/garretrieger/ift-demo/blob/main/Makefile) and the
142209
[segmenter configs](https://github.com/garretrieger/ift-demo/tree/main/config) may be helpful.
143210

144-
### Generating an IFT Encoding
211+
### Step 3: Generating an IFT Encoding
145212

146-
Once an segmentation plan has been created it can be combined with the target font to produce and incremental font and collection
147-
of associated patches using the font2ift utility which is a wrapper around the compiler. Example execution:
213+
Once a segmentation plan has been created it can be combined with the target font to produce an incremental font and collection of associated patches using the font2ift utility which is a wrapper around the compiler. Example execution:
148214

149215
```sh
150216
bazel -c opt run util:font2ift -- \

ift/encoder/closure_glyph_segmenter.cc

Lines changed: 33 additions & 0 deletions
@@ -734,4 +734,37 @@ Status ClosureGlyphSegmenter::FallbackCost(
734734
return absl::OkStatus();
735735
}
736736

737+
void ClosureGlyphSegmenter::AddTableKeyedSegments(
738+
SegmentationPlan& plan,
739+
const btree_map<SegmentSet, MergeStrategy>& merge_groups,
740+
const std::vector<SubsetDefinition>& segments,
741+
const SubsetDefinition& init_segment) {
742+
std::vector<SubsetDefinition> table_keyed_segments;
743+
for (const auto& [segment_ids, _] : merge_groups) {
744+
SubsetDefinition new_segment;
745+
for (uint32_t s : segment_ids) {
746+
new_segment.Union(segments.at(s));
747+
}
748+
new_segment.Subtract(init_segment);
749+
table_keyed_segments.push_back(new_segment);
750+
}
751+
752+
uint32_t max_id = 0;
753+
for (const auto& [id, _] : plan.segments()) {
754+
if (id > max_id) {
755+
max_id = id;
756+
}
757+
}
758+
759+
uint32_t next_id = max_id + 1;
760+
auto* plan_segments = plan.mutable_segments();
761+
for (const SubsetDefinition& def : table_keyed_segments) {
762+
GlyphSegmentation::SubsetDefinitionToSegment(def,
763+
(*plan_segments)[next_id]);
764+
SegmentsProto* segment_ids = plan.add_non_glyph_segments();
765+
segment_ids->add_values(next_id);
766+
next_id++;
767+
}
768+
}
769+
737770
} // namespace ift::encoder
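
The new `AddTableKeyedSegments` helper can be read as three steps: union the members of each merge group, subtract the initial segment, and append the result to the plan under the next free id. The following is a minimal, self-contained sketch of that flow. It assumes simplified stand-in types (`Segment`, `Plan`) and a plain vector of merge groups in place of the real `SubsetDefinition`, `SegmentationPlan` and `btree_map<SegmentSet, MergeStrategy>`; the name `AddTableKeyedSegmentsSketch` is hypothetical and not part of the commit.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-ins for the real types: a segment is reduced to a set of
// codepoints, and a plan maps segment id -> segment.
using Segment = std::set<uint32_t>;

struct Plan {
  std::map<uint32_t, Segment> segments;
  std::vector<uint32_t> non_glyph_segments;  // ids of table keyed segments
};

// For each merge group, union its member segments, remove anything already
// covered by the initial segment, and append the result under a fresh id.
void AddTableKeyedSegmentsSketch(
    Plan& plan, const std::vector<std::set<uint32_t>>& merge_groups,
    const std::vector<Segment>& segments, const Segment& init_segment) {
  // New ids start one past the largest id already in the plan (1 if empty).
  uint32_t next_id =
      plan.segments.empty() ? 1 : plan.segments.rbegin()->first + 1;

  for (const auto& group : merge_groups) {
    Segment merged;
    for (uint32_t s : group) {
      const Segment& member = segments.at(s);
      merged.insert(member.begin(), member.end());
    }
    for (uint32_t cp : init_segment) {
      merged.erase(cp);
    }
    plan.segments[next_id] = merged;
    plan.non_glyph_segments.push_back(next_id);
    next_id++;
  }
}
```

As in the diff, ids are allocated after the current maximum so that existing glyph keyed segments in the plan are never overwritten.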

ift/encoder/closure_glyph_segmenter.h

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,13 +4,15 @@
44
#include <optional>
55
#include <vector>
66

7+
#include "absl/container/btree_map.h"
78
#include "absl/status/statusor.h"
89
#include "ift/encoder/glyph_segmentation.h"
910
#include "ift/encoder/merge_strategy.h"
1011
#include "ift/encoder/segmentation_context.h"
1112
#include "ift/encoder/subset_definition.h"
1213
#include "ift/freq/probability_calculator.h"
1314
#include "util/common.pb.h"
15+
#include "util/segmentation_plan.pb.h"
1416
#include "util/segmenter_config.pb.h"
1517

1618
namespace ift::encoder {
@@ -89,6 +91,12 @@ class ClosureGlyphSegmenter {
8991
uint32_t& fallback_glyphs_size,
9092
uint32_t& all_glyphs_size) const;
9193

94+
static void AddTableKeyedSegments(
95+
SegmentationPlan& plan,
96+
const absl::btree_map<common::SegmentSet, MergeStrategy>& merge_groups,
97+
const std::vector<SubsetDefinition>& segments,
98+
const SubsetDefinition& init_segment);
99+
92100
private:
93101
uint32_t brotli_quality_;
94102
uint32_t init_font_merging_brotli_quality_;

util/BUILD

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,9 +64,15 @@ cc_binary(
6464
srcs = [
6565
"font2ift.cc",
6666
],
67+
data = [
68+
"@ift_encoder_data//:freq_data",
69+
],
6770
deps = [
71+
":auto_config_flags",
72+
":auto_segmenter_config",
6873
":load_codepoints",
6974
":segmentation_plan_cc_proto",
75+
":segmenter_config_util",
7076
"//common",
7177
"//ift",
7278
"//ift/encoder",
@@ -76,6 +82,7 @@ cc_binary(
7682
"@abseil-cpp//absl/status:statusor",
7783
"@abseil-cpp//absl/strings",
7884
"@harfbuzz",
85+
"//util:segmenter_config_cc_proto",
7986
],
8087
)
8188

@@ -103,6 +110,7 @@ cc_binary(
103110
"@ift_encoder_data//:freq_data",
104111
],
105112
deps = [
113+
":auto_config_flags",
106114
":auto_segmenter_config",
107115
":load_codepoints",
108116
":segmentation_plan_cc_proto",
@@ -138,6 +146,16 @@ cc_binary(
138146
],
139147
)
140148

149+
cc_library(
150+
name = "auto_config_flags",
151+
srcs = ["auto_config_flags.cc"],
152+
hdrs = ["auto_config_flags.h"],
153+
visibility = ["//visibility:public"],
154+
deps = [
155+
"@abseil-cpp//absl/flags:flag",
156+
],
157+
)
158+
141159
cc_library(
142160
name = "convert_iftb",
143161
srcs = [
@@ -203,10 +221,12 @@ cc_library(
203221
],
204222
deps = [
205223
":load_codepoints",
224+
":segmentation_plan_cc_proto",
206225
":segmenter_config_cc_proto",
207226
"//common",
208227
"//ift/encoder",
209228
"@abseil-cpp//absl/status:statusor",
229+
"@harfbuzz",
210230
],
211231
)
212232

util/auto_config_flags.cc

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
#include "util/auto_config_flags.h"
2+
3+
#include <string>
4+
5+
#include "absl/flags/flag.h"
6+
7+
ABSL_FLAG(int, auto_config_quality, 0,
8+
"The quality level to use when generating a segmenter config. A value of 0 "
9+
"means auto pick. Valid values are 1-8.");
10+
11+
ABSL_FLAG(std::string, auto_config_primary_script, "Script_latin",
12+
"When auto_config is enabled this sets the primary script or "
13+
"language frequency data file to use. "
14+
"The primary script is eligible to have codepoints moved to the init font. "
15+
"For CJK primary script can be used to specialize against a specific language/script.");

util/auto_config_flags.h

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
#ifndef UTIL_AUTO_CONFIG_FLAGS_H_
2+
#define UTIL_AUTO_CONFIG_FLAGS_H_
3+
4+
#include <string>
5+
6+
#include "absl/flags/declare.h"
7+
8+
ABSL_DECLARE_FLAG(int, auto_config_quality);
9+
ABSL_DECLARE_FLAG(std::string, auto_config_primary_script);
10+
11+
#endif // UTIL_AUTO_CONFIG_FLAGS_H_
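
The `auto_config_quality` flag uses 0 to mean "auto pick", while `AutoSegmenterConfig::GenerateConfig` takes a `std::optional<int>`. The diff does not show how `font2ift` bridges the two, so the following is only a sketch of one plausible conversion. The helper name `QualityLevelFromFlag` is hypothetical, and treating out-of-range values as "auto pick" is an assumption.

```cpp
#include <optional>

// Hypothetical helper (not part of the commit): converts the raw integer flag
// value into the optional form accepted by GenerateConfig().
std::optional<int> QualityLevelFromFlag(int flag_value) {
  if (flag_value < 1 || flag_value > 8) {
    // 0 (the flag default) or any out-of-range value: defer to auto picking.
    return std::nullopt;
  }
  return flag_value;
}
```

A caller would pass the result of `absl::GetFlag(FLAGS_auto_config_quality)` through a helper like this before invoking the config generator.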

util/auto_segmenter_config.cc

Lines changed: 18 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,6 @@
22

33
#include <cctype>
44
#include <string>
5-
#include <unordered_map>
65

76
#include "absl/container/flat_hash_set.h"
87
#include "absl/log/log.h"
@@ -50,12 +49,6 @@ enum Quality {
5049
MAX = 8, // Alias for EIGHT
5150
};
5251

53-
// TODO(garretrieger): define a very basic set of quality levels first (see next TODO),
54-
// start with just a lowest and highest to set the upper and lower bounds for quality
55-
// settings (maybe also a mid point). To begin use number of codepoints to select quality
56-
// level. Do some testing on segmentation times at low and high to get a sense of
57-
// how times are impacted.
58-
5952
// TODO(garretrieger): do something analogous to brotli quality levels
6053
// where we define a series of levels which correspond to a set of
6154
// values for the quality/performance tradeoff settings (including setting the
@@ -291,7 +284,7 @@ StatusOr<std::string> AutoSegmenterConfig::GetBaseScriptForLanguage(
291284
}
292285

293286
static const auto* lang_to_script =
294-
new std::unordered_map<std::string, std::string>{
287+
new flat_hash_map<std::string, std::string> {
295288
{"Language_af", "Script_latin"},
296289
{"Language_ak", "Script_latin"},
297290
{"Language_am", "Script_ethiopic"},
@@ -602,7 +595,7 @@ static void ApplyQualityLevelTo(Quality quality, SegmenterConfig& config) {
602595
}
603596
}
604597

605-
absl::StatusOr<SegmenterConfig> AutoSegmenterConfig::GenerateConfig(
598+
StatusOr<SegmenterConfig> AutoSegmenterConfig::GenerateConfig(
606599
hb_face_t* face, std::optional<std::string> primary_script, std::optional<int> quality_level) {
607600
SegmenterConfig config;
608601
config.set_generate_table_keyed_segments(true);
@@ -617,9 +610,22 @@ absl::StatusOr<SegmenterConfig> AutoSegmenterConfig::GenerateConfig(
617610
auto freq_list = TRY(BuiltInFrequenciesList());
618611
CodepointSet unicodes = FontHelper::ToCodepointsSet(face);
619612
uint32_t cp_count = unicodes.size();
620-
Quality quality = cp_count > 2000 ? MIN : MAX;
621-
if (quality_level.has_value() && quality_level.value() >= ONE && quality_level.value() <= MAX) {
613+
614+
// TODO(garretrieger): more sophisticated scheme for auto picking quality level.
615+
// roughly we want to estimate the expected cost of each quality level and pick
616+
// based on that.
617+
Quality quality = THREE;
618+
if (cp_count <= 1000) {
619+
quality = MAX;
620+
} else if (cp_count <= 3000) {
621+
quality = SIX;
622+
}
623+
624+
if (quality_level.has_value() && quality_level.value() >= MIN && quality_level.value() <= MAX) {
622625
quality = static_cast<Quality>(quality_level.value());
626+
VLOG(0) << "Using specified quality level for segmenting: " << quality;
627+
} else {
628+
VLOG(0) << "Quality level unspecified, auto picked: " << quality;
623629
}
624630

625631
// Detect scripts by intersection with frequency data
@@ -644,7 +650,6 @@ absl::StatusOr<SegmenterConfig> AutoSegmenterConfig::GenerateConfig(
644650

645651
cost->set_built_in_freq_data_name(script);
646652
if (script == primary_script_file) {
647-
// TODO(garretrieger): customize these values based on the quality level
648653
cost->set_initial_font_merge_threshold(-60);
649654
}
650655
}
@@ -654,4 +659,4 @@ absl::StatusOr<SegmenterConfig> AutoSegmenterConfig::GenerateConfig(
654659
return config;
655660
}
656661

657-
} // namespace util
662+
} // namespace util
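
The quality selection added to `GenerateConfig` can be summarized as: use the requested level when it is within range, otherwise pick a default from the font's codepoint count. The sketch below restates that logic as a standalone function. It assumes `MIN` is 1 (the enum's lower bound is not shown in this diff) and that fonts with 1,001 to 3,000 codepoints are meant to default to level 6; the name `PickQualitySketch` is hypothetical.

```cpp
#include <cstdint>
#include <optional>

// Sketch of the selection logic. Thresholds mirror the diff; the function
// name and the value of kMin are assumptions.
int PickQualitySketch(uint32_t codepoint_count, std::optional<int> requested) {
  constexpr int kMin = 1;  // assumed value of Quality::MIN
  constexpr int kMax = 8;  // Quality::MAX

  // An in-range requested level always wins over the auto-picked default.
  if (requested.has_value() && *requested >= kMin && *requested <= kMax) {
    return *requested;
  }

  // Otherwise smaller fonts get more effort, since segmenting them is cheaper.
  if (codepoint_count <= 1000) return kMax;
  if (codepoint_count <= 3000) return 6;
  return 3;
}
```

For example, a font with a few hundred codepoints and no explicit level would be segmented at level 8, while a large CJK font would fall back to level 3 unless a level is requested.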
