Commit efaf552 (parent b3feef9)
Author: Hendrik van Antwerpen

Update README example

crates/bpe/README.md: 17 additions & 16 deletions
## Novel Algorithm

At first glance, it seems impossible to achieve `O(n)` complexity while preserving the encoding output of the original BPE algorithm, since the original BPE algorithm needs to first scan the full input before it can make any encoding decision.
For instance, the sequence `abacb` would be encoded as `ab a cb` when the dictionary contains the tokens `a b c ab cb ac bb cbb acbb` ordered by frequency. But appending a single character, yielding `abacbb`, results in a quite different tokenization: `ab acbb`. So without looking ahead it seems impossible to properly tokenize the text.

The solution is to track the encodings of ALL text prefixes. For our example `abacbb` we would get:

- `a` -------> `a`
- `ab` ------> `ab`
- `aba` -----> `ab a`
- `abac` ----> `ab ac`
- `abacb` ---> `ab a cb`
- `abacbb` --> `ab acbb`

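As an illustration, the table above can be reproduced by naively re-encoding every prefix from scratch (quadratic overall, so purely for exposition; the toy encoder and the merge-rule spelling of the example dictionary below are assumptions for this sketch, not the crate's actual API):

```rust
/// Toy byte-pair encoder: repeatedly applies the highest-priority merge
/// rule (most frequent token first) to the leftmost matching pair.
fn bpe_encode(text: &str, merges: &[(&str, &str, &str)]) -> Vec<String> {
    // Start from single characters.
    let mut seq: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        let mut merged = false;
        'rules: for &(left, right, result) in merges {
            for i in 0..seq.len().saturating_sub(1) {
                if seq[i] == left && seq[i + 1] == right {
                    // Replace the adjacent pair with the merged token.
                    seq.splice(i..=i + 1, [result.to_string()]);
                    merged = true;
                    break 'rules;
                }
            }
        }
        if !merged {
            return seq;
        }
    }
}

fn main() {
    // The example dictionary `a b c ab cb ac bb cbb acbb`, written as
    // merge rules (left part, right part, merged token) in priority order.
    let merges = [
        ("a", "b", "ab"),
        ("c", "b", "cb"),
        ("a", "c", "ac"),
        ("b", "b", "bb"),
        ("cb", "b", "cbb"),
        ("a", "cbb", "acbb"),
    ];
    let text = "abacbb";
    for end in 1..=text.len() {
        let prefix = &text[..end];
        println!("{} -> {}", prefix, bpe_encode(prefix, &merges).join(" "));
    }
}
```

Running this prints exactly the six prefix encodings listed above, including the jump from `ab a cb` to `ab acbb` on the final character.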
This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:

- `a` -------> `a`
- `ab` ------> `ab`
- `aba` -----> `a`
- `abac` ----> `ac`
- `abacb` ---> `cb`
- `abacbb` --> `acbb`

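As a mechanical check of this bookkeeping, the following sketch takes the full prefix encodings (hardcoded from the first table) as input and keeps only the last token of each, which yields the second table:

```rust
/// Keep only the last token of each prefix encoding; by Corollary IIa
/// this is all that has to be remembered per prefix.
fn last_tokens<'a>(prefix_encodings: &[Vec<&'a str>]) -> Vec<&'a str> {
    prefix_encodings.iter().map(|enc| *enc.last().unwrap()).collect()
}

fn main() {
    // Full encodings of the prefixes of `abacbb`, copied from the table above.
    let encodings = vec![
        vec!["a"],
        vec!["ab"],
        vec!["ab", "a"],
        vec!["ab", "ac"],
        vec!["ab", "a", "cb"],
        vec!["ab", "acbb"],
    ];
    // Storing one token per prefix is O(n) instead of O(n^2) tokens.
    assert_eq!(last_tokens(&encodings), ["a", "ab", "a", "ac", "cb", "acbb"]);
}
```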
In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, shortens the prefix by the extracted token, looks up the token associated with the shortened prefix, and so on until the beginning of the text is reached.

For our example prefix `abacbb`, this procedure executes the following steps and determines the correct encoding in reverse order:

- `abacbb` --> `acbb`
- `ab` ------> `ab`
- `<empty>`

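This backward walk is straightforward to sketch; the `last_token` table below is the one from the example, and indexing it by prefix length is an illustrative choice, not the crate's actual data layout:

```rust
/// Reconstruct the full encoding of a prefix from the per-prefix
/// last-token table by walking backwards, as described above.
fn reconstruct(prefix_len: usize, last_token: &[&str]) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut len = prefix_len;
    while len > 0 {
        // last_token[len - 1] is the last token of the prefix of length `len`.
        let tok = last_token[len - 1];
        tokens.push(tok.to_string());
        len -= tok.len(); // shorten the prefix by the extracted token
    }
    tokens.reverse(); // tokens were collected back-to-front
    tokens
}

fn main() {
    // Last tokens for the prefixes of `abacbb`, from the table above.
    let last_token = ["a", "ab", "a", "ac", "cb", "acbb"];
    // `abacbb` (length 6): extract `acbb`, then `ab`, then reach `<empty>`.
    assert_eq!(reconstruct(6, &last_token), ["ab", "acbb"]);
    // Any shorter prefix works the same way, e.g. `abacb`.
    assert_eq!(reconstruct(5, &last_token), ["ab", "a", "cb"]);
}
```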
The actual challenge is to determine for every prefix this last token efficiently.
