crates/bpe/README.md (2 additions & 1 deletion)
@@ -203,7 +203,7 @@ We benchmarked the following scenarios:
 The data structure we built specifically for this purpose can answer those interval counting requests in typically constant time after the initial linear preprocessing of the text.
 This mode is not available in tiktoken, which only supports counting/encoding a complete text.
 
-All benchmarks were run single-threaded on a MacBook Pro M1.
+All benchmarks were run single-threaded on a MacBook Air M4.
 
 ### Encoding
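The constant-time interval counting claimed above can be illustrated with a much-simplified sketch. This is NOT the crate's actual data structure (the real problem is harder, because the interval's own BPE encoding can differ near its boundaries from the full text's encoding); it only shows the prefix-sum query pattern that turns linear preprocessing into constant-time range answers. All names here are made up for illustration.

```rust
// Illustrative sketch only, NOT the bpe crate's data structure.
// Given the byte length of each token in the full text's encoding (in order),
// one linear pass builds prefix[b] = number of tokens ending at or before
// byte boundary b. An interval query is then a constant-time subtraction.
fn preprocess(token_byte_lens: &[usize]) -> Vec<usize> {
    let mut prefix = vec![0usize];
    let mut ended = 0;
    for &len in token_byte_lens {
        for _ in 0..len.saturating_sub(1) {
            prefix.push(ended); // boundaries strictly inside the token
        }
        ended += 1;
        prefix.push(ended); // boundary at the token's end
    }
    prefix
}

/// Number of tokens ending inside the byte interval (start, end].
fn count_tokens(prefix: &[usize], start: usize, end: usize) -> usize {
    prefix[end] - prefix[start]
}
```

The real structure must additionally account for tokens that would re-merge differently when the interval is tokenized in isolation, which is why its query time is only *typically* constant.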
@@ -219,6 +219,7 @@ Two additional encoders are included that are faster but deviate from the origin
 
 - The greedy encoder picks the left-longest token.
 - The minimal encoder computes an encoding with the minimal number of tokens.
+- The minimal_dropout encoder implements the BPE-Dropout [algorithm](https://arxiv.org/abs/1910.13267), randomly ignoring some multi-byte tokens at runtime. Note that this implementation differs from the paper and **has not** been tested in an actual language model training pipeline.
 
 The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000-token original text using the o200k token set.
 (All encodings were computed from scratch for each slice.)
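The greedy strategy from the list above can be sketched in a few lines. This is a simplified standalone illustration, not the crate's implementation (which uses precomputed token tables rather than a hash set); the function name and signature are made up for this example.

```rust
use std::collections::HashSet;

// Simplified sketch of a left-longest greedy encoder (not the crate's code):
// at each position, emit the longest token matching the remaining text.
// Assumes every single byte occurs as a token, so a match always exists.
fn encode_greedy(text: &[u8], tokens: &HashSet<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < text.len() {
        let mut end = i + 1; // fall back to the single-byte token
        for j in (i + 1)..=text.len() {
            if tokens.contains(&text[i..j]) {
                end = j; // remember the longest match so far
            }
        }
        out.push(text[i..end].to_vec());
        i = end;
    }
    out
}
```

A real implementation would bound the inner scan by the maximum token length instead of scanning to the end of the text.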
+/// This function computes the encoding while randomly rejecting some merges.
+/// The result of the encoding will be non-deterministic unless `seed` is provided.
+/// The implementation loosely follows the original BPE dropout paper: https://arxiv.org/abs/1910.13267
+///
+/// In more detail: the tokenization uses dynamic programming, i.e. it models the tokenization as a graph,
+/// where every position between text bytes is a node and two nodes are connected when the text slice between those two nodes matches a token.
+// It then tries to find the shortest possible path from the beginning of the text to the end, i.e. it finds the shortest possible encoding.
+// For this, nodes are processed from right to left. At each node, the edges starting at that node and ending on the right are tested, and
+// the one producing the shortest path is stored together with the length of the shortest path to that node.
+// The length of the shortest path is stored as the second value, the edge (or rather the token) is stored as the first value.
+// Then, we walk in reverse direction through the table along the shortest path.
+// Note: the reason for constructing the table from back to front is that
+// the reconstruction outputs the path from start to end (i.e. we don't have to reverse the path afterwards).
+//
+// For the dropout (when dropout > 0.0), we uniformly drop edges from the graph, but always keep the one-byte tokens so that the graph stays connected.
+// Note: this is very different from how BPE works and cannot produce the same output as the algorithm
+// in the [paper's repository](https://github.com/VProv/BPE-Dropout/blob/master/bpe.py#L98), for two main reasons:
+// - `encode_minimal` already doesn't follow the original heap-based BPE procedure
+// - The BPE-dropout authors discard all multi-byte tokens for each word separately, while this implementation does not split the "sentence" into words first
+//   and hence may include a previously discarded token later down the byte stream. At the sentence level, though, we don't expect this to make much difference.
+// Also, this implementation of BPE constructs merges on the fly from the set of tokens, and hence might come up with a different set of merges for the same dictionary.
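The dynamic program described in the comments above can be sketched as a standalone function. This is a simplified illustration, not the crate's actual code: the function name, the hash-set token lookup, and the injected random-number closure are all made up for the example, and it assumes every single byte occurs as a token so the graph stays connected.

```rust
use std::collections::HashSet;

// Sketch (not the crate's code) of shortest-path tokenization with dropout:
// nodes are byte boundaries, an edge (i, j) exists when text[i..j] is a token,
// and we keep, per node, the edge starting the shortest path to the end.
fn encode_minimal_dropout(
    text: &[u8],
    tokens: &HashSet<Vec<u8>>,
    dropout: f64,
    mut rand: impl FnMut() -> f64, // uniform in [0, 1); injected for determinism
) -> Vec<Vec<u8>> {
    let n = text.len();
    // best[i] = (end of the chosen edge leaving node i, length of shortest path i -> n)
    let mut best: Vec<(usize, usize)> = vec![(0, usize::MAX); n + 1];
    best[n] = (0, 0);
    // Process nodes from right to left so the reconstruction below
    // emits tokens front to back without reversing.
    for i in (0..n).rev() {
        for j in (i + 1)..=n {
            if best[j].1 == usize::MAX || !tokens.contains(&text[i..j]) {
                continue;
            }
            // Uniformly drop multi-byte edges; always keep one-byte tokens
            // so that the graph stays connected.
            if j - i > 1 && rand() < dropout {
                continue;
            }
            if best[j].1 + 1 < best[i].1 {
                best[i] = (j, best[j].1 + 1);
            }
        }
    }
    // Walk forward along the stored edges to output the encoding.
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].0;
        out.push(text[i..j].to_vec());
        i = j;
    }
    out
}
```

With `dropout = 0.0` this degenerates to plain minimal encoding; with `dropout = 1.0` every multi-byte edge is dropped and the text is encoded byte by byte. A real implementation would bound the inner loop by the maximum token length rather than scanning to the end of the text.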