 # string-offsets
 
-This crate converts string positions between Rust style (UTF-8 byte offsets) and styles used by other programming languages, as well as line numbers.
+Offset calculator to convert between byte, char, and line offsets in a string.
+
+Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
+Unicode code points. It's therefore necessary to adjust string offsets when communicating across
+programming language boundaries. [`StringOffsets`] does these adjustments.
+
+Each `StringOffsets` value contains offset information for a single string. [Building the data
+structure](StringOffsets::new) takes O(n) time and memory, but then each conversion is fast.
+
+["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
+blog post explaining the implementation.
 
 ## Usage
 
@@ -10,3 +20,26 @@ Add this to your `Cargo.toml`:
 [dependencies]
 string-offsets = "0.1"
 ```
+
+Then:
+
+```rust
+use string_offsets::StringOffsets;
+
+let s = "☀️hello\n🗺️world\n";
+let offsets = StringOffsets::new(s);
+
+// Find offsets where lines begin and end.
+assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
+
+// Translate string offsets between UTF-8 and other encodings.
+// This map emoji is 7 UTF-8 bytes...
+assert_eq!(&s[12..19], "🗺️");
+// ...but only 3 UTF-16 code units...
+assert_eq!(offsets.utf8_to_utf16(12), 8);
+assert_eq!(offsets.utf8_to_utf16(19), 11);
+// ...and only 2 Unicode characters.
+assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
+```
+
+See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.
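
To check the numbers in the README example independently of the crate, here is a minimal sketch that uses only the Rust standard library. The `naive_*` helpers are hypothetical names made up for this illustration. They are not part of `string-offsets`, and they rescan the string on every call (O(n) per conversion) rather than building an index once as `StringOffsets::new` is described as doing. The emoji are written as escapes so the variation selector (U+FE0F) following each one is visible.

```rust
use std::ops::Range;

/// Hypothetical helper (not the crate's API): number of UTF-16 code units
/// that precede `byte_offset`. Panics if the offset is not on a char boundary.
fn naive_utf8_to_utf16(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().map(char::len_utf16).sum()
}

/// Hypothetical helper: number of Unicode scalar values (Rust `char`s)
/// that precede `byte_offset`.
fn naive_utf8_to_char(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

/// Hypothetical helper: byte range of the 0-based `line`, including its
/// trailing '\n'. Returns `None` if the string has fewer lines.
fn naive_line_to_utf8s(s: &str, line: usize) -> Option<Range<usize>> {
    let mut start = 0;
    for (i, text) in s.split_inclusive('\n').enumerate() {
        if i == line {
            return Some(start..start + text.len());
        }
        start += text.len();
    }
    None
}

fn main() {
    // Same string as the README example: sun emoji + "hello\n" + map emoji + "world\n".
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";

    // The same text has a different length in each unit.
    assert_eq!(s.len(), 25); // UTF-8 bytes
    assert_eq!(s.encode_utf16().count(), 17); // UTF-16 code units
    assert_eq!(s.chars().count(), 16); // Unicode scalar values

    // Line 0 occupies bytes 0..12; the map emoji occupies bytes 12..19.
    assert_eq!(naive_line_to_utf8s(s, 0), Some(0..12));
    assert_eq!(naive_utf8_to_utf16(s, 12), 8);
    assert_eq!(naive_utf8_to_utf16(s, 19), 11);
    assert_eq!(naive_utf8_to_char(s, 12), 8);
    assert_eq!(naive_utf8_to_char(s, 19), 10);
}
```

The rescanning approach is fine for a one-off check, but its cost grows with the offset on every call, which is the situation the precomputed `StringOffsets` structure is meant to avoid.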