Commit ef32575

Flesh out string-offsets README
1 parent fd056fb commit ef32575

2 files changed, +56 -1 lines changed

crates/string-offsets/README.md

Lines changed: 34 additions & 1 deletion
@@ -1,6 +1,16 @@
 # string-offsets

-This crate converts string positions between Rust style (UTF-8 byte offsets) and styles used by other programming languages, as well as line numbers.
+Offset calculator to convert between byte, char, and line offsets in a string.
+
+Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
+Unicode code points. It's therefore necessary to adjust string offsets when communicating across
+programming language boundaries. [`StringOffsets`] does these adjustments.
+
+Each `StringOffsets` value contains offset information for a single string. [Building the data
+structure](StringOffsets::new) takes O(n) time and memory, but then each conversion is fast.
+
+["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
+blog post explaining the implementation.

 ## Usage

@@ -10,3 +20,26 @@ Add this to your `Cargo.toml`:
 [dependencies]
 string-offsets = "0.1"
 ```
+
+Then:
+
+```rust
+use string_offsets::StringOffsets;
+
+let s = "☀️hello\n🗺️world\n";
+let offsets = StringOffsets::new(s);
+
+// Find offsets where lines begin and end.
+assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
+
+// Translate string offsets between UTF-8 and other encodings.
+// This map emoji is 7 UTF-8 bytes...
+assert_eq!(&s[12..19], "🗺️");
+// ...but only 3 UTF-16 code units...
+assert_eq!(offsets.utf8_to_utf16(12), 8);
+assert_eq!(offsets.utf8_to_utf16(19), 11);
+// ...and only 2 Unicode characters.
+assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
+```
+
+See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.
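
Note (not part of this commit): the offsets asserted in the README example can be cross-checked using only the Rust standard library. The sketch below is an illustration under that assumption, not the crate's implementation; the `*_naive` function names are hypothetical. Each call rescans the string from the start, so every query costs O(n), which is the cost the precomputed `StringOffsets` structure is meant to avoid.

```rust
use std::ops::Range;

/// Hypothetical helper: UTF-16 code units in the prefix ending at `byte_offset`.
/// Panics if `byte_offset` is not on a char boundary.
fn utf8_to_utf16_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Hypothetical helper: Unicode scalar values (Rust `char`s) in the prefix
/// ending at `byte_offset`. Panics if `byte_offset` is not on a char boundary.
fn utf8_to_char_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

/// Hypothetical helper: UTF-8 byte range of the 0-based `line`, including its
/// trailing '\n' if present. Returns `None` if the line does not exist.
fn line_to_utf8s_naive(s: &str, line: usize) -> Option<Range<usize>> {
    let mut start = 0;
    for (i, l) in s.split_inclusive('\n').enumerate() {
        let end = start + l.len();
        if i == line {
            return Some(start..end);
        }
        start = end;
    }
    None
}
```

Both emoji in the example string carry a trailing variation selector (U+FE0F), which is why the sun is 6 UTF-8 bytes and the map is 7, and why each counts as 2 Unicode characters.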

crates/string-offsets/src/lib.rs

Lines changed: 22 additions & 0 deletions
@@ -1,5 +1,27 @@
 //! Offset calculator to convert between byte, char, and line offsets in a string.
 //!
+//!
+//! # Example
+//!
+//! ```
+//! use string_offsets::StringOffsets;
+//!
+//! let s = "☀️hello\n🗺️world\n";
+//! let offsets = StringOffsets::new(s);
+//!
+//! // Find offsets where lines begin and end.
+//! assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
+//!
+//! // Translate string offsets between UTF-8 and other encodings.
+//! // This map emoji is 7 UTF-8 bytes...
+//! assert_eq!(&s[12..19], "🗺️");
+//! // ...but only 3 UTF-16 code units...
+//! assert_eq!(offsets.utf8_to_utf16(12), 8);
+//! assert_eq!(offsets.utf8_to_utf16(19), 11);
+//! // ...and only 2 Unicode characters.
+//! assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
+//! ```
+//!
 //! See [`StringOffsets`] for details.
 #![deny(missing_docs)]
