
Commit 1a6760f

Support UTF-8 string literals
1 parent 2a48db6 commit 1a6760f

1 file changed

Lines changed: 26 additions & 3 deletions


standard/lexical-structure.md

Original file line number | Diff line number | Diff line change
@@ -920,14 +920,16 @@ A verbatim string literal consists of an `@` character followed by a double-quo
920920
921921
In a verbatim string literal, the characters between the delimiters are interpreted verbatim, with the only exception being a *Quote_Escape_Sequence*, which represents one double-quote character. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim string literals. A verbatim string literal may span multiple lines.
922922
923+
All string literal forms may optionally have a trailing *Utf8_Suffix*. The representation of each form is discussed below.
924+
923925
```ANTLR
924926
String_Literal
925927
: Regular_String_Literal
926928
| Verbatim_String_Literal
927929
;
928930
929931
fragment Regular_String_Literal
930-
: '"' Regular_String_Literal_Character* '"'
932+
: '"' Regular_String_Literal_Character* '"' Utf8_Suffix?
931933
;
932934
933935
fragment Regular_String_Literal_Character
@@ -943,7 +945,7 @@ fragment Single_Regular_String_Literal_Character
943945
;
944946
945947
fragment Verbatim_String_Literal
946-
: '@"' Verbatim_String_Literal_Character* '"'
948+
: '@"' Verbatim_String_Literal_Character* '"' Utf8_Suffix?
947949
;
948950
949951
fragment Verbatim_String_Literal_Character
@@ -958,6 +960,10 @@ fragment Single_Verbatim_String_Literal_Character
958960
fragment Quote_Escape_Sequence
959961
: '""'
960962
;
963+
964+
fragment Utf8_Suffix
965+
: 'u8' | 'U8'
966+
;
961967
```
962968
963969
> *Example*: The example
@@ -990,7 +996,24 @@ fragment Quote_Escape_Sequence
990996
<!-- markdownlint-enable MD028 -->
991997
> *Note*: Since a hexadecimal escape sequence can have a variable number of hex digits, the string literal `"\x123"` contains a single character with hex value `123`. To create a string containing the character with hex value `12` followed by the character `3`, one could write `"\x00123"` or `"\x12"` + `"3"` instead. *end note*
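The behavior described in this note can be checked with a small C# sketch. The class name `HexEscapeDemo` is hypothetical and exists only for this illustration; the expected values in the comments follow from the rule that `\x` consumes up to four hex digits.

```csharp
using System;

class HexEscapeDemo
{
    static void Main()
    {
        // "\x123" is a single character, U+0123.
        string one = "\x123";
        Console.WriteLine(one.Length);      // 1
        Console.WriteLine((int)one[0]);     // 291 (0x123)

        // "\x00123" is read as \x0012 followed by the character '3'.
        string two = "\x00123";
        Console.WriteLine(two.Length);      // 2

        // Splitting the literal ends the escape sequence explicitly.
        string three = "\x12" + "3";
        Console.WriteLine(two == three);    // True
    }
}
```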
992998
993-
The type of a *String_Literal* is `string`.
999+
A *String_Literal* that does not contain a *Utf8_Suffix* is a ***UTF-16 string literal***, whose type is `string`.
1000+
1001+
A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal***, whose type is `System.ReadOnlySpan<byte>` (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its *Utf8_Suffix* shall be valid UTF-16. (For example, `"\uDC00\uDD00"u8` is ill-formed as one low surrogate cannot be followed by another.)
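As a rough illustration of the two literal kinds described above, the following C# sketch assumes a compiler and runtime that implement UTF-8 string literals (for example, C# 11 on .NET 7 or later). The class name `Utf8LiteralDemo` is made up for the example, and the null terminator is not observed directly because it lies outside the span.

```csharp
using System;
using System.Text;

class Utf8LiteralDemo
{
    static void Main()
    {
        // UTF-16 string literal: the type is string.
        string utf16 = "Hello";

        // UTF-8 string literal: the type is ReadOnlySpan<byte>.
        ReadOnlySpan<byte> utf8 = "Hello"u8;

        // The span length counts UTF-8 bytes and excludes the null terminator.
        Console.WriteLine(utf8.Length);     // 5
        Console.WriteLine(utf8[0]);         // 72, the byte for 'H'

        // A non-ASCII character: one UTF-16 code unit, two UTF-8 bytes.
        ReadOnlySpan<byte> eAcute = "\u00E9"u8;
        Console.WriteLine("\u00E9".Length); // 1
        Console.WriteLine(eAcute.Length);   // 2

        // Decoding the bytes gives back the original text.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == utf16); // True
    }
}
```

Because a UTF-8 string literal is not a constant, it cannot be used to initialize a `const` declaration; the locals above are ordinary variables.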
1002+
1003+
> *Note*: While every UTF-8 string literal is a `ReadOnlySpan<byte>`, not every `ReadOnlySpan<byte>` represents a UTF-8 string literal. See the description of UTF-8 string concatenation in [§12.13.5](expressions.md#12135-addition-operator). *end note*
1004+
<!-- markdownlint-disable MD028 -->
1005+
1006+
<!-- markdownlint-enable MD028 -->
1007+
> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type parameter ([§16.2.3](structs.md#1623-ref-modifier)). *end note*
1008+
<!-- markdownlint-disable MD028 -->
1009+
1010+
<!-- markdownlint-enable MD028 -->
1011+
> *Example*: Here are examples of each form of string literal:
1012+
> | **Encoding** | **Type** | **Regular String Literal** | **Verbatim String Literal** | **Raw String Literal** |
1013+
> |--------------|----------------------|---------------------|--------------------|--------------------|
1014+
> | UTF-16 | `string` | `"Hello"` | `@"Hello"` | `"""Hello"""` |
1015+
> | UTF-8 | `ReadOnlySpan<byte>` | `"Hello"u8` | `@"Hello"u8` | `"""Hello"""u8` |
1016+
> *end example*
9941017
9951018
Each string literal does not necessarily result in a new string instance. When two or more string literals that are equivalent according to the string equality operator ([§12.15.8](expressions.md#12158-string-equality-operators)), appear in the same assembly, these string literals refer to the same string instance.
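The sharing of equivalent string literals can be observed with a short sketch like the one below. The class name `LiteralIdentityDemo` is hypothetical, and the run-time-constructed string is included only as a contrast. Its reference identity is not something the paragraph above specifies, so no expected output is given for that line.

```csharp
using System;

class LiteralIdentityDemo
{
    static void Main()
    {
        // Two equivalent literals in the same assembly refer to one instance.
        string a = "hello";
        string b = "hello";
        Console.WriteLine(object.ReferenceEquals(a, b)); // True

        // A string built at run time has equal content...
        string c = new string(new[] { 'h', 'e', 'l', 'l', 'o' });
        Console.WriteLine(a == c);                       // True

        // ...but the literal-sharing rule makes no promise about its identity.
        Console.WriteLine(object.ReferenceEquals(a, c));
    }
}
```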
9961019
