Commit 5047856

RexJaeschke authored and BillWagner committed
Support UTF-8 string literals
1 parent cf64a32 commit 5047856

1 file changed

Lines changed: 26 additions & 3 deletions

standard/lexical-structure.md

@@ -906,14 +906,16 @@ A verbatim string literal consists of an `@` character followed by a double-quo

 In a verbatim string literal, the characters between the delimiters are interpreted verbatim, with the only exception being a *Quote_Escape_Sequence*, which represents one double-quote character. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim string literals. A verbatim string literal may span multiple lines.

+All string literal forms may optionally have a trailing *Utf8_Suffix*. The representation of each form is discussed below.
+
 ```ANTLR
 String_Literal
     : Regular_String_Literal
     | Verbatim_String_Literal
     ;

 fragment Regular_String_Literal
-    : '"' Regular_String_Literal_Character* '"'
+    : '"' Regular_String_Literal_Character* '"' Utf8_Suffix?
     ;

 fragment Regular_String_Literal_Character
@@ -929,7 +931,7 @@ fragment Single_Regular_String_Literal_Character
     ;

 fragment Verbatim_String_Literal
-    : '@"' Verbatim_String_Literal_Character* '"'
+    : '@"' Verbatim_String_Literal_Character* '"' Utf8_Suffix?
     ;

 fragment Verbatim_String_Literal_Character
@@ -944,6 +946,10 @@ fragment Single_Verbatim_String_Literal_Character
 fragment Quote_Escape_Sequence
     : '""'
     ;
+
+fragment Utf8_Suffix
+    : 'u8' | 'U8'
+    ;
 ```

 > *Example*: The example
@@ -976,7 +982,24 @@ fragment Quote_Escape_Sequence
 <!-- markdownlint-enable MD028 -->
 > *Note*: Since a hexadecimal escape sequence can have a variable number of hex digits, the string literal `"\x123"` contains a single character with hex value `123`. To create a string containing the character with hex value `12` followed by the character `3`, one could write `"\x00123"` or `"\x12"` + `"3"` instead. *end note*

-The type of a *String_Literal* is `string`.
+A *String_Literal* that does not contain a *Utf8_Suffix* is a ***UTF-16 string literal***, whose type is `string`.
+
+A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal***, whose type is `System.ReadOnlySpan<byte>` (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its *Utf8_Suffix* shall be valid UTF-16. (For example, `"\uDC00\uDD00"u8` is ill-formed as one low surrogate cannot be followed by another.)
+
+> *Note*: While every UTF-8 string literal is a `ReadOnlySpan<byte>`, not every `ReadOnlySpan<byte>` represents a UTF-8 string literal. See the description of UTF-8 string concatenation in [§12.13.5](expressions.md#12135-addition-operator). *end note*
+<!-- markdownlint-disable MD028 -->
+
+<!-- markdownlint-enable MD028 -->
+> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type argument ([§16.2.3](structs.md#1623-ref-modifier)). *end note*
+<!-- markdownlint-disable MD028 -->
+
+<!-- markdownlint-enable MD028 -->
+> *Example*: Here are examples of each form of string literal:
+> | **Encoding** | **Type** | **Regular String Literal** | **Verbatim String Literal** | **Raw String Literal** |
+> |--------------|----------------------|---------------------|--------------------|--------------------|
+> | UTF-16 | `string` | `"Hello"` | `@"Hello"` | `"""Hello"""` |
+> | UTF-8 | `ReadOnlySpan<byte>` | `"Hello"u8` | `@"Hello"u8` | `"""Hello"""u8` |
+> *end example*

 Each string literal does not necessarily result in a new string instance. When two or more string literals that are equivalent according to the string equality operator ([§12.15.8](expressions.md#12158-string-equality-operators)), appear in the same assembly, these string literals refer to the same string instance.

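
The typing rules added by this commit can be tried out with a short C# program. The following is a minimal sketch rather than part of the commit: it assumes a compiler that already implements the `u8` suffix (C# 11 or later), and the class name `Utf8LiteralDemo` and helper `HelloBytes` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the specification text.
public static class Utf8LiteralDemo
{
    // Copies the bytes of a UTF-8 string literal into an array for inspection.
    public static byte[] HelloBytes()
    {
        ReadOnlySpan<byte> utf8 = "Hello"u8;   // UTF-8 string literal
        return utf8.ToArray();
    }

    public static void Main()
    {
        string utf16 = "Hello";                      // UTF-16 string literal, type string
        ReadOnlySpan<byte> utf8 = "Hello"u8;         // regular form with Utf8_Suffix
        ReadOnlySpan<byte> verbatim = @"C:\tmp"u8;   // verbatim form: backslash is not an escape
        ReadOnlySpan<byte> accented = "\u00E9"u8;    // U+00E9 encodes as two UTF-8 bytes

        // Encode the UTF-16 string at run time to compare with the literal's bytes.
        ReadOnlySpan<byte> expected = Encoding.UTF8.GetBytes(utf16);

        Console.WriteLine(utf16.Length);                 // 5 (UTF-16 code units)
        Console.WriteLine(utf8.Length);                  // 5 (bytes; the null terminator is not counted)
        Console.WriteLine(verbatim.Length);              // 6
        Console.WriteLine(accented.Length);              // 2
        Console.WriteLine(utf8.SequenceEqual(expected)); // True
    }
}
```

The null terminator described in the new text sits outside the span, so it is not observable through `Length` or indexing, and the sketch does not attempt to read it.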
