Part of #23223 (M1 — standard-independent).
Summary
C23 lets u8 prefix a character constant (u8'c'), not just a string literal, giving it the new char8_t type. This is purely additive: u8'...' is a syntax error in C11, so accepting it cannot change the meaning of any valid C11 program. ImportC should lex u8'c' and map char8_t to D's char (a UTF-8 byte). No CLI flag or standard-selection is needed.
Spec
N3220 6.4.4.5 Character constants (adds the u8 encoding-prefix and the char8_t constant type); the char8_t type itself is normatively "an unsigned integer type ... the same type as unsigned char" per 7.30 <uchar.h>, within the integer/character type model of 6.2.5 Types.
Grammar (N3220 6.4.4.5; opt = optional)
character-constant:
    encoding-prefix(opt) ' c-char-sequence '
encoding-prefix: one of
    u8 u U L
c-char-sequence:
    c-char
    c-char-sequence c-char
c-char:
    any member of the source character set except the single-quote ', backslash \, or new-line character
    escape-sequence
escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
    universal-character-name
Constraints (N3220 6.4.4.5 p9-10):
- The value of an octal/hexadecimal escape sequence shall be in range of the corresponding type; for the u8 prefix the corresponding type is char8_t.
- "A UTF-8 ... character constant shall not contain more than one character. The value shall be representable with a single UTF-8 ... code unit, respectively." (footnote 73: "u8'ab' violates this constraint.")
Semantics relevant to the parser (N3220 6.4.4.5 p12): "A UTF-8 character constant has type char8_t. If the UTF-8 character constant is not produced through a hexadecimal or octal escape sequence, the value ... is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit."
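The constraints and the value rule quoted above can be sketched as a small standalone C function. This is only an illustration under stated assumptions, not DMD's lexer code: the names u8_status and u8_char_value are hypothetical, the input is assumed to be a NUL-terminated string whose source bytes are UTF-8, an ASCII-compatible character set is assumed, and universal-character-names are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; these are not DMD diagnostics. */
enum u8_status {
    U8_OK,
    U8_ERR_SYNTAX,     /* not of the form u8'...' or malformed escape */
    U8_ERR_MULTI_CHAR, /* more than one character, e.g. u8'ab' */
    U8_ERR_RANGE       /* value does not fit a single UTF-8 code unit */
};

/* Returns the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parses one u8 character constant at the start of s and stores its value. */
enum u8_status u8_char_value(const char *s, unsigned *out)
{
    assert(s != NULL && out != NULL);
    if (s[0] != 'u' || s[1] != '8' || s[2] != '\'')
        return U8_ERR_SYNTAX;

    const char *p = s + 3;
    unsigned value = 0;

    if (*p == '\\') {
        ++p;
        if (*p == 'x') {                      /* hexadecimal escape */
            ++p;
            if (hex_digit(*p) < 0)
                return U8_ERR_SYNTAX;
            for (; hex_digit(*p) >= 0; ++p) {
                value = value * 16 + (unsigned)hex_digit(*p);
                if (value > 0xFF)             /* out of range for char8_t */
                    return U8_ERR_RANGE;
            }
        } else if (*p >= '0' && *p <= '7') {  /* octal escape, 1 to 3 digits */
            for (int n = 0; n < 3 && *p >= '0' && *p <= '7'; ++n, ++p)
                value = value * 8 + (unsigned)(*p - '0');
            if (value > 0xFF)
                return U8_ERR_RANGE;
        } else {                              /* simple escape sequence */
            switch (*p++) {
            case '\'': value = '\''; break;
            case '"':  value = '"';  break;
            case '?':  value = '?';  break;
            case '\\': value = '\\'; break;
            case 'a':  value = '\a'; break;
            case 'b':  value = '\b'; break;
            case 'f':  value = '\f'; break;
            case 'n':  value = '\n'; break;
            case 'r':  value = '\r'; break;
            case 't':  value = '\t'; break;
            case 'v':  value = '\v'; break;
            default:   return U8_ERR_SYNTAX;  /* includes \u and \U here */
            }
        }
    } else if (*p == '\'' || *p == '\n' || *p == '\0') {
        return U8_ERR_SYNTAX;                 /* empty or unterminated */
    } else {
        unsigned char c = (unsigned char)*p++;
        if (c > 0x7F)       /* lead byte of a multi-byte UTF-8 sequence */
            return U8_ERR_RANGE;
        value = c;          /* code point equals the single code unit */
    }

    if (*p == '\0')
        return U8_ERR_SYNTAX;                 /* missing closing quote */
    if (*p != '\'')
        return U8_ERR_MULTI_CHAR;             /* e.g. u8'ab' */

    *out = value;
    return U8_OK;
}
```

With this sketch, the input u8'ab' is rejected with U8_ERR_MULTI_CHAR, u8'\x100' with U8_ERR_RANGE, and u8'\xFF' is accepted with the value 255.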
Approach
Two independent, additive pieces:
- char8_t type: provided as a typedef in druntime/src/importc.h (mapping to D char, i.e. a UTF-8 byte), matching how that header already supplies other implementation typedefs (druntime/src/importc.h:74). No new keyword and no frontend change is needed for the type name.
- u8'c' lexing: extend the existing C prefix-scanning block in compiler/src/dmd/lexer.d. The case 'u': case 'U': case 'L': block already handles wide char constants (compiler/src/dmd/lexer.d:497-499 calling clexerCharConstant) and already special-cases the u8"..." string literal (compiler/src/dmd/lexer.d:511-516). Add a sibling branch: when p[0]=='u' && p[1]=='8' && p[2]=='\'', advance past u8 and lex a single-code-unit UTF-8 character constant. Because char8_t is unsigned char, a u8'c' constant fits in a char byte; it can be emitted as TOK.charLiteral (which cparse.d maps to int/tint32, matching the integer-promoted use of a char8_t value) so no new token kind is required. Enforce the 6.4.4.5 p10 constraint (single UTF-8 code unit, value range of char8_t) and diagnose u8'ab' per footnote 73.
clexerCharConstant (compiler/src/dmd/lexer.d:2293) is the natural home for the new prefix handling; note its doc comment currently cites C11 6.4.4.4 — update it to C23 6.4.4.5 since C23 renumbered this clause and added the u8 prefix.
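The prefix dispatch described above can be sketched in C as follows. This is a hedged illustration of the control flow only, not the code in lexer.d: the enum c_prefix_kind and the function classify_c_prefix are hypothetical names, the input is assumed to be NUL-terminated, and every case other than the three named in the text is lumped into PREFIX_OTHER.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of what follows a 'u', 'U' or 'L'. */
enum c_prefix_kind {
    PREFIX_OTHER,     /* anything else; left to the rest of the lexer */
    PREFIX_WIDE_CHAR, /* u'c', U'c', L'c' (already handled, per the text) */
    PREFIX_U8_STRING, /* u8"..." (already handled, per the text) */
    PREFIX_U8_CHAR    /* u8'c' (the proposed sibling branch) */
};

/* Classifies the token start at p and reports how many prefix bytes
 * to skip before the opening quote. */
enum c_prefix_kind classify_c_prefix(const char *p, size_t *prefix_len)
{
    assert(p != NULL && prefix_len != NULL);
    *prefix_len = 0;

    if (p[0] != 'u' && p[0] != 'U' && p[0] != 'L')
        return PREFIX_OTHER;

    if (p[1] == '\'') {               /* u'c', U'c', L'c' */
        *prefix_len = 1;
        return PREFIX_WIDE_CHAR;
    }

    if (p[0] == 'u' && p[1] == '8') { /* only lowercase u forms u8 */
        if (p[2] == '"') {
            *prefix_len = 2;
            return PREFIX_U8_STRING;
        }
        if (p[2] == '\'') {           /* new: u8 character constant */
            *prefix_len = 2;
            return PREFIX_U8_CHAR;
        }
    }
    return PREFIX_OTHER;              /* e.g. an identifier such as u8x */
}
```

The new branch sits next to the existing u8 string check and shares its p[0]=='u' && p[1]=='8' test, so an identifier such as u8x or a sequence such as U8'a' still falls through to the ordinary path.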
Change sites
- In compiler/src/dmd/lexer.d:511-516, alongside the existing p[1]=='8' && p[2]=='"' (u8 string) check, add a p[1]=='8' && p[2]=='\'' (u8 char) check that advances past u8 and lexes a UTF-8 character constant. // C23 6.4.4.5
- In compiler/src/dmd/lexer.d:2293 (clexerCharConstant), add a u8 case that decodes exactly one UTF-8 code unit and enforces the single-code-unit constraint; also update the doc comment from C11 6.4.4.4 to C23 6.4.4.5. // C23 6.4.4.5
- In druntime/src/importc.h:74, add typedef unsigned char char8_t; near the other implementation typedefs. // C23 6.2.5, 7.30
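The properties that the typedef relies on can be checked with a few lines of C11. This is a minimal sketch, not the contents of importc.h: only the typedef line comes from the text above, while the helper names char8_promotes_to_int and char8_max_value are hypothetical, and the promotion to int assumes a platform where int can represent every unsigned char value.

```c
#include <assert.h>
#include <limits.h>

/* The typedef proposed for importc.h. */
typedef unsigned char char8_t;

/* One code unit occupies exactly one byte, and the type is unsigned. */
_Static_assert(sizeof(char8_t) == 1, "char8_t must be one byte");
_Static_assert((char8_t)-1 > 0, "char8_t must be unsigned");

/* Returns 1 when a char8_t operand is promoted to int in an expression,
 * which is why an int-typed character literal token can carry its value. */
int char8_promotes_to_int(void)
{
    char8_t c = 0xFF;
    return _Generic(+c, int: 1, default: 0);
}

/* Returns the largest char8_t value as an int. */
int char8_max_value(void)
{
    char8_t c = (char8_t)-1;
    assert(c == UCHAR_MAX);
    return c;
}
```

Because the type is unsigned, a constant such as u8'\xFF' keeps the value 255 after promotion instead of becoming negative, as it could with a signed char.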
Out of scope / deferred
- u8"..." UTF-8 string literals: already supported (compiler/src/dmd/lexer.d:511-516).
- char16_t / char32_t and u/U char constants: already handled by clexerCharConstant; no change here.
- Any <uchar.h> library surface (mbrtoc8, char8_t macro versioning): library, not the ImportC parser.
- Strict re-typing of u8'c' to a distinct char8_t-flavored token: deferred; integer-promoted int/charLiteral value suffices. Note here if a concrete need arises.
Acceptance criteria
- A compilable test exercising u8'a', a char8_t variable initialized from it, and a u8'\xNN' escape, each with its // C23 6.4.4.5 comment.
- A fail_compilation test for the error path: u8'ab' (more than one code unit, 6.4.4.5 p10 / footnote 73), with // C23 6.4.4.5.
- Changelog entry (changelog/dmd.importc-c23-char8.dd).
- No regressions in existing tests (u8"..." string-literal and u/U/L char-constant paths).