ImportC C23: char8_t and u8'c' character constants #23230

Description

@PetarKirov

Part of #23223 (M1 — standard-independent).

Summary

C23 lets u8 prefix a character constant (u8'c'), not just a string literal, giving it the new char8_t type. This is purely additive: in C11, u8'...' lexes as the identifier u8 followed by a character constant, which is a syntax error unless u8 happens to be defined as a macro, so accepting it cannot change the meaning of a valid C11 program outside that corner case. ImportC should lex u8'c' and map char8_t to D's char (a UTF-8 byte). No CLI flag or standard-selection is needed.

Spec

N3220 6.4.4.5 Character constants (adds the u8 encoding-prefix and the char8_t constant type); the char8_t type itself is normatively "an unsigned integer type ... the same type as unsigned char" per 7.30 <uchar.h>, within the integer/character type model of 6.2.5 Types.

Grammar (N3220 6.4.4.5; opt = optional)

character-constant:
              encoding-prefix(opt) ' c-char-sequence '

encoding-prefix: one of
              u8     u     U     L

c-char-sequence:
              c-char
              c-char-sequence c-char

c-char:
              any member of the source character set except
                          the single-quote ', backslash \, or new-line character
              escape-sequence

escape-sequence:
              simple-escape-sequence
              octal-escape-sequence
              hexadecimal-escape-sequence
              universal-character-name

Constraints (N3220 6.4.4.5 p9-10):

  • The value of an octal/hexadecimal escape sequence shall be in range of the corresponding type; for the u8 prefix the corresponding type is char8_t.
  • "A UTF-8 ... character constant shall not contain more than one character. The value shall be representable with a single UTF-8 ... code unit, respectively." (footnote 73: "u8'ab' violates this constraint.")

Semantics relevant to the parser (N3220 6.4.4.5 p12): "A UTF-8 character constant has type char8_t. If the UTF-8 character constant is not produced through a hexadecimal or octal escape sequence, the value ... is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit."
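The constraint and value rules quoted above can be sketched as a small range check. This is only an illustration in C, not dmd code: the names check_u8_char and U8CharStatus are hypothetical, and the sketch assumes an 8-bit char8_t, so a numeric escape may be 0..0xFF while an ordinary character or universal-character-name must be a code point in U+0000..U+007F (the only code points that encode as a single UTF-8 code unit).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch; not dmd identifiers. */
typedef enum {
    U8C_OK,           /* constant is acceptable */
    U8C_BAD_COUNT,    /* zero or more than one character, e.g. u8'ab' */
    U8C_OUT_OF_RANGE  /* value does not fit a single UTF-8 code unit */
} U8CharStatus;

/* value:               decoded code point, or the raw escape value
 * from_numeric_escape: true for \xNN or octal escapes
 * char_count:          number of c-chars between the quotes
 * Assumes char8_t is 8 bits wide. */
static U8CharStatus check_u8_char(uint32_t value,
                                  bool from_numeric_escape,
                                  unsigned char_count)
{
    if (char_count != 1)
        return U8C_BAD_COUNT;

    /* Numeric escapes are limited by the range of char8_t; everything
     * else must be a code point that is one UTF-8 code unit long. */
    uint32_t limit = from_numeric_escape ? 0xFFu : 0x7Fu;
    return value <= limit ? U8C_OK : U8C_OUT_OF_RANGE;
}
```

Under these assumptions u8'a' and u8'\xFF' pass, while u8'ab' and a constant holding U+00E9 are rejected.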

Approach

Two independent, additive pieces:

  1. char8_t type: provided as a typedef in druntime/src/importc.h (mapping to D char, i.e. a UTF-8 byte), matching how that header already supplies other implementation typedefs (druntime/src/importc.h:74). No new keyword and no frontend change is needed for the type name.

  2. u8'c' lexing: extend the existing C prefix-scanning block in compiler/src/dmd/lexer.d. The case 'u': case 'U': case 'L': block already handles wide char constants (compiler/src/dmd/lexer.d:497-499 calling clexerCharConstant) and already special-cases the u8"..." string literal (compiler/src/dmd/lexer.d:511-516). Add a sibling branch: when p[0]=='u' && p[1]=='8' && p[2]=='\'', advance past u8 and lex a single-code-unit UTF-8 character constant. Because char8_t is unsigned char, a u8'c' constant fits in a char byte; it can be emitted as TOK.charLiteral (which cparse.d maps to int/tint32, matching the integer-promoted use of a char8_t value) so no new token kind is required. Enforce the 6.4.4.5 p10 constraint (single UTF-8 code unit, value range of char8_t) and diagnose u8'ab' per footnote 73.

clexerCharConstant (compiler/src/dmd/lexer.d:2293) is the natural home for the new prefix handling; note its doc comment currently cites C11 6.4.4.4 — update it to C23 6.4.4.5 since C23 renumbered this clause and added the u8 prefix.
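The prefix check described in the approach can be illustrated with a standalone sketch. It is written in C rather than D and does not mirror the actual lexer.d code: scan_char_prefix and the CharPrefix enumerators are hypothetical names, and the input is assumed to be NUL-terminated so that the short-circuited comparisons never read past the end of the buffer.

```c
#include <stddef.h>

/* Hypothetical prefix kinds for this sketch; not dmd identifiers. */
typedef enum {
    PFX_NONE,     /* not a prefixed character constant */
    PFX_U8,       /* u8'c' (new in C23) */
    PFX_LOWER_U,  /* u'c'  */
    PFX_UPPER_U,  /* U'c'  */
    PFX_WIDE      /* L'c'  */
} CharPrefix;

/* If p starts a prefixed character constant, return its prefix kind and
 * store the offset of the opening quote in *quote_off. Otherwise return
 * PFX_NONE and leave *quote_off untouched. u8"..." is a string literal,
 * so it is deliberately not matched here. */
static CharPrefix scan_char_prefix(const char *p, size_t *quote_off)
{
    /* Check the two-character u8 prefix before the one-character ones. */
    if (p[0] == 'u' && p[1] == '8' && p[2] == '\'') {
        *quote_off = 2;
        return PFX_U8;
    }
    if (p[1] == '\'' && (p[0] == 'u' || p[0] == 'U' || p[0] == 'L')) {
        *quote_off = 1;
        return p[0] == 'u' ? PFX_LOWER_U
             : p[0] == 'U' ? PFX_UPPER_U
             : PFX_WIDE;
    }
    return PFX_NONE;
}
```

A caller would advance past the prefix by quote_off bytes and hand the remainder to the character-constant routine, with the prefix kind selecting the range check.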

Change sites

  • compiler/src/dmd/lexer.d:511-516 — alongside the existing p[1]=='8' && p[2]=='"' (u8 string) check, add a p[1]=='8' && p[2]=='\'' (u8 char) check that advances past u8 and lexes a UTF-8 character constant. // C23 6.4.4.5
  • compiler/src/dmd/lexer.d:2293 (clexerCharConstant) — add a '8'/u8 case that decodes exactly one UTF-8 code unit, enforces the single-code-unit constraint, and updates the doc comment from C11 6.4.4.4 to C23 6.4.4.5. // C23 6.4.4.5
  • druntime/src/importc.h:74 — add typedef unsigned char char8_t; near the other implementation typedefs. // C23 6.2.5, 7.30
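As a rough illustration of the typedef change site, the following C11 snippet shows the typedef together with compile-time checks of the properties the approach relies on. It is a sketch and not the actual contents of importc.h; the helper char8_promotes_to_int is hypothetical and exists only to demonstrate that a char8_t value promotes to int, which is what makes an int-typed character literal adequate.

```c
/* Sketch only: the typedef proposed for importc.h. */
typedef unsigned char char8_t;

/* char8_t occupies exactly one byte and is unsigned. */
_Static_assert(sizeof(char8_t) == 1, "char8_t must be one byte");
_Static_assert((char8_t)-1 > 0, "char8_t must be unsigned");

/* Hypothetical helper: returns 1 if a char8_t operand promotes to int
 * under the integer promotions, 0 otherwise. */
static int char8_promotes_to_int(void)
{
    return _Generic(+(char8_t)0, int: 1, default: 0);
}
```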

Out of scope / deferred

  • u8"..." UTF-8 string literals: already supported (compiler/src/dmd/lexer.d:511-516).
  • char16_t / char32_t and u/U char constants: already handled by clexerCharConstant; no change here.
  • Any <uchar.h> library surface (mbrtoc8, char8_t macro versioning): library, not the ImportC parser.
  • Strict re-typing of u8'c' to a distinct char8_t-flavored token: deferred; the integer-promoted int value carried by TOK.charLiteral suffices. Raise it in this issue if a concrete need arises.

Acceptance criteria

  • compilable test exercising u8'a', a char8_t variable initialized from it, and a u8'\xNN' escape, each with its // C23 6.4.4.5 comment.
  • fail_compilation test for the error path: u8'ab' (more than one character, 6.4.4.5 p10 / footnote 73), with // C23 6.4.4.5.
  • Changelog entry (changelog/dmd.importc-c23-char8.dd).
  • No regression for existing tests (notably the u8"..." string-literal and u/U/L char-constant paths).
