2 Unicode Strings File Format

EDK II Unicode files are used for mapping token names to localized strings that are identified by an RFC4646 language code. The format for storing EDK II Unicode files on disk is UTF-8 (without a BOM character) or UTF-16LE (with a BOM character). The character content must be UCS-2.

The end of a string is determined by whichever of the following items is found first:

  • a control character
  • a comment
  • the end of the file
  • a blank line

Comments may appear anywhere within the string file.

All UTF-16LE files must begin with a Unicode BOM character. UTF-8 files must not begin with a Unicode BOM character.
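
The BOM rules above can be checked mechanically before a file is parsed. The following Python sketch is illustrative only and is not part of the EDK II tools: the function names `detect_uni_encoding` and `decode_uni` are hypothetical, and rejecting a UTF-8 BOM with an exception is an assumption about how a tool might enforce the rule.

```python
import codecs


def detect_uni_encoding(data: bytes) -> str:
    """Pick a codec for raw UNI file bytes based on the BOM rules."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):
        # Assumption: a UTF-8 BOM is treated as an error, not skipped.
        raise ValueError("UTF-8 files must not begin with a BOM")
    # No BOM: the only remaining allowed storage format is UTF-8.
    return "utf-8"


def decode_uni(data: bytes) -> str:
    """Decode UNI file bytes and enforce UCS-2 character content."""
    encoding = detect_uni_encoding(data)
    if encoding == "utf-16-le":
        data = data[len(codecs.BOM_UTF16_LE):]  # drop the BOM itself
    text = data.decode(encoding)
    # UCS-2 covers only the Basic Multilingual Plane (U+0000..U+FFFF).
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character content must be UCS-2")
    return text
```

A UTF-16LE file that lacks a BOM falls through to the UTF-8 branch in this sketch and will normally fail to decode, which is consistent with the requirement that such files carry a BOM.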


NOTE: Use an editor that supports UCS-2 characters and that can save files as either UTF-8 (without a BOM character) or UTF-16LE (with a BOM character).


2.1 Common EBNF

The following EBNF uses characters enclosed in double quotes to represent UCS-2 string literals. In the following definitions, a semicolon denotes the start of a comment.

<US>           ::= " "
<Letter>       ::= {(\u0041-\u005A)} ; Characters A - Z
                   {(\u0061-\u007A)} ; Characters a - z
<Digit>        ::= (\u0030-\u0039)   ; Characters 0 - 9
<MS>           ::= <US>+
<ME>           ::= {<MS>} {<EOL>}
<CommentLine>  ::= "//" <US>* <PChars> <EOL>
<BlankLine>    ::= <EOL>
<Chars>        ::= (\u0001-\uF6FF)
<PChars>       ::= {(\u0020-\uF6FF)} {<OpChars>}
<OpChars>      ::= "\x" [{<Letter>} {<Digit>}]{4} "\"
<VChars>       ::= (\u0021-\uF6FF)
<UnicodeLines> ::= <Token> <ME>
                   [<Ldef> [<String> <ME>]+]+
<Ldef>         ::= <CtrlChar> "language" <MS> <LangCode> <ME>
<HexDigit>     ::= {<Digit>}
                   {(\u0041-\u0046)} ; Characters A - F
                   {(\u0061-\u0066)} ; Characters a - f
<CtrlChar>     ::= <US>* "#"
<Token>        ::= <CtrlChar> "string" <MS> <Identifier>
<Identifier>   ::= <Letter> [{<Letter>} {<Digit>} {<UN>}]*
<LangCode>     ::= <RFC4646>
<RFC4646>      ::= <Letter>{2,8} [<ShortExt> <LongExt>*]
<ShortExt>     ::= "-" [{<Letter>} {<Digit>}]{1,8}
<LongExt>      ::= "-" [{<Letter>} {<Digit>}]{1,}
<UDblQuote>    ::= \u0022  ; Double Quote Character, "
<String>       ::= <UDblQuote> <SContent>* <UDblQuote>
<SContent>     ::= {<PChars>} {<Attributes>}
<Attributes>   ::= "\" {"narrow"} {"wide"} {<UDblQuote>}
                   {"n"} {"r"} {"t"} {"nbr"} {"\"} {"'"}
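
As an illustration of how the `<Token>` and `<Identifier>` rules might be applied, the Python sketch below matches a `#string` line with a regular expression. It is a minimal sketch, not the parser used by any EDK II tool. The name `parse_token` is hypothetical, and because `<UN>` is not defined in the grammar above, the sketch assumes it stands for the underscore character.

```python
import re

# <Identifier>: a letter followed by letters, digits, or <UN>.
# Assumption: <UN> is the underscore character.
IDENTIFIER = r"[A-Za-z][A-Za-z0-9_]*"

# <Token> ::= <CtrlChar> "string" <MS> <Identifier>
# <CtrlChar> is optional spaces then "#"; <MS> is one or more spaces.
# The lookahead requires a space or end of line after the identifier.
TOKEN_LINE = re.compile(rf" *#string +({IDENTIFIER})(?= |$)")


def parse_token(line: str):
    """Return the identifier from a '#string' line, or None if no match."""
    match = TOKEN_LINE.match(line)
    return match.group(1) if match else None
```

For example, `parse_token("#string STR_HELLO")` yields `"STR_HELLO"`, while a line whose identifier starts with a digit yields `None`.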

2.1.1 Definitions

LanguageCodes

The language code must be a valid RFC4646 language code.
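
The `<RFC4646>` rule in the grammar is a simplified shape check, and it can be approximated with a regular expression. The Python sketch below mirrors that rule only. It does not validate subtags against the RFC 4646 registry, and the name `is_valid_lang_code` is hypothetical.

```python
import re

# <RFC4646>  ::= <Letter>{2,8} [<ShortExt> <LongExt>*]
# <ShortExt> ::= "-" followed by 1 to 8 letters or digits
# <LongExt>  ::= "-" followed by 1 or more letters or digits
LANG_CODE = re.compile(
    r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]+)*)?"
)


def is_valid_lang_code(code: str) -> bool:
    """Check a language code against the grammar's <RFC4646> shape."""
    return LANG_CODE.fullmatch(code) is not None
```

With this sketch, codes such as `en-US`, `fr`, and `zh-Hant-TW` pass the check, while `en_US` and `en-` do not.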

EscChar

To include certain standard characters within a string, such as the back slash "\" character, the character must be prefixed with the escape character. Characters that may require the escape prefix are the back slash "\", the single quote "'", the double quote '"', and the forward slash "/". The back slash always requires the escape character.
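
A tool that writes literal text into a string entry might escape it along the following lines. This Python sketch assumes the escape character is the back slash, as the `<Attributes>` rule suggests, and it escapes only the back slash and the double quote, the two characters that would otherwise change how the string is read. The name `escape_uni_string` is hypothetical.

```python
def escape_uni_string(text: str) -> str:
    """Escape back slashes and double quotes in literal string content."""
    # Escape the back slash first so that the back slashes added for
    # double quotes in the second step are not escaped again.
    escaped = text.replace("\\", "\\\\")
    return escaped.replace('"', '\\"')
```

Because every back slash is doubled, this sketch is only suitable for literal text. Input that already contains control sequences such as `\n` or `\wide` would have them turned into literal characters.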

Token

The token (string identifier) may contain only digits, upper and lower case letters, the underscore character, and the dash character.

Include

An include line causes another file, which must also comply with this specification, to be parsed as if its content were part of the including file. Tokens should not overlap between the files for the same language.