Version | Unicode 16.0.0 |
Editors | Ken Whistler |
Date | 2024-08-27 |
This Version | https://www.unicode.org/reports/tr44/tr44-34.html |
Previous Version | https://www.unicode.org/reports/tr44/tr44-32.html |
Latest Version | https://www.unicode.org/reports/tr44/ |
Latest Proposed Update | https://www.unicode.org/reports/tr44/proposed.html |
Revision | 34 |
This annex provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database and how it specifies the formal definitions of the Unicode Character Properties.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].
Note: the information in this annex is not intended as an exhaustive description of the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 16.0.0 of the standard unless otherwise indicated.
The Unicode Standard is far more than a simple encoding of characters. The standard also associates a rich set of semantics with each encoded character—properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names. The data files define the Unicode character properties and mappings between Unicode characters (such as case mappings).
This annex describes the UCD and provides a guide to the various documentation files associated with it. Additional information about character properties and their use is contained in the Unicode Standard and its annexes. In particular, implementers should familiarize themselves with the formal definitions and conformance requirements for properties detailed in Section 3.5, Properties in [Unicode] and with the material in Chapter 4, Character Properties in [Unicode]. Additional discussion about the Unicode character property model can be found in [UTR23].
The latest version of the UCD is always located on the Unicode website at:
https://www.unicode.org/Public/UCD/latest/
The specific files for the UCD associated with this version of the Unicode Standard (16.0.0) are located at:
https://www.unicode.org/Public/16.0.0/
Stable, archived versions of the UCD associated with all earlier versions of the Unicode Standard can be accessed from:
https://www.unicode.org/ucd/
For a description of the changes in the UCD for this version and earlier versions, see the UCD Change History.
The Unicode Character Database is an integral part of the Unicode Standard.
The UCD contains normative property and mapping information required for implementation of various Unicode algorithms such as the Unicode Bidirectional Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also contain additional informative and provisional character property information.
Each specification of a Unicode algorithm, whether specified in the text of [Unicode] or in one of the Unicode Standard Annexes, designates which data file(s) in the UCD are needed to provide normative property information required by that algorithm.
For information on the meaning and application of the terms normative, informative, contributory, and provisional, see Section 3.5, Properties in [Unicode].
For information about the applicable terms of use for the UCD, see the Unicode Terms of Use.
Some character properties in the UCD are simple properties. This status has no bearing on whether or not the properties are normative, but merely indicates that their values are not derived from some combination of other properties.
Other character properties are derived. This means that their values are derived by rule from some other combination of properties. Generally such rules are stated as set operations, and may or may not include explicit exception lists for individual characters.
Certain simple properties are defined merely to make the statement of the rule defining a derived property more compact or general. Such properties are known as contributory properties. Sometimes these contributory properties are defined to encapsulate the messiness inherent in exception lists. At other times, a contributory property may be defined to help stabilize the definition of an important derived property which is subject to stability guarantees.
Derived character properties are not considered second-class citizens among Unicode character properties. They are defined to make implementation of important algorithms easier to state. Included among the first-class derived properties important for such implementations are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt, as well as derived properties for the optimization of normalization, defined in DerivedNormalizationProps.txt.
Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.
Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time. These definitions are presented in a modified set notation, expressed as set additions and/or subtractions of various other property values. For example:
# Derived Property: ID_Start
#  Characters that can start an identifier.
#  Generated from:
#      Lu + Ll + Lt + Lm + Lo + Nl
#    + Other_ID_Start
#    - Pattern_Syntax
#    - Pattern_White_Space
When interpreting definitions of derived properties of this sort, keep in mind that set subtraction is not a commutative operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set than "Lo - Pattern_Syntax + Lm". The order of property set operations stated in the definitions affects the composition of the derived set.
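The effect of operation order can be checked with a small sketch. The sets below are toy integers standing in for property value sets, not real UCD data, and the helper name `evaluate` is hypothetical. It applies "+" and "-" strictly left to right, which is how the modified set notation is read here.

```python
def evaluate(first, steps):
    """Apply '+' (union) and '-' (difference) steps left to right."""
    result = set(first)
    for op, operand in steps:
        if op == "+":
            result |= operand
        elif op == "-":
            result -= operand
        else:
            raise ValueError(f"unknown set operation: {op!r}")
    return result


# Toy stand-ins for property value sets (illustrative only, not UCD data).
lo = {1, 2, 3}
lm = {3, 4}
pattern_syntax = {2, 4}

# "Lo + Lm - Pattern_Syntax": the subtraction applies to the whole union.
a = evaluate(lo, [("+", lm), ("-", pattern_syntax)])

# "Lo - Pattern_Syntax + Lm": Lm is added back after the subtraction.
b = evaluate(lo, [("-", pattern_syntax), ("+", lm)])

print(sorted(a))  # [1, 3]
print(sorted(b))  # [1, 3, 4]
```

The element 4 belongs to both the toy `lm` and the toy `pattern_syntax`, so it survives only when the subtraction is applied before `lm` is added.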
If the explicit listing of a derived property in DerivedCoreProperties.txt or a similar data file in the UCD does not match the definition of that property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property. As described in Stability of Releases, the property listing in the data files for any given version of the standard will never change for that version.
In limited cases, a Unicode character property defined in the Unicode Character Database may have an external dependency on another specification which is not a part of the Unicode Standard, and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by requiring that dependency to be based on a known, published version of the external specification.
Starting with Version 10.0 of the UCD and continuing through Version 12.1, the clear example of such an external dependency was the derivation of some segmentation-related character properties, in part based on emoji properties associated with UTS #51, "Unicode Emoji" [UTS51]. The details of the derivation were described in the respective annexes, [UAX14] and [UAX29], as well as in the documentation portions of the associated UCD property files. See [Data14] and [Props]. The version of UTS #51 used for those segmentation properties in each of the relevant versions of the UCD was clearly identified in those annexes and data files. Starting with Version 13.0 of the UCD, however, the emoji properties which the UCD previously depended on have been formally incorporated into the UCD, so that they no longer constitute an external dependency.
An external dependency may impact either a simple or a derived property.
Unicode character properties have default values. Default values are the value or values that a character property takes for an unassigned code point, or in some instances, for designated subranges of code points, whether assigned or unassigned. For example, the default value of a binary Unicode character property is always "N".
For the formal discussion of default values, see D26 in Section 3.5, Properties in [Unicode]. For conventions related to default values in various data files of the UCD and for documentation regarding the particular default values of individual Unicode character properties, see Default Values.
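As a rough illustration of default values, the sketch below queries Python's standard `unicodedata` module for an unassigned code point. It assumes that U+0378 is still unassigned in the Unicode version bundled with the Python interpreter in use (it is unassigned as of Version 16.0.0). The helper name `defaults_for` is hypothetical, and the module reflects whatever UCD version Python ships, not necessarily this one.

```python
import unicodedata

# Assumption: U+0378 is unassigned in the Unicode version bundled with Python.
UNASSIGNED = 0x0378


def defaults_for(code_point):
    """Return (General_Category, Canonical_Combining_Class) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.combining(ch)


gc, ccc = defaults_for(UNASSIGNED)
print(gc, ccc)  # Cn 0
```

General_Category falls back to Cn and Canonical_Combining_Class to 0 (Not_Reordered) because no explicit value is listed for the code point.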
Just as for the Unicode Standard as a whole, each version of the UCD, once published, is absolutely stable and will never change. Each released version is archived in a directory on the Unicode website, with a directory number associated with that version. URLs pointing to that version's directory are also stable and will be maintained in perpetuity.
Any errors discovered for a released version of the UCD are noted in [Errata], and if appropriate will be corrected in a subsequent version of the UCD.
Stability guarantees constraining how Unicode character properties can (or cannot) change between releases of the UCD are documented in the Unicode Consortium Stability Policies [Stability].
Updates to character properties in the Unicode Character Database may be required for any of three reasons:
While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them. In particular, as less well-documented scripts, such as those for minority languages, or historic scripts are added to the standard, the exact character properties and behavior may not fully be known when the script is first encoded. The properties for some of these characters may change as further information becomes available or as implementations turn up problems in the initial property assignments. As far as possible, any readjustment of property values based on growing implementation experience is made to be compatible with established practice.
All changes to normative or informative property values, to the status or type of a property, or to property or property value aliases, must be approved by an explicit decision taken by the Unicode Technical Committee. Changes to provisional property values are subject to less stringent oversight.
Occasionally, a character property value is changed to prevent incorrect generalizations about a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking.
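The reclassification can be observed from an implementation's point of view. The following sketch uses Python's standard `unicodedata` module, which reports the values of the UCD version bundled with the interpreter. The helper name `is_space_separator` is hypothetical.

```python
import unicodedata


def is_space_separator(ch):
    """True if the character has General_Category=Zs."""
    return unicodedata.category(ch) == "Zs"


zwsp = "\u200b"  # ZERO WIDTH SPACE
print(unicodedata.category(zwsp))   # Cf
print(is_space_separator(zwsp))     # False
print(is_space_separator("\u0020")) # True
```

Code that tests for General_Category=Zs therefore no longer treats U+200B as a space character.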
There is no guarantee that a particular value for an enumerated property will actually have characters associated with it. Also, because of changes in property value assignments between versions of the standard, a property value that once had characters associated with it may later have none. Such conditions and changes are rare, but implementations must not assume that all property values are associated with non-null sets of characters. For example, currently the special Script property value Katakana_Or_Hiragana has no characters associated with it.
In some instances an entire property may become obsolete. For example, the ISO_Comment property was once used to keep track of annotations for characters used in the production of name lists for ISO/IEC 10646 code charts. As of Unicode 5.2.0 that property became obsolete, and its value is now defaulted to the null string for all Unicode code points.
An obsolete property is never removed from the UCD.
Occasionally an obsolete property may also be formally deprecated. This is an indication that the property is no longer recommended for use, perhaps because its original intent has been replaced by another property or because its specification was somehow defective. See also the general discussion of Deprecation.
A deprecated property is never removed from the UCD.
Table 1 lists the properties that are formally deprecated as of this version of the Unicode Standard.
Table 1. Deprecated Properties
Property Name | Deprecation Version | Reason |
---|---|---|
Grapheme_Link | 5.0.0 | Duplication of ccc=9 |
Hyphen | 6.0.0 | Supplanted by Line_Break property values |
ISO_Comment | 6.0.0 | No longer needed for chart generation; otherwise not useful |
Expands_On_NFC | 6.0.0 | Less useful than UTF-specific calculations |
Expands_On_NFD | 6.0.0 | Less useful than UTF-specific calculations |
Expands_On_NFKC | 6.0.0 | Less useful than UTF-specific calculations |
Expands_On_NFKD | 6.0.0 | Less useful than UTF-specific calculations |
FC_NFKC_Closure | 6.0.0 | Supplanted in usage by NFKC_Casefold; otherwise not useful |
Another possibility is that an obsolete property may be declared to be stabilized. Such a determination does not indicate that the property should or should not be used; instead it is a declaration that the UTC (Unicode Technical Committee) will no longer actively maintain the property or extend it for newly encoded characters. The property values of a stabilized property are frozen as of a particular release of the standard.
A stabilized property is never removed from the UCD.
Table 2 lists the properties that are formally stabilized as of this version of the Unicode Standard.
Table 2. Stabilized Properties
Property Name | Stabilization Version |
---|---|
Hyphen | 4.0.0 |
ISO_Comment | 6.0.0 |
This annex provides the core documentation for the UCD, but additional information about character properties is available in other parts of the standard and in additional documentation files contained within the UCD.
The formal definitions related to character properties used by the Unicode Standard are documented in Section 3.5, Properties in [Unicode]. Understanding those definitions and related terminology is essential to the appropriate use of Unicode character properties.
See Section 4.1, Unicode Character Database, in [Unicode] for a general discussion of the UCD and its use in defining properties. The rest of Chapter 4 provides important explanations regarding the meaning and use of various normative character properties.
For a general discussion of the property model which underlies the definitions associated with the UCD, see Unicode Technical Report #23, "The Unicode Character Property Model" [UTR23]. That technical report is informative, but over the years various content from it has been incorporated into normative portions of the Unicode Standard, particularly for the definitions in Chapter 3.
UTR #23 presents the important distinction between properties defined for strings (in contrast to properties defined for characters or code points) and character properties that have values that are strings. The latter are referred to as string-valued properties in UTR #23 and in this annex. UTR #23 also discusses string functions and their relation to character properties.
NamesList.html formally describes the format of the NamesList.txt data file in BNF. That data file is used to drive the PDF formatting of the Unicode code charts and names list. See also Section 24.1, Character Names List, in [Unicode] for a detailed discussion of the conventions used in the Unicode names list as formatted for the online code charts.
StandardizedVariants.html has been obsoleted as of Version 9.0 of the UCD. This file formerly documented standardized variants, showing a representative glyph for each. It was closely tied to the data file, StandardizedVariants.txt, which defines those sequences normatively.
The function of StandardizedVariants.html to show representative glyphs for standardized variants has been superseded. There are now better means of illustrating the glyphs. Many standardized variation sequences are shown in the Unicode code charts directly, in summary sections at the ends of the names list for any block which contains them. Glyphs for standardized variants of CJK compatibility ideographs are also shown directly in the Unicode code charts.
Emoji variation sequences are a special class of variation sequences involving emoji characters. They are divided into two subtypes: an emoji presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0F, and a text presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0E. Such sequences come in pairs: the text presentation sequence shown with a black and white presentation, as seen in the Unicode code charts, and the emoji presentation sequence shown with a colorful icon, as usually seen in implementations on mobile devices and elsewhere.
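The two subtypes differ only in the variation selector that follows the base. The sketch below builds both members of a pair. The function names are hypothetical, and the code does not check whether the base is actually listed in emoji-variation-sequences.txt, which remains the authority on which sequences are valid.

```python
VS15 = "\uFE0E"  # variation selector for text presentation
VS16 = "\uFE0F"  # variation selector for emoji presentation


def text_presentation_sequence(base):
    """Base character followed by U+FE0E."""
    return base + VS15


def emoji_presentation_sequence(base):
    """Base character followed by U+FE0F."""
    return base + VS16


# U+2764 HEAVY BLACK HEART is used here as an example base.
base = "\u2764"
pair = (text_presentation_sequence(base), emoji_presentation_sequence(base))
for seq in pair:
    print(" ".join(f"{ord(c):04X}" for c in seq))
# 2764 FE0E
# 2764 FE0F
```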
Starting with Version 9.0.0, the following page in the Unicode emoji subsite area shows appropriate representative glyphs for all emoji variation sequences, with separate columns for text presentation sequences and for emoji presentation sequences:
https://www.unicode.org/emoji/charts/emoji-variants.html
The data file which defines the exact list of emoji variation sequences is emoji-variation-sequences.txt. That file is maintained in the UCD, but emoji variation sequences are documented in Unicode Technical Standard #51, Unicode Emoji [UTS51].
Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38] describes the format and content of the Unihan Database [Unihan], which collects together all property information for CJK unified ideographs. That annex also specifies in detail which of the Unihan character properties are normative, informative, or provisional.
The Unihan Database contains extensive and detailed mapping information for CJK unified ideographs encoded in the Unicode Standard, but it is aimed only at those ideographs, not at other characters used in the East Asian context in general. In contrast, East Asian legacy character sets, including important commercial and national character set standards, contain many non-CJK characters. As a result, the Unihan Database must be supplemented from other sources to establish mapping tables for those character sets.
The majority of the content of the Unihan Database is released for each version of the Unicode Standard as a collection of Unihan data files in the UCD. Because of their large size, these data files are released only as a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip and the CJK properties each one contains are provided in [UAX38]. For versions of the UCD prior to Version 5.2.0, all of the CJK properties were listed together in a very large, single file, Unihan.txt.
Unicode Standard Annex #45, "U-Source Ideographs" [UAX45] describes the format of USourceData.txt, which lists all of the information for UTC-Source ideographs.
In addition to the specific documentation files for the UCD, individual data files often contain extensive header comments describing their content and any special conventions used in the data.
In some instances, individual property definition sections also contain comments with information about how the property may be derived. Such comments are informative; while they are intended to convey the intent of the derivation, in case of any mismatch between a statement of a derivation in a comment field and the actual listing of the derived property, the list is considered to be definitive. See Simple and Derived Properties.
UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its content has been wholly incorporated into this document.
Unihan.html was formerly the primary documentation file for the Unihan Database. As of Version 5.1.0, its content has been wholly incorporated into [UAX38].
Versions of the Unicode Standard prior to Version 4.0.0 contained small, focused documentation files, UnicodeCharacterDatabase.html, PropList.html, and DerivedProperties.html, which were later consolidated into UCD.html.
StandardizedVariants.html has been obsoleted as of Version 9.0.0. See Section 3.4, StandardizedVariants.html.
The heart of the UCD consists of the data files themselves. This section describes the directory structure for the UCD, the format conventions for the data files, and provides documentation for data files not documented elsewhere in this annex.
Each version of the UCD is released in a separate, numbered directory under the Public directory on the Unicode website. The content of that directory is complete for that release. It is also stable—once released, it will be archived permanently in that directory, unchanged, at a stable URL.
The specific files for the UCD associated with this version of the Unicode Standard (16.0.0) are located at:
https://www.unicode.org/Public/16.0.0/
The latest released version of the UCD is always accessible via the following stable URL:
https://www.unicode.org/Public/UCD/latest/
Zipped copies of the latest released version of the UCD are always accessible via the following stable URL:
https://www.unicode.org/Public/zipped/latest/
Prior to Version 6.3.0, access to the latest released version of the UCD was via the following stable URL:
https://www.unicode.org/Public/UNIDATA/
That "UNIDATA" URL will be maintained, but is no longer recommended, because it points to the ucd subdirectory of the latest release, rather than to the parent directory for the release. The "UNIDATA" naming convention is also very old, and does not follow the directory naming conventions currently used for other data releases in the Public directory on the Unicode website.
The UCD proper is located in the ucd subdirectory of the numbered version directory. That directory contains all of the documentation files and most of the data files for the UCD, including some data files for derived properties.
Although all UCD data files are version-specific for a release and most contain internal date and version stamps, the file names of the released data files do not differ from version to version. When linking to a version-specific data file, the version will be indicated by the version number of the directory for the release.
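Because file names stay fixed and only the directory number varies, a version-specific link can be assembled mechanically. The sketch below is one possible helper, with a hypothetical name, covering only files that sit directly in the ucd subdirectory. It rejects versions before 4.1.0, for which this annex describes a different directory layout.

```python
def ucd_file_url(version, filename):
    """Build the URL of a data file in the ucd subdirectory of a release.

    Hypothetical helper: assumes the numbered-directory layout that this
    annex describes for Version 4.1.0 and later.
    """
    parts = tuple(int(p) for p in version.split("."))
    if parts < (4, 1, 0):
        raise ValueError("directory layout differs before Version 4.1.0")
    return f"https://www.unicode.org/Public/{version}/ucd/{filename}"


print(ucd_file_url("16.0.0", "UnicodeData.txt"))
# https://www.unicode.org/Public/16.0.0/ucd/UnicodeData.txt
```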
All files for derived extracted properties are in the extracted subdirectory of the ucd subdirectory. See Derived Extracted Properties for documentation regarding those data files and their content.
A number of auxiliary properties are specified in files in the auxiliary subdirectory of the ucd subdirectory. It contains data files specifying properties associated with Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] and with Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14], as well as test data for those algorithms. See Segmentation Test Files and Documentation for more information about the test data.
Certain data files associated with emoji properties are maintained in the emoji subdirectory of the ucd subdirectory. Those data files define the simple character properties associated with emoji characters, as well as the emoji variation sequences. Other data files associated with emoji, including those which define the RGI ("recommended for general interchange") sets of various types of emoji sequences, as well as emoji test data, are maintained elsewhere, and are not considered formally a part of the UCD. See [UTS51] for documentation regarding those data files and their content.
The XML version of the UCD is located in the ucdxml subdirectory of the numbered version directory. See the UCD in XML for more details.
The code charts specific to a version of Unicode are archived as a single large PDF file in the charts subdirectory of the numbered version directory. See the readme.txt in that subdirectory and the general web page explaining the Unicode Code Charts for more details.
Prior to the formal release of a version of the UCD, draft files are made available for review in a subdirectory named draft, under the /Public directory on the Unicode server. The files in this directory may include temporary files, including documentation of differences between draft versions. The number of reviews is not fixed—a beta review will always take place, but an alpha review is optional.
Notices contained in a ReadMe.txt file in the draft/UCD directory during the beta review period also make it clear that that directory contains preliminary material under review, rather than a final, stable release.
The UCD in XML was introduced in Version 5.1.0, so UCD directories prior to that do not contain the ucdxml subdirectory.
UCD directories prior to Version 13.0.0 do not contain the emoji subdirectory.
UCD directories prior to Version 4.1.0 do not contain the auxiliary subdirectory.
UCD directories prior to Version 3.2.0 do not contain the extracted subdirectory.
The general structure of the file directory for a released version of the UCD described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0, versions of the UCD were not self-contained, complete sets of data files for that version, but instead only contained any new data files or any data files which had changed since the prior release.
Because of this, the property files for a given version prior to Version 4.1.0 can be spread over several directories. Consult the component listings at Enumerated Versions to find out which files in which directories comprise a complete set of data files for that version.
The directory naming conventions and the file naming conventions also differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD is contained in a directory named 4.0-Update, and Version 4.0.1 of the UCD in a directory named 4.0-Update1. Furthermore, for these earlier versions, the data file names do contain explicit version numbers.
Files in the UCD use the format conventions described in this section, unless otherwise specified.
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
For character ranges using this convention, the names of all characters in the range are algorithmically derivable. See Section 4.8, Name in [Unicode] for more information on derivation of character names for such ranges.
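A parser can collapse such a pair of boundary lines into a single range. The sketch below assumes the UnicodeData.txt convention in which the name field of the two boundary lines reads `<Name, First>` and `<Name, Last>`. The function name and the choice of returned fields are illustrative only, and the sample lines are a small excerpt rather than the full file.

```python
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]


def expand_ranges(lines):
    """Yield (start, end, name, general_category) per entry or range."""
    pending_start = None
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name = fields[1].strip()
        if name.endswith(", First>"):
            # Remember the start; the next line should close the range.
            pending_start = code_point
        elif name.endswith(", Last>"):
            if pending_start is None:
                raise ValueError("range end without a matching start")
            # Strip the leading "<" and the trailing ", Last>".
            range_name = name[1:-len(", Last>")]
            yield pending_start, code_point, range_name, fields[2]
            pending_start = None
        else:
            yield code_point, code_point, name, fields[2]


for start, end, name, gc in expand_ranges(SAMPLE):
    print(f"{start:04X}..{end:04X} {gc} {name}")
# 0041..0041 Lu LATIN CAPITAL LETTER A
# 4E00..9FEF Lo CJK Ideograph
```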
09B2 ; Bengali # Lo BENGALI LETTER LA
0386 ; Greek # L& GREEK CAPITAL LETTER ALPHA WITH TONOS
L& as used in these comments is an alias for the derived LC value (cased letter) for the General_Category property, as documented in PropertyValueAliases.txt.
00BC..00BE ; Numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
2065 ; Default_Ignorable_Code_Point # Cn
Table 3. Code Point Label Tags
Tag | General_Category | Note |
---|---|---|
reserved | Cn | Noncharacter_Code_Point=F |
noncharacter | Cn | Noncharacter_Code_Point=T |
control | Cc | |
private-use | Co | |
surrogate | Cs | |
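The mapping in Table 3 can be sketched as a lookup on General_Category, with Cn split by noncharacter status. The code below relies on Python's standard `unicodedata` module for General_Category and on the fixed set of noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane). The function names are hypothetical, and the result for unassigned code points depends on the Unicode version bundled with Python.

```python
import unicodedata


def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def label_tag(code_point):
    """Return the Table 3 tag for a code point, or None if none applies."""
    gc = unicodedata.category(chr(code_point))
    if gc == "Cn":
        return "noncharacter" if is_noncharacter(code_point) else "reserved"
    return {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}.get(gc)


for cp in (0x0009, 0xE000, 0xD800, 0xFFFF, 0x0378, 0x0041):
    print(f"{cp:04X}", label_tag(cp))
```

Code points with any other General_Category value, such as U+0041, have character names rather than label tags, so the sketch returns None for them.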
03D2 ; FC_NFKC; 03C5 # L& GREEK UPSILON WITH HOOK SYMBOL
03D3 ; FC_NFKC; 03CD # L& GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
0640 ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm ARABIC TATWEEL
# All code points not explicitly listed for Script
# have the value Unknown (Zzzz).
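The example lines in this section share a common shape: a code point or range, one or more semicolon-delimited fields, and an optional "#" comment. A minimal parsing sketch for that shape follows. The function name is hypothetical, and individual data files may have conventions that this sketch does not cover, so the header of each file should be consulted.

```python
def parse_property_line(line):
    """Parse 'XXXX[..YYYY] ; field ; ... # comment'.

    Returns (start, end, fields, comment), or None for blank and
    comment-only lines.
    """
    data, _, comment = line.partition("#")
    data = data.strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    start_text, _, end_text = fields[0].partition("..")
    start = int(start_text, 16)
    end = int(end_text, 16) if end_text else start
    return start, end, fields[1:], comment.strip()


print(parse_property_line("2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE"))
print(parse_property_line("03D2 ; FC_NFKC; 03C5 # L& GREEK UPSILON WITH HOOK SYMBOL"))

# A space-delimited value list can be split further after parsing.
_, _, fields, _ = parse_property_line(
    "0640 ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm ARABIC TATWEEL"
)
print(fields[0].split())
```

Splitting on "#" before splitting on ";" keeps the ".." inside a comment such as "EN QUAD..HAIR SPACE" from being mistaken for a code point range.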
Default values for common catalog, enumeration, and numeric properties are listed in Table 4, along with the exceptional binary property, Extended_Pictographic. Further explanation is provided below the table, in those cases where the default values are complex, as indicated in the third column.
Table 4. Default Values for Properties
Property Name | Default Value(s) | Complex? |
---|---|---|
Age | Unassigned (= NA) | No |
Bidi_Class | L, AL, R, BN, ET | Yes |
Block | No_Block | No |
Canonical_Combining_Class | Not_Reordered (= 0) | No |
Decomposition_Type | None | No |
East_Asian_Width | Neutral (= N), Wide (= W) | Yes |
Extended_Pictographic | N (= False), Y (= True) | Yes |
General_Category | Cn | No |
Line_Break | Unknown (= XX), ID, PR | Yes |
Numeric_Type | None | No |
Numeric_Value | NaN | No |
Script | Unknown (= Zzzz) | No |
Vertical_Orientation | Rotated (= R), Upright (= U) | Yes |
Complex default values are those which take multiple values, contingent on code point ranges or other conditions. Complex default values other than those specified in the "@missing" line are explicitly listed in the relevant property file, except for instances noted in this section. This means that a parser extracting property values from the UCD should never encounter an ambiguous condition for which the default value of a property for a particular code point is unclear.
Specially-formatted comment lines with the keyword "@missing" are used to define default property values for ranges of code points not explicitly listed in a data file. These lines follow regular conventions that make them machine-readable.
An @missing line starts with the comment character "#", followed by a space, then the "@missing" keyword, followed by a colon, another space, a code point range, and a semicolon. Then the line typically continues with a semicolon-delimited list of one or more default property values. For example:
# @missing: 0000..10FFFF; Unknown
In general, the code point range and semicolon-delimited list follow the same syntactic conventions as the data file in which the @missing line occurs, so that any parser which interprets that data file can easily be adapted to also parse and interpret an @missing line to pick up default property values for code points.
@missing lines are also supplied for many properties in the file PropertyValueAliases.txt. In this case, because there are many @missing lines in that single data file, each @missing line in that file uses the syntactic pattern code_point_range; property_name; default_prop_val.
An @missing line is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.
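The description of @missing lines translates fairly directly into a parser. The sketch below handles both the form that carries only default values and the PropertyValueAliases.txt form that carries a property name before the default value. The function name is hypothetical, and the sketch leaves it to the caller to decide, based on which file is being read, whether the first field is a property name.

```python
PREFIX = "# @missing:"


def parse_missing_line(line):
    """Parse '# @missing: XXXX..YYYY; field[; field ...]'.

    Returns (start, end, fields), or None if the line is not an
    @missing line.
    """
    if not line.startswith(PREFIX):
        return None
    body = line[len(PREFIX):]
    # Drop any trailing comment that follows the fields.
    body = body.split("#", 1)[0]
    fields = [field.strip() for field in body.split(";")]
    start_text, _, end_text = fields[0].partition("..")
    start = int(start_text, 16)
    end = int(end_text, 16) if end_text else start
    return start, end, fields[1:]


# Form used in most property files: range; default value.
print(parse_missing_line("# @missing: 0000..10FFFF; Unknown"))

# Form following the PropertyValueAliases.txt pattern:
# range; property name; default value.
print(parse_missing_line("# @missing: 0000..10FFFF; Script; Unknown"))
```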
Because of the addition of property names when @missing lines are included in PropertyValueAliases.txt, there are currently two syntactic patterns used for @missing lines, as summarized schematically below:
In this schematic representation, "default_prop_val" stands in for
either an explicit property value or for a special tag such as