
13. Normalization forms (composition and decomposition) <uninorm.h>

This include file defines functions for transforming Unicode strings to one of the four normal forms, known as NFC, NFD, NFKC, NFKD. These transformations involve decomposition and, for NFC and NFKC, composition of Unicode characters.


13.1 Decomposition of Unicode characters

The following enumerated values are the possible types of decomposition of a Unicode character.

Constant: int UC_DECOMP_CANONICAL

Denotes canonical decomposition.

Constant: int UC_DECOMP_FONT

UCD marker: <font>. Denotes a font variant (e.g. a blackletter form).

Constant: int UC_DECOMP_NOBREAK

UCD marker: <noBreak>. Denotes a no-break version of a space or hyphen.

Constant: int UC_DECOMP_INITIAL

UCD marker: <initial>. Denotes an initial presentation form (Arabic).

Constant: int UC_DECOMP_MEDIAL

UCD marker: <medial>. Denotes a medial presentation form (Arabic).

Constant: int UC_DECOMP_FINAL

UCD marker: <final>. Denotes a final presentation form (Arabic).

Constant: int UC_DECOMP_ISOLATED

UCD marker: <isolated>. Denotes an isolated presentation form (Arabic).

Constant: int UC_DECOMP_CIRCLE

UCD marker: <circle>. Denotes an encircled form.

Constant: int UC_DECOMP_SUPER

UCD marker: <super>. Denotes a superscript form.

Constant: int UC_DECOMP_SUB

UCD marker: <sub>. Denotes a subscript form.

Constant: int UC_DECOMP_VERTICAL

UCD marker: <vertical>. Denotes a vertical layout presentation form.

Constant: int UC_DECOMP_WIDE

UCD marker: <wide>. Denotes a wide (or zenkaku) compatibility character.

Constant: int UC_DECOMP_NARROW

UCD marker: <narrow>. Denotes a narrow (or hankaku) compatibility character.

Constant: int UC_DECOMP_SMALL

UCD marker: <small>. Denotes a small variant form (CNS compatibility).

Constant: int UC_DECOMP_SQUARE

UCD marker: <square>. Denotes a CJK squared font variant.

Constant: int UC_DECOMP_FRACTION

UCD marker: <fraction>. Denotes a vulgar fraction form.

Constant: int UC_DECOMP_COMPAT

UCD marker: <compat>. Denotes an otherwise unspecified compatibility character.

The following constant denotes the maximum size of decomposition of a single Unicode character.

Macro: unsigned int UC_DECOMPOSITION_MAX_LENGTH

This macro expands to a constant that is the required size of the buffer passed to the uc_decomposition and uc_canonical_decomposition functions, counted in ucs4_t elements.

The following functions decompose a Unicode character.

Function: int uc_decomposition (ucs4_t uc, int *decomp_tag, ucs4_t *decomposition)

Returns the character decomposition mapping of the Unicode character uc. decomposition must point to an array of at least UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.

When a decomposition exists, decomposition[0..n-1] and *decomp_tag are filled and n is returned. Otherwise -1 is returned.

Function: int uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)

Returns the canonical character decomposition mapping of the Unicode character uc. decomposition must point to an array of at least UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.

When a decomposition exists, decomposition[0..n-1] is filled and n is returned. Otherwise -1 is returned.

Note: This function returns the (simple) “canonical decomposition” of uc. If you want the “full canonical decomposition” of uc, that is, the recursive application of “canonical decomposition”, use the function u*_normalize with argument UNINORM_NFD instead.


13.2 Composition of Unicode characters

The following function composes a Unicode character from two Unicode characters.

Function: ucs4_t uc_composition (ucs4_t uc1, ucs4_t uc2)

Attempts to combine the Unicode characters uc1, uc2. uc1 is known to have canonical combining class 0.

Returns the combination of uc1 and uc2, if it exists. Returns 0 otherwise.

Not all decompositions can be recombined using this function. See the Unicode file ‘CompositionExclusions.txt’ for details.


13.3 Normalization of strings

The Unicode standard defines four normalization forms for Unicode strings. The following type is used to denote a normalization form.

Type: uninorm_t

An object of type uninorm_t denotes a Unicode normalization form. This is a scalar type; its values can be compared with ==.

The following constants denote the four normalization forms.

Macro: uninorm_t UNINORM_NFD

Denotes Normalization form D: canonical decomposition.

Macro: uninorm_t UNINORM_NFC

Normalization form C: canonical decomposition, then canonical composition.

Macro: uninorm_t UNINORM_NFKD

Normalization form KD: compatibility decomposition.

Macro: uninorm_t UNINORM_NFKC

Normalization form KC: compatibility decomposition, then canonical composition.

The following functions operate on uninorm_t objects.

Function: bool uninorm_is_compat_decomposing (uninorm_t nf)

Tests whether the normalization form nf does compatibility decomposition.

Function: bool uninorm_is_composing (uninorm_t nf)

Tests whether the normalization form nf includes canonical composition.

Function: uninorm_t uninorm_decomposing_form (uninorm_t nf)

Returns the decomposing variant of the normalization form nf. This maps NFC,NFD → NFD and NFKC,NFKD → NFKD.

The following functions apply a Unicode normalization form to a Unicode string.

Function: uint8_t * u8_normalize (uninorm_t nf, const uint8_t *s, size_t n, uint8_t *resultbuf, size_t *lengthp)
Function: uint16_t * u16_normalize (uninorm_t nf, const uint16_t *s, size_t n, uint16_t *resultbuf, size_t *lengthp)
Function: uint32_t * u32_normalize (uninorm_t nf, const uint32_t *s, size_t n, uint32_t *resultbuf, size_t *lengthp)

Returns the specified normalization form of a string.

The resultbuf and lengthp arguments are as described in chapter Conventions.


13.4 Normalizing comparisons

The following functions compare Unicode strings, ignoring differences in normalization.

Function: int u8_normcmp (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp)
Function: int u16_normcmp (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp)
Function: int u32_normcmp (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp)

Compares s1 and s2, ignoring differences in normalization.

nf must be either UNINORM_NFD or UNINORM_NFKD.

If successful, sets *resultp to -1 if s1 < s2, 0 if s1 = s2, 1 if s1 > s2, and returns 0. Upon failure, returns -1 with errno set.

Function: char * u8_normxfrm (const uint8_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp)
Function: char * u16_normxfrm (const uint16_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp)
Function: char * u32_normxfrm (const uint32_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp)

Converts the string s of length n to a NUL-terminated byte sequence, in such a way that comparing u8_normxfrm (s1) and u8_normxfrm (s2) with the u8_cmp2 function is equivalent to comparing s1 and s2 with the u8_normcoll function.

nf must be either UNINORM_NFC or UNINORM_NFKC.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: int u8_normcoll (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp)
Function: int u16_normcoll (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp)
Function: int u32_normcoll (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp)

Compares s1 and s2, ignoring differences in normalization, using the collation rules of the current locale.

nf must be either UNINORM_NFC or UNINORM_NFKC.

If successful, sets *resultp to -1 if s1 < s2, 0 if s1 = s2, 1 if s1 > s2, and returns 0. Upon failure, returns -1 with errno set.


13.5 Normalization of streams of Unicode characters

A “stream of Unicode characters” is essentially a function that accepts a ucs4_t argument repeatedly, optionally combined with a function that “flushes” the stream.

Type: struct uninorm_filter

This is the data type of a stream of Unicode characters that normalizes its input according to a given normalization form and passes the normalized character sequence to the encapsulated stream of Unicode characters.

Function: struct uninorm_filter * uninorm_filter_create (uninorm_t nf, int (*stream_func) (void *stream_data, ucs4_t uc), void *stream_data)

Creates and returns a normalization filter for Unicode characters.

The pair (stream_func, stream_data) is the encapsulated stream. stream_func (stream_data, uc) receives the Unicode character uc and returns 0 if successful, or -1 with errno set upon failure.

Returns the new filter, or NULL with errno set upon failure.

Function: int uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc)

Stuffs a Unicode character into a normalizing filter. Returns 0 if successful, or -1 with errno set upon failure.

Function: int uninorm_filter_flush (struct uninorm_filter *filter)

Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with errno set upon failure.

Note! If after calling this function, additional characters are written into the filter, the resulting character sequence in the encapsulated stream will not necessarily be normalized.

Function: int uninorm_filter_free (struct uninorm_filter *filter)

Brings data buffered in the filter to its destination, the encapsulated stream, then closes and frees the filter.

Returns 0 if successful, or -1 with errno set upon failure.



This document was generated by Bruno Haible on February, 24 2024 using texi2html 1.78a.