
11. Word breaks in strings <uniwbrk.h>

This include file declares functions for determining where in a string “words” start and end. Here “words” are not necessarily the same as entities that can be looked up in dictionaries, but rather groups of consecutive characters that should not be split by text processing operations.


11.1 Word breaks in a string

The following functions determine the word breaks in a string.

Function: void u8_wordbreaks (const uint8_t *s, size_t n, char *p)
Function: void u16_wordbreaks (const uint16_t *s, size_t n, char *p)
Function: void u32_wordbreaks (const uint32_t *s, size_t n, char *p)
Function: void ulc_wordbreaks (const char *s, size_t n, char *p)

Determines the word break points in s, an array of n units, and stores the result at p[0..n-1].

p[i] = 1 means that there is a word boundary between s[i-1] and s[i].

p[i] = 0 means that s[i-1] and s[i] must not be separated.

p[0] is always set to 0. If an application wants to consider a word break to be present at the beginning of the string (before s[0]) or at the end of the string (after s[0..n-1]), it has to treat these cases explicitly.
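
The handling of the string start and end described above can be sketched as follows. This is only an illustration, not part of the library: the names struct segment and collect_segments are hypothetical, and the helper only interprets a break array p that is assumed to have been filled beforehand, for example by u8_wordbreaks (s, n, p). It treats the start of the string (before s[0]) and the end of the string (after s[n-1]) as explicit boundaries, since the p array itself does not mark them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type, not part of libunistring: a half-open range of
   units [start, start + length) that contains no word break.  */
struct segment
{
  size_t start;
  size_t length;
};

/* Hypothetical helper, not part of libunistring.
   Interprets the break array p[0..n-1] (as documented for the
   *_wordbreaks functions) and stores up to max_out segments in out.
   Returns the total number of segments found, which may exceed max_out.  */
size_t
collect_segments (const char *p, size_t n,
                  struct segment *out, size_t max_out)
{
  size_t count = 0;
  size_t start = 0;   /* The string start is treated as a boundary.  */
  size_t i;

  if (n == 0)
    return 0;

  assert (p != NULL);
  assert (max_out == 0 || out != NULL);

  for (i = 1; i <= n; i++)
    {
      /* i == n is the end of the string, treated explicitly as a boundary.
         Otherwise p[i] == 1 marks a boundary between s[i-1] and s[i].  */
      if (i == n || p[i] == 1)
        {
          if (count < max_out)
            {
              out[count].start = start;
              out[count].length = i - start;
            }
          count++;
          start = i;
        }
    }
  return count;
}
```

As a usage illustration, take the 12-unit string "Hello, world" and assume a break array that is 0 everywhere except at p[5], p[6] and p[7] (a hand-written array, chosen to match what the default word boundary rules would be expected to produce for this input). The helper then reports four segments: "Hello", ",", " " and "world". An application that only wants dictionary-like words would still have to filter out the punctuation and whitespace segments itself.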


11.2 Word break property

This is a lower-level API. The word break property is defined in Unicode Standard Annex #29, section “Word Boundaries” (https://www.unicode.org/reports/tr29/#Word_Boundaries). It is used for determining the word breaks in a string.

The following are the possible values of the word break property. More values may be added in the future.

Constant: int WBP_OTHER
Constant: int WBP_CR
Constant: int WBP_LF
Constant: int WBP_NEWLINE
Constant: int WBP_EXTEND
Constant: int WBP_FORMAT
Constant: int WBP_KATAKANA
Constant: int WBP_ALETTER
Constant: int WBP_MIDNUMLET
Constant: int WBP_MIDLETTER
Constant: int WBP_MIDNUM
Constant: int WBP_NUMERIC
Constant: int WBP_EXTENDNUMLET
Constant: int WBP_RI
Constant: int WBP_DQ
Constant: int WBP_SQ
Constant: int WBP_HL
Constant: int WBP_ZWJ
Constant: int WBP_EB
Constant: int WBP_EM
Constant: int WBP_GAZ
Constant: int WBP_EBG

The following function looks up the word break property of a character.

Function: int uc_wordbreak_property (ucs4_t uc)

Returns the Word_Break property of a Unicode character.



This document was generated by Bruno Haible on February 24, 2024 using texi2html 1.78a.