Character set conversion

When translating XML to legacy ASCII-based formats with poor Unicode support, such as man pages and Texinfo, the Unicode characters in the source document must also be translated somehow.

A straightforward character set conversion from Unicode does not suffice, because the target character set, usually US-ASCII or ISO Latin-1, does not contain common characters such as dashes and directional quotation marks that are widely used in XML documents. However, the document formatters (man and Texinfo) allow such characters to be entered with a markup escape: for example, \(lq for the left directional quote (“). If a markup-level escape is not available, an ASCII transliteration might be used instead: for example, the ASCII less-than sign < for the angle quotation mark.
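The idea of mapping characters to escapes or transliterations can be illustrated with a minimal Python sketch. This is not how utf8trans is implemented, and the table below is a hypothetical example, not the character map shipped with docbook2X; it only assumes the common roff escapes \(lq, \(rq and \(em.

```python
# Hypothetical mapping table, for illustration only.
# It is NOT the character map distributed with docbook2X.
ROFF_ESCAPES = {
    "\u201c": r"\(lq",  # left double quotation mark -> roff escape
    "\u201d": r"\(rq",  # right double quotation mark -> roff escape
    "\u2014": r"\(em",  # em dash -> roff escape
    "\u2039": "<",      # single left angle quotation mark -> ASCII transliteration
    "\u203a": ">",      # single right angle quotation mark -> ASCII transliteration
}


def map_characters(text, table=ROFF_ESCAPES):
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(map_characters("\u201cquoted\u201d and \u2039angled\u203a"))
```

Characters that have no entry in the table pass through untouched, so they still need a separate encoding conversion afterwards.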

So the Unicode character problem can be solved in two steps:

  1. utf8trans, a program included in docbook2X, maps Unicode characters to markup-level escapes or transliterations.

    Because there is not necessarily a fixed, official mapping of Unicode characters, utf8trans, unlike most character set converters, can read user-modifiable character mappings from text files and apply them.

    The files charmaps/man/roff.charmap and charmaps/man/texi.charmap contain character maps that may be used for man-page and Texinfo conversion. The programs db2x_manxml and db2x_texixml apply these character maps, or another character map specified by the user, automatically.

  2. The rest of the Unicode text is converted to some other character set (encoding). For example, a French document with accented characters (such as é) might be converted to ISO Latin-1.

    This step is applied after utf8trans character mapping, using the iconv encoding conversion tool. Both db2x_manxml and db2x_texixml can call iconv automatically when producing their output.
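The order of the two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name convert and the one-entry character map are hypothetical, and Python's built-in str.encode stands in for the iconv tool that docbook2X actually calls.

```python
def convert(text, charmap, encoding="iso-8859-1"):
    """Two-step conversion sketch (hypothetical helper, not part of docbook2X).

    Step 1: replace characters listed in charmap with escapes or transliterations.
    Step 2: encode whatever remains into the target character set.
    """
    mapped = "".join(charmap.get(ch, ch) for ch in text)
    # Strict error handling: a character that is neither mapped nor
    # representable in the target encoding raises UnicodeEncodeError.
    return mapped.encode(encoding, errors="strict")


if __name__ == "__main__":
    # Assumed one-entry character map: em dash -> roff escape.
    charmap = {"\u2014": r"\(em"}
    # The em dash is handled in step 1; the e-acute is handled in step 2.
    print(convert("caf\u00e9 \u2014 ouvert", charmap))
```

Running the mapping first matters: the em dash has no ISO Latin-1 representation, so encoding before mapping would fail, whereas the accented letter is left for the encoding step because Latin-1 can represent it directly.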