Unicode defines width information for characters.  Conventionally this
describes the number of columns a character is expected to occupy when
printed or drawn using a monospaced font.

There are five width classes with which we concern ourselves.  Four of
these are narrow, wide, half-width, and full-width.  For practical
purposes, narrow and half-width can be grouped together as
"single-width" (occupying one column), and wide and full-width can be
grouped together as "double-width" (occupying two columns).

The last class we're concerned with is that of ambiguous width.  These
are characters which have the same meaning and graphical representation
everywhere, but which are either single-width or double-width based on
the context in which they appear.
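
As a rough illustration of the grouping above, the sketch below
classifies a character by its East Asian Width property using Python's
standard unicodedata module.  The function name width_class() and the
"other" bucket for neutral characters are assumptions made for this
example; they are not part of the terminal's code.

```python
import unicodedata

# East Asian Width property values, grouped as described above.
SINGLE = {"Na", "H"}   # narrow, half-width
DOUBLE = {"W", "F"}    # wide, full-width


def width_class(ch):
    """Return 'single', 'double', 'ambiguous', or 'other' for one character."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in SINGLE:
        return "single"
    if eaw in DOUBLE:
        return "double"
    if eaw == "A":
        return "ambiguous"
    # Neutral ("N") characters fall outside the classes discussed here.
    return "other"
```

For example, width_class("A") is "single", width_class("\u4e2d") is
"double", and width_class("\u00b1") (PLUS-MINUS SIGN) is "ambiguous".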

Width information is crucial for terminal-based applications which need
to address the screen:  if the application draws five characters and
expects the cursor to have moved six columns to the right, and the
terminal moves the cursor seven (or five, or any number other than six),
display bugs manifest.
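
The mismatch described above can be reproduced with a small sketch.
The columns() helper is hypothetical: it sums per-character widths and
takes the width to use for ambiguous characters as a parameter.  It
deliberately ignores combining and other zero-width characters, so it is
not a substitute for wcwidth().

```python
import unicodedata


def columns(text, ambiguous_width):
    """Sum the columns occupied by text, given a width for ambiguous chars."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2                  # wide or full-width
        elif eaw == "A":
            total += ambiguous_width    # depends on who is counting
        else:
            total += 1                  # everything else counted as one column
    return total


text = "abc\u00b1d"   # five characters, one of them (U+00B1) ambiguous
app_expects = columns(text, ambiguous_width=2)      # application's view
terminal_moves = columns(text, ambiguous_width=1)   # terminal's view
```

Here the application expects the cursor to advance six columns while
the terminal advances it five, which is exactly the kind of
disagreement that corrupts the display.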

Ambiguously-wide characters pose an implementation problem for terminals
which may not be running in the same locale as an application which is
running inside the terminal.  In these cases, the terminal cannot depend
on the libc wcwidth() function because wcwidth() typically makes use of
locale information.

There are basically four approaches to solving this problem:
A) Force characters with ambiguous width to be single-width.
B) Force characters with ambiguous width to be double-width.
C) Force characters with ambiguous width to have a width value based
   on the locale's region.
D) Force characters with ambiguous width to have a width value based
   on the locale's encoding.

Methods A and B will produce display bugs, because they don't take into
account any context information.  Method C fails on glibc-based systems
because glibc uses method D and the two methods produce different
results for the same wchar_t values.
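
The four approaches, and the way C and D can disagree, can be sketched
as follows.  Everything here is an assumption made for illustration:
the region and encoding tables are hypothetical stand-ins, the locale
parsing is simplistic (modifiers such as "@euro" are not handled), and
none of it is taken from glibc or from the terminal's code.

```python
# Hypothetical lookup tables; real ones would come from the platform.
CJK_REGIONS = {"CN", "HK", "JP", "KR", "TW"}
CJK_ENCODINGS = {"BIG5", "EUC-JP", "EUC-KR", "GB2312", "GBK", "SHIFT_JIS"}


def parse_locale(name):
    """Split 'ja_JP.EUC-JP' into (region, encoding), both upper-cased."""
    base, _, encoding = name.partition(".")
    _, _, region = base.partition("_")
    return region.upper(), encoding.upper()


def ambiguous_width(method, locale_name):
    """Width given to ambiguous characters under approach A, B, C, or D."""
    region, encoding = parse_locale(locale_name)
    if method == "A":
        return 1                                  # always single-width
    if method == "B":
        return 2                                  # always double-width
    if method == "C":
        return 2 if region in CJK_REGIONS else 1  # decided by region
    if method == "D":
        return 2 if encoding in CJK_ENCODINGS else 1  # decided by encoding
    raise ValueError("unknown method: %r" % (method,))
```

Under these assumed tables, the locale "ja_JP.UTF-8" gets a width of 2
from method C but 1 from method D, while "ja_JP.EUC-JP" gets 2 from
both.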

So the VteTerminal widget uses approach D.  Depending on the context in
which a character was received (a combination of the terminal's encoding
and whether or not the character was received as an ISO-2022 sequence),
a character is internally assigned a width when it is received from the
terminal.

Text which is not received from the terminal (input method preedit data)
is processed using method C, although now that I think about it, the
fact that it's UTF-8 text suggests that these characters should be
treated as single-width.