This project is part of the Google Summer of Code program, which promotes open source software development. I am very glad that my proposal was approved, and I thank Google's Summer of Code program. I will make every effort to make this project a success.
NetBSD is "a free, secure, and highly portable Unix-like Open Source operating system available for many platforms, from 64-bit Opteron machines and desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent in both production and research environments, and it is user-supported with complete source." For more information, please visit the official NetBSD web site.
Projected milestones:
Status: Finished documentation.
To do: More tests with other wide character locales.
Final report converted into a longer article for Daemon News.
The curses wide character support comes as patches to the NetBSD-current distribution. The patches are generated from a baseline version of the NetBSD-current source checked out on June 28, 2005. To integrate them, run the patch(1) command to merge the changes into your current source.
If we can get permission from The Open Group to use the Single Unix Specification (SUS) in NetBSD, that would be ideal, because we would not have to write a whole new set of man pages from scratch.
If we can't get that permission, the changes to the curses manual pages should still be minor: simple changes to the existing curses(3) man pages indicating the availability of wide character support. Newly added functions specified in the X/Open Reference will be documented in the individual man pages, such as [mv][w]add_wch() in curses_addch(3). In addition, a new option may be added to the "SYNOPSIS" section to enable or disable wide character support.
I will start with unit tests for individual functions that are added or
modified. Then, we will use a simple file viewer borrowed from the
ncurses package test suite. Some modifications are made to suit our
needs. We will see if it works properly with wide character files.
Screenshots of the latest tests can be found here:
Simplified Chinese,
Traditional Chinese,
Japanese, and more are coming soon...
Memory usage:
Using the simple file viewer, I compared the memory footprint of different
curses libraries: the new NetBSD curses library with wide character
support (wcurses), the traditional NetBSD curses library, and the ncurses
library with and without wide character support. I used ps(1) to make simple
relative comparisons of the same file viewer code linked against the different
curses libraries. The tests use the same file viewer source code, which
can call either narrow character functions as a narrow character viewer
(ccview and ncview) or wide character functions as a wide character
viewer (wcview and nwview). For the wide character tests, the two
viewers open the same Chinese locale text that spans multiple pages; for
the narrow character tests, the two viewers open their own source code
(view.c). The tests were run on an i386 machine running NetBSD 2.0.
Here are the results:
| | wcview | nwview |
|---|---|---|
| SIZE | 2152K | 1308K |
| RES | 2984K | 2164K |

| | wcview | nwview | tcview | ncview |
|---|---|---|---|---|
| SIZE | 1440K | 1128K | 504K | 328K |
| RES | 2496K | 1964K | 1108K | 1128K |
The lack of support for wide characters in the NetBSD implementation of the curses library limits its support for internationalized character sets, and thus limits the use of NetBSD in countries that use wide character sets. In particular, the problem is that the NetBSD curses internal character storage and the related routines assume an 8-bit character in each position. To add wide character support, we need to add a new storage structure and a new set of wide character routines as specified in the X/Open Reference.
struct __ldata {
    wchar_t ch; /* Character */
    attr_t attr; /* Attributes */
    wchar_t bch; /* Background character */
    attr_t battr; /* Background attributes */
};
This storage structure was initially designed for narrow characters. Although it uses the 32-bit wchar_t and attr_t data types, it does not support wide characters, because it assumes one character per position and has no character width field. In addition, it does not support non-spacing characters as specified in the X/Open Reference. Therefore, the storage structure must be changed to enable wide characters as well as non-spacing characters.
In addition to these newly added functions, some existing curses routines are shared by wide and narrow characters but assume an 8-bit character per location. Moreover, some functions directly access the storage cell when they add, insert, or delete, instead of using addch(), delch() or insch(). These functions should be identified and modified so that they work well with the wide character routines. They are listed in the following categories:
In order to add support for wide characters in the NetBSD curses library, the curses internal storage of characters and attributes (as defined in $SRC/lib/libcurses/curses_private.h) needs to be modified. In particular, one possible data structure (__ldata) describing each character position could contain the following:
typedef struct nschar_t {
    wchar_t ch; /* Non-spacing character */
    struct nschar_t *next; /* Next non-spacing character */
} nschar_t;
struct __ldata {
    wchar_t ch; /* Character */
    attr_t attr; /* Attributes */
    wchar_t bch; /* Background character */
    attr_t battr; /* Background attributes */
    nschar_t *nsp; /* Foreground non-spacing character pointer */
    nschar_t *bnsp; /* Background non-spacing character pointer */
};
In this storage structure, both the character value and the attribute are 32 bits.
Such a data structure is generic enough to handle all wide characters as
well as non-spacing characters. Besides, it should align nicely on 32-bit and
64-bit machines. To save memory, we don't add a separate character width field,
because the width needs at most 3 bits. Instead, we use part of the attribute
to specify the width. In particular, we currently use values in 0x03ffff00 as
the standard attributes, so we could use the top bits (0xfc000000) to store the width.
The narrow character routines could mask off the wide attribute part
(input/output) and force the width to be one (input). On each line, there is one
storage cell for each column. For an m-column wide character, only the first
storage cell holds the width of the character, and the remaining m-1 storage cells hold
position information in their width fields. For example, if a 4-column character
is added to the screen, 4 storage cells will be changed and their character widths
will be set to (4, -1, -2, -3). Then, if we later add a character
at the 2nd, 3rd or 4th position, we know we are in the middle of a multi-column
character and can easily clear the other cells. In addition to non-spacing characters
in the foreground, the X/Open reference indicates that background characters
can also include non-spacing characters, so we must have two linked lists of
non-spacing characters in each cell.
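The width encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the top six bits (0xfc000000) are treated as a signed field, and the names `WIDTH_MASK`, `set_width()`, `get_width()`, `mark_cells()` and `start_column()` are hypothetical, not taken from the NetBSD source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the 0xfc000000 mask comes from the text. */
#define WIDTH_MASK  0xfc000000u
#define WIDTH_SHIFT 26

typedef uint32_t attr32_t;      /* stand-in for the 32-bit attr_t */

/* Store a signed width (-32..31) in the top six bits of the attribute. */
static attr32_t set_width(attr32_t attr, int width)
{
    assert(width >= -32 && width <= 31);
    return (attr & ~WIDTH_MASK) |
           (((uint32_t)width << WIDTH_SHIFT) & WIDTH_MASK);
}

/* Recover the signed width, sign-extending the six-bit field. */
static int get_width(attr32_t attr)
{
    int w = (int)((attr & WIDTH_MASK) >> WIDTH_SHIFT);  /* 0..63 */
    return w >= 32 ? w - 64 : w;
}

/* Mark cells[col..col+width-1] as one character: (width, -1, -2, ...). */
static void mark_cells(attr32_t *cells, int col, int width)
{
    cells[col] = set_width(cells[col], width);
    for (int i = 1; i < width; i++)
        cells[col + i] = set_width(cells[col + i], -i);
}

/* From any cell of a character, return the column where it starts. */
static int start_column(const attr32_t *cells, int col)
{
    int w = get_width(cells[col]);
    return w < 0 ? col + w : col;
}
```

With this layout, a routine that lands on the 3rd column of a 4-column character reads -2 from the width field and can step back two cells to find the start of the character, while the standard attribute bits stay untouched.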
There are concerns that this new storage structure would more than quadruple memory use. First of all, the memory increase is not that large, because the current character storage already uses wchar_t and an integer type attr_t. The only possible additional memory comes from the linked list of non-spacing characters, which occur infrequently and can be limited by an implementation to as few as five. So, the memory increase is at most 2.75 times the current space in the worst case, assuming every character has five associated non-spacing characters and 32-bit pointers are used. In the most common cases, memory use increases by only 25%. On the other hand, such an increase is an inevitable price to pay if we want to enable wide character support in the curses library. We can provide storage structures for both narrow and wide characters and make the choice a compile time option ("HAVE_WCHAR"), so that curses program developers can decide whether to use wide character support depending on their memory constraints.
In addition, we need to define the complex character data structure cchar_t, required by the X/Open reference to be used in functions such as in_wchstr(). It includes a string of up to 8 wide characters and its length, an attribute, and a color-pair.
#define CURSES_CCHAR_MAX 8
typedef struct {
    attr_t attributes; /* character attributes */
    unsigned elements; /* number of wide characters in vals[] */
    wchar_t vals[CURSES_CCHAR_MAX]; /* wide characters including non-spacing */
} cchar_t;
Note that we don't define the color-pair separately, because it is the __COLOR part of the attribute
and can be extracted with the COLOR_PAIR() macro.
In order to handle wide character input from the terminal in get_wch(), we need to add a circular buffer to the screen structure to hold the array of input characters, so that wide as well as narrow characters can be returned correctly by get_wch().
struct __screen {
    ...
#define MAX_CBUF_SIZE MB_LEN_MAX
    int cbuf_head; /* head pointer to the circular input character buffer */
    int cbuf_tail; /* tail pointer to the circular input character buffer */
    int cbuf_cur; /* pointer to the current character in the buffer */
    mbstate_t sp; /* wide character input processing state */
    int cbuf[ MAX_CBUF_SIZE ]; /* input character buffer */
};
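A circular input buffer of this kind can be sketched on its own as shown here. This is an illustration only: `struct cbuf`, `cbuf_put()` and `cbuf_get()` are hypothetical names, and the `count` field is an added assumption used to tell a full buffer from an empty one; the structure above does not have it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define CBUF_SIZE MB_LEN_MAX    /* mirrors MAX_CBUF_SIZE in the text */

struct cbuf {                   /* hypothetical stand-alone version */
    int head;                   /* index of the oldest byte */
    int tail;                   /* index where the next byte is stored */
    int count;                  /* bytes currently buffered (added field) */
    int data[CBUF_SIZE];
};

static void cbuf_init(struct cbuf *b)
{
    assert(b != NULL);
    b->head = b->tail = b->count = 0;
}

/* Append one input byte; returns 0 on success, -1 if the buffer is full. */
static int cbuf_put(struct cbuf *b, int c)
{
    if (b->count == CBUF_SIZE)
        return -1;
    b->data[b->tail] = c;
    b->tail = (b->tail + 1) % CBUF_SIZE;    /* wrap around at the end */
    b->count++;
    return 0;
}

/* Remove and return the oldest byte, or -1 if the buffer is empty. */
static int cbuf_get(struct cbuf *b)
{
    int c;

    if (b->count == 0)
        return -1;
    c = b->data[b->head];
    b->head = (b->head + 1) % CBUF_SIZE;
    b->count--;
    return c;
}
```

Sizing the buffer to MB_LEN_MAX means it can always hold one complete multibyte sequence of the current locale, which is the most that needs to be kept while a wide character is being assembled.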
With the modified internal character storage data structure, the new wide character routines, as well as the routines that assume narrow characters and use the old storage structure, must be written or rewritten to make add, insert, input, delete, refresh, and cursor movement operations work properly. The major change is to (re)write the code to use the character width instead of assuming one character per position. All the wide character routines return an error if "HAVE_WCHAR" is not defined; similarly, all narrow character routines return an error when "HAVE_WCHAR" is defined. The good news is that all the underlying support routines already exist in the current NetBSD and are declared in $SRC/include/wchar.h, although some of them do not yet have a man page.
if wcwidth( ch ) == 0
    locate the current storage struct *lp
    add ch to the non-spacing characters list of lp
    return
locate the current storage struct *lp
clear the columns before current cursor location
if wcwidth( ch ) > remaining space on the line
    clear to the end of line
    lp = struct of the first character of next line
add ch in wcwidth( ch ) columns starting at lp
clear the remaining columns of the 2nd character overwritten
advance the cursor accordingly
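The spacing-character part of the add steps above can be sketched on a single line of cells. This is a simplified illustration, not the library code: `struct cell`, `clear_char_at()` and `put_wide()` are hypothetical names, the width is kept in a plain `int` rather than in attribute bits, non-spacing characters and line wrapping are left out, and the caller passes the width in (a real implementation would obtain it from wcwidth()) so that the sketch does not depend on the locale.

```c
#include <assert.h>
#include <wchar.h>

/* width: n in the first column of a character, then -1, -2, ... */
struct cell {
    wchar_t ch;
    int width;
};

/* Blank every column of the character that covers column col. */
static void clear_char_at(struct cell *line, int ncols, int col)
{
    int start = line[col].width < 0 ? col + line[col].width : col;
    int w = line[start].width > 0 ? line[start].width : 1;

    for (int i = start; i < start + w && i < ncols; i++) {
        line[i].ch = L' ';
        line[i].width = 1;
    }
}

/*
 * Put ch, occupying width columns, at column col.
 * Returns the new cursor column, or -1 if ch does not fit on the line
 * (where the caller would clear to end of line and wrap).
 */
static int put_wide(struct cell *line, int ncols, int col,
                    wchar_t ch, int width)
{
    assert(width >= 1 && col >= 0 && col < ncols);
    if (col + width > ncols)
        return -1;
    /* Remove any character that would be only partially overwritten. */
    for (int i = col; i < col + width; i++)
        clear_char_at(line, ncols, i);
    line[col].ch = ch;
    line[col].width = width;
    for (int i = 1; i < width; i++) {
        line[col + i].ch = ch;
        line[col + i].width = -i;
    }
    return col + width;
}
```

Writing a 2-column character onto the second half of an existing 2-column character blanks the first half of the old one, so no half-drawn character is left on the line.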
compute number of spacing characters in str (len)
if n != -1 && len > n
    truncate it to keep just the first n characters
    len = n
locate the current storage struct *lp
if str has non-spacing characters only ( len == 0 )
    add the non-spacing characters to the current spacing character under cursor
    return OK
skip leading non-spacing characters
clear the columns of the partial character before the cursor
while there are characters in str
    c = current character
    if ( wcwidth( c ) == 0 )
        add c to the non-spacing character list
    else
        if wcwidth( c ) > remaining space on the line
            clear to the end of line
            return OK
        update *lp and the next wcwidth( c ) - 1 columns
        move lp to the next position
    move to the next character in str

compute number of spacing characters in str (len)
if ( n != -1 && len > n ) {
    truncate it to keep just the first n characters
    len = n
}
while there are characters in str {
    c = current character
    create a single wide char string wc with c
    setcchar( &cc, wc )
    add_wch( cc )
}

add_wch( ch )
refresh()
if ch is a non-spacing character
    return add_wch( ch )
locate the current storage struct (s)
cw = wcwidth( ch )
rem = partial character columns to the end of current character
right shift all columns from (s + rem) to end-of-line by (cw - rem) columns
clear the partial character column at the end-of-line
update *s and the following (cw - 1) columns

if string starts with a non-spacing character
    return ERR
compute total width (w) and total number of spacing characters (len) of str
if ( n > 0 && len > n )
    truncate the string to n wide characters
    len = n
locate the current storage struct (s)
rem = partial character columns to the end of current character
right shift all columns from (s + rem) to end-of-line by (w - rem) columns
clear the partial character column at the end-of-line
update *s and the following (w - 1) columns with str
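The right shift in the insert steps above can be sketched as follows. It is a simplified illustration: `struct cell` and `insert_wide()` are hypothetical names, the cursor is assumed to sit on the first column of a character (so the "rem" adjustment is not needed), and the width is supplied by the caller instead of being taken from wcwidth().

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* width: n in the first column of a character, then -1, -2, ... */
struct cell {
    wchar_t ch;
    int width;
};

/*
 * Insert ch (cw columns wide) at column col, shifting the rest of the
 * line right by cw columns. Cells pushed past the right edge are lost.
 */
static void insert_wide(struct cell *line, int ncols, int col,
                        wchar_t ch, int cw)
{
    assert(cw >= 1 && col >= 0 && col + cw <= ncols);
    memmove(&line[col + cw], &line[col],
            (size_t)(ncols - col - cw) * sizeof(line[0]));

    /* If the last character is now cut off at the edge, blank it. */
    for (int i = ncols - 1; i >= col + cw; i--) {
        if (line[i].width > 0) {        /* first column of last character */
            if (i + line[i].width > ncols)
                for (int j = i; j < ncols; j++) {
                    line[j].ch = L' ';
                    line[j].width = 1;
                }
            break;
        }
    }

    line[col].ch = ch;
    line[col].width = cw;
    for (int i = 1; i < cw; i++) {
        line[col + i].ch = ch;
        line[col + i].width = -i;
    }
}
```

The backward scan finds the first column of the last character on the line; if that character no longer fits, its remaining columns are replaced with blanks, which corresponds to the "clear the partial character column at the end-of-line" step.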
for ( ;; ) {
    switch ( state ) {
    case NORM:
        read a character into cbuf[ cbuf_tail ]
        cbuf_cur = cbuf_tail = ( cbuf_tail + 1 ) % MAX_CBUF_SIZE
        state = ASSEMBLING
        break
    case BACKOUT:
        get the character from cbuf[ cbuf_cur ]
        cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
        if no more character in cbuf
            state = ASSEMBLING
        break
    case ASSEMBLING:
        read a character c
        if ( EOF )
            if cbuf is empty
                continue
            else
                get the character from cbuf[ cbuf_cur ]
                state = TIMEOUT
        else
            put c in cbuf[ cbuf_cur ]
            cbuf_tail = cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
        break
    case WC_ASSEMBLING:
        read a character
        if ( EOF )
            if cbuf is empty
                continue
            else
                return the first known character
                if cbuf is empty
                    state = NORM
                else
                    state = BACKOUT
        else
            check for possible wide character sequence with mbrtowc()
            if it is a possible sequence
                continue
            else
                return the first known character/key
                if cbuf is empty
                    state = NORM
                else
                    state = BACKOUT
    default:
        return ERR;
    }
    if state == TIMEOUT or there is no matching
        mb_len = cbuf_tail < cbuf_cur ?
            MAX_CBUF_SIZE - cbuf_cur : cbuf_tail - cbuf_cur
        ret = mbrtowc( &wc, cbuf[ cbuf_cur ], mb_len, &sp )
        switch ( ret ) {
        case >= 0
            remove the wide character sequence from cbuf
            break
        case -1
            return the first known character
            break
        case -2
            cbuf_cur = ( cbuf_cur + mb_len ) % MAX_CBUF_SIZE
            state = WC_ASSEMBLING
            if cbuf is empty
                state = NORM
            else
                state = BACKOUT
        }
    else
        if key_entry[ c ] is not a leaf
            move to the next key_entry
        else
            return the function key
            if cbuf is empty
                state = NORM
            else
                state = BACKOUT
            wc = key_entry value
            return KEY_CODE_YES
}
if echoing is enabled {
    if wc is [LEFT] or [BS] or [DEL]
        move cursor back
        delch()
    else
        add_wch( wc )
}
if the window has been moved or modified
    refresh()
return wc
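The multibyte assembly step in the state machine above relies on mbrtowc() reporting whether the buffered bytes form a complete character (a non-negative result), an incomplete one (-2), or an invalid sequence (-1). A minimal sketch of that step in isolation is shown here; `assemble_wchar()` is a hypothetical helper that feeds the bytes one at a time, as they would arrive from the terminal, and it is not part of the curses API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Feed bytes[0..n-1] to mbrtowc() one byte at a time.
 * Returns the number of bytes that formed the character stored in *out,
 * 0 if the sequence is still incomplete after n bytes,
 * or -1 if the bytes are not valid in the current locale.
 */
static int assemble_wchar(const char *bytes, size_t n, wchar_t *out)
{
    mbstate_t st;

    assert(bytes != NULL && out != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial state */
    for (size_t i = 0; i < n; i++) {
        size_t r = mbrtowc(out, &bytes[i], 1, &st);

        if (r == (size_t)-1)
            return -1;                  /* invalid: give up on this input */
        if (r == (size_t)-2)
            continue;                   /* incomplete: keep assembling */
        return (int)(i + 1);            /* complete character */
    }
    return 0;
}
```

In the default "C" locale a single ASCII byte completes immediately. After setlocale(LC_CTYPE, "") in a multibyte locale such as UTF-8, the function would be expected to keep returning to mbrtowc() until the last byte of the sequence arrives, which is the situation the WC_ASSEMBLING state handles.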
n = 0
while ( ( ret = get_wch( &wc ) ) != WEOF ) {
    if ret == KEY_CODE_YES
        restore terminal settings
        continue
    if wc is an end-of-line or a newline character
        wstr[ n ] = L'\0'
        return wc
    if wc is an erase or kill character
        n = n > 0 ? n - 1 : 0
    else
        wstr[ n++ ] = wc
}
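The line-reading loop above can be sketched in C with the input taken from a wide string instead of the terminal. This is an illustration under stated assumptions: `read_wline()` and `ERASE_CH` are hypothetical, only a single erase character (backspace) is handled, and a capacity check that the pseudocode leaves out is added.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define ERASE_CH L'\b'          /* assumed erase character for the sketch */

/*
 * Copy input up to a newline (or the end of input) into wstr, applying
 * erase characters. cap is the capacity of wstr including the terminator.
 * Returns the number of characters stored.
 */
static size_t read_wline(const wchar_t *input, wchar_t *wstr, size_t cap)
{
    size_t n = 0;

    assert(input != NULL && wstr != NULL && cap >= 1);
    for (; *input != L'\0' && *input != L'\n'; input++) {
        if (*input == ERASE_CH) {
            if (n > 0)
                n--;                    /* drop the previous character */
        } else if (n + 1 < cap) {
            wstr[n++] = *input;         /* keep room for the terminator */
        }
    }
    wstr[n] = L'\0';
    return n;
}
```

Erasing at the start of the line is a no-op, so the index never goes below zero, and characters beyond the capacity are silently dropped in this sketch.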
locate the current storage struct lp
return lp->ch

locate the storage struct of the current location (s)
locate the storage struct of the window edge (e)
check the length of string from s to e
if n > 0 and the length is longer than n
    e = s + n
fp = s, cp = chstr
while ( fp != e ) {
    *cp = fp->ch
    cp++, fp++
}
*cp = L'\0'

locate the storage struct of the current location (s)
locate the storage struct of the window edge (e)
check the length of string from s to e
if n > 0 and the length is longer than n
    e = s + n
fp = s, cp = chstr
while ( fp != e ) {
    *cp = fp->ch
    *cp |= attributes | color-pair
    cp++, fp++
}
*cp = 0

ungetwc( wc, _cursesi_screen->infd )

locate the storage struct of the current location (e)
t2 = e, t1 = e + 1
while ( t1 is not past the last character on the line ) {
    copy *t1 to *t2
    t1++, t2++
}
fill the rest of line with a blank character string
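The delete steps above shift the rest of the line left and blank the freed columns. A sketch for multi-column characters follows; it assumes that deleting any column of a wide character removes the whole character, which the pseudocode does not state, and `struct cell` and `delete_at()` are hypothetical names.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* width: n in the first column of a character, then -1, -2, ... */
struct cell {
    wchar_t ch;
    int width;
};

/* Delete the character covering column col and close the gap. */
static void delete_at(struct cell *line, int ncols, int col)
{
    int start, cw;

    assert(col >= 0 && col < ncols);
    start = line[col].width < 0 ? col + line[col].width : col;
    cw = line[start].width > 0 ? line[start].width : 1;
    if (start + cw > ncols)
        cw = ncols - start;

    /* Shift everything after the character left by its width. */
    memmove(&line[start], &line[start + cw],
            (size_t)(ncols - start - cw) * sizeof(line[0]));

    /* Fill the freed columns at the end of the line with blanks. */
    for (int i = ncols - cw; i < ncols; i++) {
        line[i].ch = L' ';
        line[i].width = 1;
    }
}
```

Because whole cells are moved, the width markers of the characters to the right travel with them and stay consistent after the shift.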
get character value ch
get attributes a
get color-pair c from attribute
assemble a complex character from (ch, a, c)
get character value ch
get attributes a
add color-pair c to attribute
assemble a wide character with (ch, a)
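The two conversions above, packing a character, attributes and color-pair into a complex character and reading the color-pair back out, can be sketched as follows. The structure mirrors the cchar_t layout given earlier, but `struct my_cchar`, `cc_set()` and `cc_pair()` are hypothetical names, and the position of the color-pair bits (`COLOR_MASK`, `COLOR_SHIFT`) is an assumption made for the illustration, not a value taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

#define CCHAR_MAX   8
#define COLOR_MASK  0x03fe0000u     /* assumed color-pair bits */
#define COLOR_SHIFT 17              /* assumed shift for the pair number */

struct my_cchar {                   /* stand-in for cchar_t */
    uint32_t attributes;            /* attributes, including color-pair */
    unsigned elements;              /* number of wide characters in vals[] */
    wchar_t vals[CCHAR_MAX];        /* spacing character plus non-spacing */
};

/*
 * Build a complex character from a wide string, attributes and a
 * color-pair number. Returns 0 on success, -1 if wcs is empty or too long.
 */
static int cc_set(struct my_cchar *cc, const wchar_t *wcs,
                  uint32_t attrs, unsigned pair)
{
    size_t len;

    assert(cc != NULL && wcs != NULL);
    len = wcslen(wcs);
    if (len == 0 || len > CCHAR_MAX)
        return -1;
    /* The color-pair lives inside the attribute word. */
    cc->attributes = (attrs & ~COLOR_MASK) |
                     (((uint32_t)pair << COLOR_SHIFT) & COLOR_MASK);
    cc->elements = (unsigned)len;
    for (size_t i = 0; i < len; i++)
        cc->vals[i] = wcs[i];
    return 0;
}

/* Extract the color-pair number from the attribute word. */
static unsigned cc_pair(const struct my_cchar *cc)
{
    return (cc->attributes & COLOR_MASK) >> COLOR_SHIFT;
}
```

Keeping the color-pair inside the attribute word is what lets the structure omit a separate color field, as noted for cchar_t above.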
I will try to make the implementation as fast as possible, so that it is usable on machines like vax and m68k. I will also try to make it use as little memory as possible for these types of machines. For example, I will try to make line comparison as efficient as possible. We will increase the hash size in __line and refresh() to include non-spacing characters. As an additional goal, I will try to make the wide character support a compile time option, so that it can be omitted on boot media for small memory systems.
We will test our new library with a simple file viewer. The test program is borrowed from the ncurses test suite (view.c). Some modifications are made so that it specifically tests the new wide character functions. We will see if it works properly with wide character files.
You are welcome to check out the current source code. Just don't forget to send me your bug reports (if any) and comments, either by email or on my blog. To check out the current source code,