NetBSD-SoC: Wide Character Support in NetBSD curses Library

What is it?

This project is part of the Google Summer of Code program, which promotes open source software development. I am very glad that my proposal was approved, and I thank Google's Summer of Code program for the opportunity. I will make every effort to make this project a success.

NetBSD is "a free, secure, and highly portable Unix-like Open Source operating system available for many platforms, from 64-bit Opteron machines and desktop systems to handheld and embedded devices. Its clean design and advanced features make it excellent in both production and research environments, and it is user-supported with complete source." For more information, please visit NetBSD official web site.

Status

Projected milestones:

2005-09-01: Deadline for all student work (pencils down)
2005-08-31: All documents are ready

project report

DDJ

2005-08-24: Testing done, start documentation

2005-07-31: Coding finished, start testing phase
2005-07-08: Design document done, start coding phase
2005-06-24: Project accepted, started design phase

I will put any related progress reports and discussion on my blog. Please leave your comments/suggestions there. I appreciate it.

Project Deliverables

Mandatory (must-have) components

Working basic input/output routines

Status: Finished documentation.
To do: More tests with other wide character locales. Final report converted as a longer article for Daemon News.

All other input/output routines in the first four categories

Status: Finished documentation.
To do: More tests with other wide character locales. Final report converted as an article for Daemon News.

Optional (would-be-nice) components

Status: Finished documentation.
To do: More tests with other wide character locales. Final report converted as an article for Daemon News.

Documentation

Integration into NetBSD

The curses wide character support comes as patches against the NetBSD-current distribution. The patches were generated from a baseline version of the NetBSD-current source checked out on June 28, 2005. To integrate them, run the patch command to merge the changes into your current source.
Manual Page Addition

If we can get permission from The Open Group to use the Single Unix Specification (SUS) in NetBSD, that would be ideal, because we would not have to write a whole new set of man pages from scratch.

If we can't get that permission, the changes to the curses manual pages should still be minor: simple changes to the existing curses(3) man pages, indicating the availability of wide character support. Newly added functions specified in the X/Open Reference will be included in the individual man pages, such as [mv][w]add_wch() in curses_addch(3). In addition, a new option may be added to the "SYNOPSIS" section to enable/disable wide character support.
Testing

I will start with unit tests for individual functions that are added or modified. Then, we will use a simple file viewer borrowed from the ncurses package test suite. Some modifications are made to suit our needs. We will see if it works properly with wide character files. Screenshots of the latest tests can be found here: Simplified Chinese, Traditional Chinese, Japanese, and more are coming soon...
Memory usage: Using the simple file viewer, I compared the memory footprint of different curses libraries: the new NetBSD curses library with wide character support (wcurses), the traditional NetBSD curses library, and the ncurses library with and without wide character support. I used ps(1) to make simple relative comparisons of the same file viewer code linked against the different curses libraries. The tests use the same file viewer source code, which can call either the narrow character functions as a narrow character viewer (ccview and ncview) or the wide character functions as a wide character viewer (wcview and nwview). For the wide character tests, the two viewers open the same Chinese locale text, which spans multiple pages; for the narrow character tests, the viewers open their own source code (view.c). The tests were run on an i386 machine running NetBSD 2.0. Here are the results:

Wide character viewers with wide character files

        wcview  nwview
SIZE    2152K   1308K
RES     2984K   2164K

- wcview: the file viewer linked with the new curses library with wide character support (HAVE_WCHAR)
- nwview: the file viewer linked with ncurses library built with wide character support enabled
- ccview: the file viewer linked with the new curses library without wide character support
- ncview: the file viewer linked with ncurses library built without wide character support

All viewers with narrow character files

        wcview  nwview  tcview  ncview
SIZE    1440K   1128K   504K    328K
RES     2496K   1964K   1108K   1128K

- wcview: file viewer linked with new library with wide character support
- nwview: file viewer linked with ncurses library with wide character support
- tcview: file viewer linked with existing curses library
- ncview: file viewer linked with ncurses library without wide character support
References
- NetBSD Library Functions Manual, curses(3)
- Screen Updating and Cursor Movement Optimization: A Library Package, by Kenneth Arnold, Elan Amir. (The incomplete HTML version)
- X/Open Curses Reference
- GNU ncurses

Technical Details

Current NetBSD native curses library

The lack of wide character support in the NetBSD implementation of the curses library limits its support for internationalized character sets, and thus limits the use of NetBSD in countries that use wide character sets. In particular, the problem is that the NetBSD curses internal character storage and the related routines assume an 8-bit character in each position. To add wide character support, we need to add a new storage structure and a new set of wide character routines as specified in the X/Open Reference.

Storage structure: Each character is stored in a structure __ldata as defined in $SRC/lib/libcurses/curses_private.h.
```
    struct __ldata {
        wchar_t ch;         /* Character */
        attr_t  attr;       /* Attributes */
        wchar_t bch;        /* Background character */
        attr_t  battr;      /* Background attributes */
    };
    
```
This storage structure was initially designed for narrow characters. Although it uses the 32-bit wchar_t and attr_t data types, it does not support wide characters, because it assumes one character per position and has no character width field. In addition, it does not support non-spacing characters as specified in the X/Open Reference. Therefore, the storage structure must be changed to enable wide characters as well as non-spacing characters.
Narrow-character specific routines: The current curses routines assume that a character is 8 bits wide and takes one position on the screen. We will add the new routines specified in the X/Open Reference that handle wide characters. These functions generally have "wch" in their names, with some exceptions that include "wchar". The new routines only take effect if the compiler option "HAVE_WCHAR" is defined; otherwise they return an error. These functions are listed below in three categories:
In addition to these newly added functions, some existing curses routines are shared by both wide and narrow characters but assume an 8-bit character per location. Moreover, some functions directly access the storage cells when they add, insert, or delete, instead of using addch(), delch() or insch(). These functions should be identified and modified so that they work well with the wide character routines. They are listed in the following categories:
1. Add
2. Input
3. Delete
4. Refresh and cursor movement
5. Window related
There is a good chance that other routines should also be in this list. Please let me know if you find one.

Proposed changes

Modifying storage structure

In order to add support for wide characters in the NetBSD curses library, the curses internal storage of characters and attributes (as defined in $SRC/lib/libcurses/curses_private.h) needs to be modified. In particular, one possible data structure (__ldata) describing each character position could contain the following:

    typedef struct nschar_t {
        wchar_t          ch;   /* Non-spacing character */
        struct nschar_t *next; /* Next non-spacing character */
    } nschar_t;
    struct __ldata {
        wchar_t   ch;         /* Character */
        attr_t    attr;       /* Attributes */
        wchar_t   bch;        /* Background character */
        attr_t    battr;      /* Background attributes */
        nschar_t *nsp;        /* Foreground non-spacing character pointer */
        nschar_t *bnsp;       /* Background non-spacing character pointer */
    };

In this storage structure, both the character value and the attribute are 32 bits. Such a data structure is generic enough to handle all wide characters as well as non-spacing characters, and it should align nicely on 32-bit and 64-bit machines.

To save memory, there is no separate character width field, because the width needs at most 3 bits. Instead, part of the attribute specifies the width. We currently use the bits in 0x03ffff00 for the standard attributes, so the top bits (0xfc000000) could store the width. The narrow character routines could mask off the wide attribute bits (on input and output) and force the width to be one (on input).

On each line, there is one storage cell per column. For an m-column wide character, only the first storage cell holds the width of the character; the remaining m-1 cells hold position information in their width fields. For example, if a 4-column character is added to the screen, 4 storage cells are changed and their width fields get the values (4, -1, -2, -3). If we later add a character at the 2nd, 3rd or 4th position, we know we are in the middle of a multi-column character and can easily clear the other cells.

In addition to non-spacing characters in the foreground, the X/Open Reference indicates that background characters can also include non-spacing characters, so each cell must have two linked lists of non-spacing characters.
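
The width encoding described above can be sketched as follows. This is only an illustration under assumptions: the names set_wcol(), get_wcol(), WCOL_MASK and WCOL_SHIFT are hypothetical, attr_t is replaced by a 32-bit stand-in, and the six top bits are treated as a signed field so that the negative position markers fit.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t attr_t;            /* stand-in for the curses attr_t */

#define WCOL_MASK  0xfc000000u      /* top 6 bits hold the signed width */
#define WCOL_SHIFT 26

/* Store a width in [-32, 31] in the top bits; other bits are kept. */
static attr_t set_wcol(attr_t attr, int width)
{
    assert(width >= -32 && width <= 31);
    return (attr & ~WCOL_MASK) |
        (((uint32_t)width << WCOL_SHIFT) & WCOL_MASK);
}

/* Recover the signed width by sign-extending the 6-bit field. */
static int get_wcol(attr_t attr)
{
    int w = (int)((attr & WCOL_MASK) >> WCOL_SHIFT);    /* 0..63 */

    return w >= 32 ? w - 64 : w;
}
```

A narrow character routine could drop the width simply by masking the attribute with ~WCOL_MASK, which leaves the standard attribute bits untouched.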

There are concerns that this new storage structure will more than quadruple memory use. First, the memory increase is not that large, because the current character storage already uses wchar_t and an integer type attr_t. The only additional memory comes from the linked lists of non-spacing characters, which occur infrequently and can be limited by an implementation to as few as five per character. So the increase is at most 2.75 times the current space in the worst case, assuming every character has five non-spacing characters associated with it and 32-bit pointers are used. In the most common cases, memory use increases by only 25%. Moreover, such an increase is an inevitable price to pay if we want wide character support in the curses library. We can provide storage structures for both narrow and wide characters and make the choice a compile time option ("HAVE_WCHAR"), so that curses program developers can decide whether to use wide character support depending on their memory constraints.
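
The 25% and 2.75 figures can be checked with a little arithmetic. The sketch below reproduces them under assumptions that are not stated in the text: four 4-byte fields in the current cell, 4-byte pointers, list nodes made of one wchar_t plus one pointer, and only one list pointer counted per cell. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FIELD_BYTES   = 4,  /* assumed size of wchar_t and attr_t */
    POINTER_BYTES = 4,  /* assumed 32-bit pointers */
    MAX_NSCHARS   = 5   /* assumed cap on non-spacing characters */
};

/* Current cell: ch, attr, bch, battr. */
static size_t old_cell_bytes(void)
{
    return 4 * FIELD_BYTES;
}

/* New cell: old fields, list pointers, and nschars list nodes. */
static size_t new_cell_bytes(size_t list_pointers, size_t nschars)
{
    assert(nschars <= MAX_NSCHARS);
    return old_cell_bytes()
        + list_pointers * POINTER_BYTES
        + nschars * (FIELD_BYTES + POINTER_BYTES);
}

/* Growth relative to the current cell, in percent. */
static unsigned increase_percent(size_t new_bytes)
{
    assert(new_bytes >= old_cell_bytes());
    return (unsigned)((new_bytes - old_cell_bytes()) * 100
        / old_cell_bytes());
}
```

With one list pointer and no non-spacing characters the cell grows from 16 to 20 bytes; with five non-spacing characters it grows to 60 bytes. If both nsp and bnsp are counted, an otherwise empty cell takes 24 bytes under the same assumptions.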

In addition, we need to define the complex character data structure cchar_t, required by the X/Open reference to be used in functions such as in_wchstr(). It includes a string of up to 8 wide characters and its length, an attribute, and a color-pair.

#define CURSES_CCHAR_MAX 8
struct cchar_t {
    attr_t    attributes;             /* character attributes */
    unsigned  elements;               /* number of wide characters in vals[] */
    wchar_t   vals[CURSES_CCHAR_MAX]; /* wide characters including non-spacing */
};

Note that we don't define the color-pair separately, because it is the __COLOR part of the attribute: COLOR_PAIR() converts a pair number into attribute bits, and PAIR_NUMBER() extracts the pair number again.

In order to handle wide character input from the terminal in get_wch(), we need to add a circular buffer to the screen structure that keeps the array of input characters, so that both wide and narrow characters can be returned correctly by get_wch().

struct __screen {
    ...
#define MAX_CBUF_SIZE MB_LEN_MAX
    int       cbuf_head;      /* head index of the circular input character buffer */
    int       cbuf_tail;      /* tail index of the circular input character buffer */
    int       cbuf_cur;       /* index of the current character in the buffer */
    mbstate_t sp;             /* wide character input processing state */
    int       cbuf[ MAX_CBUF_SIZE ]; /* input character buffer */
};
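
A minimal sketch of such a circular buffer is shown below. It is a standalone illustration, not the code in the screen structure: struct cbuf, cbuf_put() and cbuf_get() are hypothetical names, and the count field is an assumption added here to tell a full buffer from an empty one.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_CBUF_SIZE MB_LEN_MAX

struct cbuf {
    int head;                   /* index of the oldest stored byte */
    int tail;                   /* index where the next byte goes */
    int count;                  /* hypothetical: bytes currently stored */
    int data[MAX_CBUF_SIZE];
};

/* Append one input byte; returns -1 when the buffer is full. */
static int cbuf_put(struct cbuf *b, int c)
{
    assert(b != NULL);
    if (b->count == MAX_CBUF_SIZE)
        return -1;
    b->data[b->tail] = c;
    b->tail = (b->tail + 1) % MAX_CBUF_SIZE;
    b->count++;
    return 0;
}

/* Remove the oldest byte; returns -1 when the buffer is empty. */
static int cbuf_get(struct cbuf *b, int *c)
{
    assert(b != NULL && c != NULL);
    if (b->count == 0)
        return -1;
    *c = b->data[b->head];
    b->head = (b->head + 1) % MAX_CBUF_SIZE;
    b->count--;
    return 0;
}
```

Sizing the buffer to MB_LEN_MAX means it can always hold one complete multibyte sequence while it is being assembled.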

Adding wide character support routines

With the modified internal character storage data structure, the new wide character routines must be written, and the routines that assume narrow characters and use the old storage structure must be rewritten, so that add, insert, input, delete, refresh, and cursor movement operations work properly. The major change is to (re)write the code to use the character width instead of assuming one character per position. All the wide character routines return an error if "HAVE_WCHAR" is not defined; similarly, all narrow character routines return an error when "HAVE_WCHAR" is defined. The good news is that all the underlying support routines, declared in $SRC/include/wchar.h, already exist in the current NetBSD, although some of them do not yet have a man page.

Add (Overwrite)

[mv][w]add_wch(): add a wide character

waddch()

            if wcwidth( ch ) == 0
                locate the current storage struct *lp
                add ch to the non-spacing characters list of lp
                return
            locate the current storage struct *lp
            clear the columns before current cursor location
            if wcwidth( ch ) > remaining space on the line
                clear to the end of line
                lp = struct of the first character of next line
            add ch in wcwidth( ch ) columns starting at lp
            clear the remaining columns of the 2nd character overwritten 
            advance the cursor accordingly
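
The overwrite step above can be sketched on a simplified line of cells. Everything here is an assumption made for illustration: struct cell stands in for struct __ldata, the width is kept in a plain int instead of the attribute bits, the caller passes the width rather than calling wcwidth(), and a character that does not fit is rejected instead of wrapping to the next line.

```c
#include <assert.h>
#include <wchar.h>

#define NCOLS 8                 /* hypothetical line length */

struct cell {                   /* simplified stand-in for struct __ldata */
    wchar_t ch;
    int     wcol;               /* m on the first column, -1..-(m-1) after */
};

static void blank_cell(struct cell *c)
{
    c->ch = L' ';
    c->wcol = 1;
}

/* Blank every column of the character that covers column x. */
static void clear_char_at(struct cell *line, int x)
{
    int start = line[x].wcol < 0 ? x + line[x].wcol : x;
    int end = start + line[start].wcol;

    for (int i = start; i < end && i < NCOLS; i++)
        blank_cell(&line[i]);
}

/*
 * Overwrite the cells at column x with a character `width` columns
 * wide.  Returns the new cursor column, or -1 if it does not fit.
 */
static int put_wide(struct cell *line, int x, wchar_t ch, int width)
{
    assert(x >= 0 && x < NCOLS && width >= 1);
    if (x + width > NCOLS)
        return -1;
    /* Clear whole characters, including partly covered ones. */
    for (int i = x; i < x + width; i++)
        clear_char_at(line, i);
    for (int k = 0; k < width; k++) {
        line[x + k].ch = ch;
        line[x + k].wcol = (k == 0) ? width : -k;
    }
    return x + width;
}
```

Writing a one-column character onto the second half of a two-column character blanks the first half as well, which is the clean-up the negative width markers are meant to make cheap.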

[mv][w]add_wch[n]str(): add an array of wide characters

            compute number of spacing character in str (len)
            if n != -1 && len > n
                truncate it to keep just the first n characters
                len = n
            locate the current storage struct *lp
            if str has non-spacing characters only (len == 0 )
                add the non-spacing characters to the current spacing character under cursor
                return OK
            skip leading non-spacing characters
            clear the columns of the partial character before the cursor
            while there are characters in str
                c = current character
                if ( wcwidth( c ) == 0 )
                    add c to the non-spacing character list
                else
                    if wcwidth( c ) > remaining space on the line
                        clear to the end of line 
                        return OK
                    update *lp and the next wcwidth( c ) - 1 columns
                    move lp to the next position
                move to the next character in str

[mv][w]add[n]wstr(): add a wide character string and advance the cursor

            compute number of spacing character in str (len)
            if ( n != -1 && len > n ) {
                truncate it to keep just the first n characters
                len = n
            }
            while there are characters in str {
                c = current character
                create a single wide char string wc with c
                setcchar( &cc, wc )
                add_wch( cc )
            }

[w]echo_wchar(): write a wide character and immediately refresh
pecho_wchar(): write a wide character and refresh the pad immediately

            add_wch( ch )
            refresh()

Insert

[mv][w]ins_wch(): insert a wide character in a window

            if ch is a non-spacing character
                return add_wch( ch )
            locate the current storage struct (s)
            right shift all columns from (s + rem) to end-of-line by (cw - rem) columns
            clear the partial character column at the end-of-line
            update *s and the following (cw - 1) columns

[mv][w]ins_[n]wstr(): insert a wide character string in a window

            if string starts with a non-spacing character
                return ERR
            compute total width (w) and total number of spacing characters (len) of str
            if ( n > 0 && len > n )
                truncate the string to n wide characters
                len = n
            locate the current storage struct (s)
            rem = partial character columns to the end of current character
            right shift all columns from (s + rem) to end-of-line by (w - rem) columns
            clear the partial character column at the end-of-line
            update *s and the following (w - 1) columns with str
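
The shift-and-clear steps of the insert routines can be sketched on the same kind of simplified cell array. This is an illustration under assumptions: struct cell and insert_wide() are hypothetical, the cursor is required to sit on the first column of a character (so the rem adjustment in the pseudocode is not needed), and the width is passed in by the caller.

```c
#include <assert.h>
#include <wchar.h>

#define NCOLS 8                 /* hypothetical line length */

struct cell {                   /* simplified stand-in for struct __ldata */
    wchar_t ch;
    int     wcol;               /* m on the first column, -1..-(m-1) after */
};

static void blank_cell(struct cell *c)
{
    c->ch = L' ';
    c->wcol = 1;
}

/* Insert a cw-column character at column x; returns 0, or -1 if no room. */
static int insert_wide(struct cell *line, int x, wchar_t ch, int cw)
{
    int last = NCOLS - 1;
    int start;

    assert(x >= 0 && x < NCOLS && cw >= 1);
    assert(line[x].wcol > 0);   /* sketch: cursor on a first column */
    if (x + cw > NCOLS)
        return -1;

    /* Shift the tail of the line right by cw columns. */
    for (int i = last; i >= x + cw; i--)
        line[i] = line[i - cw];

    /* A character pushed against the right edge may have lost columns. */
    start = line[last].wcol < 0 ? last + line[last].wcol : last;
    if (start >= x + cw && start + line[start].wcol > NCOLS)
        for (int i = start; i < NCOLS; i++)
            blank_cell(&line[i]);

    /* Store the new character with its width markers. */
    for (int k = 0; k < cw; k++) {
        line[x + k].ch = ch;
        line[x + k].wcol = (k == 0) ? cw : -k;
    }
    return 0;
}
```

Characters that still fit after the shift keep their markers; only a character cut by the right edge is replaced with blanks.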

Input (Read back from window)

[mv][w]get_wch(): get a wide character

            for ( ;; ) {
                switch ( state ) {
                    case NORM:
                        read a character into cbuf[ cbuf_tail ]
                        cbuf_cur = cbuf_tail = ( cbuf_tail + 1 ) % MAX_CBUF_SIZE
                        state = ASSEMBLING
                        break
                    case BACKOUT:
                        get the character from cbuf[ cbuf_cur ]
                        cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
                        if no more character in cbuf
                            state = ASSEMBLING
                        break
                    case ASSEMBLING:
                        read a character c
                        if ( EOF )
                            if cbuf is empty
                                continue
                            else
                                get the character from cbuf[ cbuf_cur ]
                                state = TIMEOUT
                        else
                            put c in cbuf[ cbuf_cur ]
                            cbuf_tail = cbuf_cur = ( cbuf_cur + 1 ) % MAX_CBUF_SIZE
                        break
                    case WC_ASSEMBLING:
                        read a character 
                        if ( EOF )
                            if cbuf is empty
                                continue
                            else
                                return the first known character
                                if cbuf is empty
                                    state = NORM
                                else
                                    state = BACKOUT
                        else
                            check for possible wide character sequence with mbrtowc()
                            if it is a possible sequence
                                continue
                            else
                                return the first known character/key
                                if cbuf is empty
                                    state = NORM
                                else 
                                    state = BACKOUT
                    default:
                        return ERR;
                }
                if state == TIMEOUT or there is no matching
                    mb_len = cbuf_tail < cbuf_cur ?  
                        MAX_CBUF_SIZE - cbuf_cur : cbuf_tail - cbuf_cur
                    ret = mbrtowc( &wc, cbuf[ cbuf_cur ], mb_len, &sp )
                    switch ( ret ) {
                        case >= 0 
                            remove the wide character sequence from cbuf
                            break
                        case -1 
                            return the first known character
                            break
                        case -2 
                            cbuf_cur = ( cbuf_cur + mb_len ) % MAX_CBUF_SIZE
                            state = WC_ASSEMBLING
                        if cbuf is empty
                            state = NORM
                        else
                            state = BACKOUT
                    }
                else
                    if key_entry[ c ] is not a leaf
                        move to the next key_entry
                    else
                        return the function key 
                        if cbuf is empty
                            state = NORM
                        else
                            state = BACKOUT
                        wc = key_entry value
                        return KEY_CODE_YES
            }
            if echoing is enabled {
                if wc is [LEFT] or [BS] or [DEL] 
                    move cursor back
                    delch()
                else
                    add_wch( wc )
            }
            if the window has been moved or modified
                refresh()
            return wc
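
The wide character assembling states rely on mbrtowc() reporting an incomplete sequence with (size_t)-2. A minimal sketch of that byte-at-a-time use is shown below; struct assembler, feed_byte() and the FEED_* results are hypothetical names, and the key sequence matching and buffering of the pseudocode are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

enum feed_result { FEED_DONE, FEED_MORE, FEED_BAD };

struct assembler {
    mbstate_t st;               /* conversion state kept between bytes */
};

static void assembler_init(struct assembler *a)
{
    assert(a != NULL);
    memset(&a->st, 0, sizeof a->st);
}

/*
 * Feed one input byte.  FEED_DONE means *out holds a complete wide
 * character, FEED_MORE means the sequence is still incomplete, and
 * FEED_BAD means the bytes seen so far are not valid in this locale.
 */
static enum feed_result feed_byte(struct assembler *a, unsigned char byte,
    wchar_t *out)
{
    char c = (char)byte;
    size_t r;

    assert(a != NULL && out != NULL);
    r = mbrtowc(out, &c, 1, &a->st);
    if (r == (size_t)-2)
        return FEED_MORE;
    if (r == (size_t)-1) {
        assembler_init(a);      /* state is undefined after an error */
        return FEED_BAD;
    }
    return FEED_DONE;           /* r is 0 (NUL) or 1 */
}
```

Which byte sequences are accepted depends on the LC_CTYPE locale in effect, so a caller would normally run setlocale() before reading input.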

[mv][w]get[n]_wstr(): get an array of wide characters and function keys from a terminal

            n = 0
            while ( ( ret = get_wch( wc ) ) != WEOF ) {
                if ret == KEY_CODE_YES 
                    restore terminal settings
                    continue
                if wc is an end-of-line or a newline character
                    wstr[ n ] = L'\0'
                    return wc
                if wc is an erase or kill character
                    n = n > 0 ? n - 1 : 0
                else
                    wstr[ n++ ] = wc
            }
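
The erase handling in this loop can be sketched without a terminal by feeding the characters from an array. The function collect_line() and its parameters are hypothetical, and only a single erase character is modelled (no kill character and no function keys).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Copy characters from in[] to out[] until a newline, applying the
 * erase character as "drop the last stored character".  Returns the
 * number of wide characters stored, not counting the terminator.
 */
static size_t collect_line(const wchar_t *in, size_t in_len, wchar_t erase,
    wchar_t *out, size_t out_cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && out_cap > 0);
    for (size_t i = 0; i < in_len; i++) {
        wchar_t wc = in[i];

        if (wc == L'\n' || wc == L'\r')
            break;
        if (wc == erase) {
            if (n > 0)
                n--;            /* erasing an empty line is a no-op */
        } else if (n + 1 < out_cap) {
            out[n++] = wc;      /* keep room for the terminator */
        }
    }
    out[n] = L'\0';
    return n;
}
```

Because the buffer holds wide characters, erasing always removes one whole character regardless of how many bytes or columns it occupied.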

[mv][w]in_wch(): extract a wide character from a window

            locate the current storage struct lp
            return lp->ch

[mv][w]in[n]wstr(): input a wide character string from a window

            locate the storage struct of the current location (s)
            locate the storage struct of the window edge (e)
            check the length of string from s to e
            if n > 0 and the length is longer than n
                e = s + n
            fp = s, cp = chstr
            while ( fp != e ) {
                *cp = fp->ch
                cp++, fp++
            }
            *cp = L'\0'

[mv][w]in_wch[n]str(): extract an array of wide characters from a window

            locate the storage struct of the current location (s)
            locate the storage struct of the window edge (e)
            check the length of string from s to e
            if n > 0 and the length is longer than n
                e = s + n
            fp = s, cp = chstr
            while ( fp != e ) {
                *cp = fp->ch
                *cp |= attributes | color-pair
                cp++, fp++
            }
            *cp = NULL

unget_wch(): push a wide character onto the input queue

ungetc()

ungetwc()

            ungetwc( wc, _cursesi_screen->infd )

Delete

            locate the storage struct of the current location (e)
            t2 = e, t1 = e + 1
            while ( t1 is not the last character on the line ) {
                copy *t1 to *t2
                t1++, t2++
            }
            fill the rest of line with a blank character string
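
A sketch of the delete step on a simplified cell array follows. It rests on assumptions beyond the pseudocode: struct cell and delete_wide() are hypothetical, and the line is shifted left by the full width of the character under the cursor so that no half of a multi-column character is left behind.

```c
#include <assert.h>
#include <wchar.h>

#define NCOLS 8                 /* hypothetical line length */

struct cell {                   /* simplified stand-in for struct __ldata */
    wchar_t ch;
    int     wcol;               /* m on the first column, -1..-(m-1) after */
};

static void blank_cell(struct cell *c)
{
    c->ch = L' ';
    c->wcol = 1;
}

/* Delete the whole character that covers column x. */
static void delete_wide(struct cell *line, int x)
{
    int start, w;

    assert(x >= 0 && x < NCOLS);
    start = line[x].wcol < 0 ? x + line[x].wcol : x;
    w = line[start].wcol;
    assert(w >= 1 && start + w <= NCOLS);

    /* Shift the rest of the line left by the character width. */
    for (int i = start; i + w < NCOLS; i++)
        line[i] = line[i + w];

    /* Fill the vacated columns at the end of the line with blanks. */
    for (int i = NCOLS - w; i < NCOLS; i++)
        blank_cell(&line[i]);
}
```

Calling delete_wide() on the second column of a two-column character removes both columns, since the negative marker leads back to the first column.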

Complex character processing

getcchar(): get a wide character string and rendition from a cchar_t

            get character value ch
            get attributes a
            get color-pair c from attribute
            assemble a complex character from (ch, a, c)

setcchar(): set cchar_t from a wide character string and rendition

            get character value ch
            get attributes a
            add color-pair c to attribute
            assemble a wide character with (ch, a)
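
The two conversions can be sketched against the cchar_t layout given earlier. The functions sk_setcchar() and sk_getcchar() are hypothetical and are not the X/Open signatures; the colour mask and shift are assumed values chosen to fall inside the 0x03ffff00 attribute range, and attr_t is a 32-bit stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

#define CURSES_CCHAR_MAX 8
#define SK_COLOR_MASK    0x03fe0000u   /* assumed position of __COLOR */
#define SK_COLOR_SHIFT   17            /* assumed shift for the pair */

typedef uint32_t attr_t;               /* stand-in for the curses attr_t */

struct cchar_t {
    attr_t   attributes;               /* attributes, colour pair folded in */
    unsigned elements;                 /* number of characters in vals[] */
    wchar_t  vals[CURSES_CCHAR_MAX];   /* spacing char plus non-spacing */
};

/* Build a complex character; returns -1 if the arguments do not fit. */
static int sk_setcchar(struct cchar_t *cc, const wchar_t *wch,
    attr_t attrs, short pair)
{
    size_t len;

    assert(cc != NULL && wch != NULL);
    len = wcslen(wch);
    if (len > CURSES_CCHAR_MAX || pair < 0 ||
        pair > (short)(SK_COLOR_MASK >> SK_COLOR_SHIFT))
        return -1;
    cc->attributes = (attrs & ~SK_COLOR_MASK) |
        ((attr_t)pair << SK_COLOR_SHIFT);
    cc->elements = (unsigned)len;
    wmemcpy(cc->vals, wch, len);
    return 0;
}

/* Split a complex character; wch needs CURSES_CCHAR_MAX + 1 elements. */
static void sk_getcchar(const struct cchar_t *cc, wchar_t *wch,
    attr_t *attrs, short *pair)
{
    assert(cc != NULL && wch != NULL && attrs != NULL && pair != NULL);
    wmemcpy(wch, cc->vals, cc->elements);
    wch[cc->elements] = L'\0';
    *attrs = cc->attributes & ~SK_COLOR_MASK;
    *pair = (short)((cc->attributes & SK_COLOR_MASK) >> SK_COLOR_SHIFT);
}
```

Folding the pair into the attribute word is what lets the structure omit a separate colour field.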

Refresh

_cursesi_wnoutrefresh(): refresh a screen
makech(): make a change on screen
quickch(): optimize changes of a window
plod(): move cursor to destination

Window related

_newwin(): create a new window buffer
wclrtobot(): clear from the cursor to the bottom of a window
wclrtoeol(): clear up to the end of line
werase(): erase everything on a window
copywin(): copy content from a window to another

Performance goal

I will try to make the implementation as fast as possible, so that it will be usable on machines like vax and m68k. I will also try to keep its memory use as small as possible for these types of machines. For example, I will try to make line comparison as efficient as possible. We will increase the hash size in __line and refresh() to include non-spacing characters. As an additional goal, I will try to make wide character support a compile time option, so that it can be omitted on boot media for small memory systems.

Testing

We will test our new library with a simple file viewer. The test program is borrowed from the ncurses test suite (view.c), with some modifications so that it specifically tests the new wide character functions. We will see if it works properly with wide character files.

Manual pages changed (if we can not reuse SUS)

curses.3: added a compiler option to enable/disable wide character support, and the indices to the wide character specific routines.
curses_addch.3: added [mv][w]add_wch().
curses_addchstr.3: added [mv][w]add_wch[n]str().
curses_addstr.3: added [mv][w]add[n]wstr().
curses_inch.3: added [mv][w]in_wch(), [mv][w]in[n]wstr() and [mv][w]in_wch[n]str().
curses_input.3: added [mv][w]get_wch() and [mv][w]get[n]_wstr().
curses_insertch.3: added [mv][w]ins_wch().
curses_insstr.3: added [mv][w]ins[n]str() and [mv][w]ins_wch(). (New from X/Open Reference)

Source code

Added new files

Add (Overwrite)

Insert

Input (Read back from window)

Test

Modified files

Common

Input

Refresh and cursor movement

Window related

Miscellaneous

You are welcome to check out the current source code. Just don't forget to send me your bug reports (if any) and comments, either by email or in my blog. To check out the current source code:

cvs -d:pserver:[email protected]:/cvsroot/netbsd-soc login (Password: just press ENTER)
cvs -z3 -d:pserver:[email protected]:/cvsroot/netbsd-soc co -P wcurses
See the README.NetBSD file for building and usage instructions

The source code can also be viewed using a web interface.

Ruibiao Qiu <[email protected]>

$Id: index.html,v 1.29 2005/09/21 14:51:00 ruibiao Exp $
