Adding UTF-8 support to Linux console

Disclaimer
Lecture about the source code of the Linux input subsystem
Lecture purpose: to implement a UTF-8 compatible subsystem
- What it is needed
Methodology
Holistic workflow
- Files involved
Code Analysis
Some thought on how to solve my problem
- Cannot be solved only in keyboard
- From the lowest level: input
Status
Resources

Disclaimer

This is an ongoing project, so do not expect a practical solution yet. I may still be working on it, or it may even require much more time. Check the Status section to see how it is going.

Lecture about the source code of the Linux input subsystem

This is a code analysis of the Linux code and also of my own code. The purpose is to help and teach new Linux kernel developers how all of this is done; the help comes from me trying to demystify code that already exists in the kernel, as well as my changes. I am by no means a full-blown kernel developer, but I believe that I have enough experience to implement UTF-8 support.

Lecture purpose: to implement a UTF-8 compatible subsystem

I use keyd, a key mapping program that creates virtual keyboards whose purpose is to inhibit and substitute actual keyboards: it receives the input of the designated keyboards and then sends the user-mapped keys to the console.

keyd does not work well with UTF-8 keys, since Linux is not UTF-8 aware. In Linux, each program has to be aware of UTF-8 itself, or another program has to do it for them; e.g. the console, which has an internal mechanism to understand UTF-8, is the former scenario, and X11 is the latter. The console can display UTF-8, but it cannot convert input to UTF-8, at least not in the way that keyd does. It can do it through the keymap feature, but that requires compiling a new keymap each time we want to change a mapped key, so we cannot change the keys dynamically.

What it is needed

Modify the input subsystem and the keyboard driver to allow keycodes bigger than 1 byte.
Modify keyd to make UTF-8 keyboards.
Test
If a new key input type is added, check that the notifiers get a consistent value.
[ ] Bonus: Test or implement the changes with real keyboards (ones with QMK firmware).

Methodology

I will only describe the code that I think is important, that is, any code that does not immediately make sense without knowledge of the kernel.

I wrote this in org mode format, and it will be converted to HTML. Please do not assume that the org file can be treated as literate programming: the code within the org file can be converted to a C file/header with org's literate programming functionality, but the generated code will not be complete or useful by itself. The Linux kernel is free software; you can find it at https://kernel.org/, where you will find the complete implementation of the code that I am about to analyze.

I will omit some parts of the code for the sake of this page's size, but also because I am not interested in some branches of the code.

Holistic workflow

The workflow is the following:

The input subsystem is loaded and waits for other drivers to use it.
The drivers wait for any device in the load table (mod_devicetable.h).
Then the input subsystem is loaded with all the information (handlers, filters, and so on) for the device matched by a driver that uses the input subsystem.
Input polls the devices' char devices and sends the input to the respective userland.
The inputs are queued to the console's queue.
The console reads and processes the inputs in the queue. If the console is running a program, something like X11, it will send the input to it.

Files involved

drivers/input/* (especially input.c): input itself.
drivers/input/misc/uinput.c: logic for userspace virtual inputs.
drivers/tty/vt/keyboard.c: Driver that uses input, which is part of console.
include/dt-bindings/input/linux-event-codes.h: location of the keycodes.

input_handle_event -> input_event_dispose -> input_pass_values ->
input_to_handler -> handler->event (kbd_event) -> kbd_keycode -> a key
type handler.                      |                      ^
                                   |                      |
                                   |                      |
                                   -> or uinput_dev_event -

The previous snippet (which is just a plain-text graph) describes a rough approximation of the input subsystem workflow. I am ignoring the plumbing mechanisms used by the kernel; this is just a view from the subsystem itself.

Code Analysis

Overview

This was the last section I wrote; I would not have been able to provide a whole overview without reading and understanding the code first. It is likely that I will use concepts or terms in this overview that are better explained later.

The input subsystem is built on the abstraction of a keyboard (kbd), which means that it translates events to keys and not to the values those keys stand for. Let me explain: linux-event-codes.h holds the values of the keys, and these are not the ASCII values of the keys (key A != ASCII A). This makes sense, at least for analog inputs like PS/2. I believed this abstraction would not be needed for USB keyboards; however, implementing USB for an OS could be a daunting task. There is a physical limitation in the electrical signaling of keyboards, thus there is no match (perhaps due to historical reasons) between the eponymous key and its ASCII equivalent.
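To make the "key A != ASCII A" point concrete, here is a small userland sketch. The KEY_* values are copied from input-event-codes.h; the helper name keycode_matches_ascii is hypothetical and exists only for this illustration.

```c
#include <stdbool.h>

/* Values copied from input-event-codes.h. */
#define KEY_T 20
#define KEY_A 30

/* Hypothetical helper: true only if a keycode happens to equal an ASCII byte. */
static bool keycode_matches_ascii(unsigned int keycode, char ascii)
{
        return keycode == (unsigned char)ascii;
}
```

KEY_A is 30 while ASCII 'A' is 65 and 'a' is 97, so the keycode only identifies a key on the keyboard; turning it into a character is the job of a keymap.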

At this point I have already found a solution. I have tested it only in the console (the Linux console implementation), and I still have to clean up the code and send the patch (maybe). There is a catch: it is an internal subsystem, used by a lot of drivers and userland programs (through the header), so I am not sure whether something could break, like X11, which I guess uses this subsystem quite extensively.

Another way would be to leave the main subsystem unmodified, which means making a new subsystem as a module. I dislike this, since it requires duplicating input subsystem code; nonetheless, I could make it work with only the bare minimum amount of code. I will see how to do it :).

I stumbled upon this in the QMK official documentation:

UNICODE_KEY_LNX | LCTL(LSFT(KC_U)) | The key to tap when beginning a Unicode sequence with the Linux input mode

This made me think that there might be a way to input Unicode characters at kernel level. After checking the source code somewhat exhaustively, I noticed that it only works at userland level, with programs known as IMFs (input method frameworks) that use, or are configured to use, the left control + left shift + u key combination to activate themselves.

input.c

input_handle_event

This is the main handler of this subsystem; almost all the other drivers call this function indirectly through input_event. It is more or less the first filter: it checks, through input_get_disposition, whether the input is valid.

void input_handle_event(struct input_dev *dev,
                        unsigned int type, unsigned int code, int value)
{
        int disposition;

        lockdep_assert_held(&dev->event_lock);

        disposition = input_get_disposition(dev, type, code, &value);
        if (disposition != INPUT_IGNORE_EVENT) {
                if (type != EV_SYN)
                        add_input_randomness(type, code, value);

                input_event_dispose(dev, disposition, type, code, value);
        }
}

input_get_disposition

This is the main check: it verifies that the values are within bounds, otherwise it returns a value that denotes that this event should be dropped (never queued). I only included the code that I am focusing on, so I am only reviewing the EV_KEY case.

static int input_get_disposition(struct input_dev *dev,
                                 unsigned int type, unsigned int code, int *pval)
{
  int disposition = INPUT_IGNORE_EVENT;
  int value = *pval;

  /* filter-out events from inhibited devices */
  if (dev->inhibited)
    return INPUT_IGNORE_EVENT;

  switch (type) {
    /* [...] */
  case EV_KEY:

is_event_supported just checks if the key code is allowed to be transmitted from the device; the upper bound in this case is KEY_MAX (0x2ff). dev is a struct input_dev and keybit is a bitmap in which each bit represents a key. The keybit field is defined as unsigned long keybit[BITS_TO_LONGS(KEY_CNT)], so it occupies 96 bytes (12 unsigned longs on a 64-bit machine), holding up to 768 keys/bits.

It looks like that is the current number of keys. It could get bigger, I guess, but I do not think it will happen; a lot of keys could be reused, at least that is what I think.

    if (is_event_supported(code, dev->keybit, KEY_MAX)) {

      /* auto-repeat bypasses state updates */
      if (value == 2) {
        disposition = INPUT_PASS_TO_HANDLERS;
        break;
      }

      if (!!test_bit(code, dev->key) != !!value) {

        __change_bit(code, dev->key);
        disposition = INPUT_PASS_TO_HANDLERS;
      }
    }
    break;
    /* [...] */
  }

  *pval = value;
  return disposition;
}
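The EV_KEY branch above can be modelled in userland. This is only a sketch under my own assumptions: struct fake_dev, key_disposition and the two return codes are hypothetical stand-ins, and test_bit/change_bit are plain reimplementations, not the kernel helpers.

```c
#include <limits.h>
#include <stdbool.h>

#define KEY_MAX 0x2ff
#define KEY_CNT (KEY_MAX + 1)
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Simplified stand-ins for INPUT_IGNORE_EVENT / INPUT_PASS_TO_HANDLERS. */
enum { IGNORE_EVENT = 0, PASS_TO_HANDLERS = 1 };

/* Hypothetical miniature of the two bitmaps kept in struct input_dev. */
struct fake_dev {
        unsigned long keybit[BITS_TO_LONGS(KEY_CNT)]; /* keys the device may emit */
        unsigned long key[BITS_TO_LONGS(KEY_CNT)];    /* keys currently held down */
};

static bool test_bit(unsigned int nr, const unsigned long *bm)
{
        return (bm[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void change_bit(unsigned int nr, unsigned long *bm)
{
        bm[nr / BITS_PER_LONG] ^= 1UL << (nr % BITS_PER_LONG);
}

static int key_disposition(struct fake_dev *dev, unsigned int code, int value)
{
        /* Same test as is_event_supported(code, dev->keybit, KEY_MAX). */
        if (code > KEY_MAX || !test_bit(code, dev->keybit))
                return IGNORE_EVENT;

        /* Auto-repeat (value 2) bypasses the state update. */
        if (value == 2)
                return PASS_TO_HANDLERS;

        /* Only a real change of state (up to down, or down to up) is passed on. */
        if (test_bit(code, dev->key) != !!value) {
                change_bit(code, dev->key);
                return PASS_TO_HANDLERS;
        }
        return IGNORE_EVENT;
}
```

With this model, pressing a key twice without releasing it passes the event only the first time, while a repeat event is always passed.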

keyboard.c

KBDMODES

There are different modes for console keyboards; the mode determines whether and how the keycode is translated for console usage. The following are the modes:

#define VC_XLATE	0	/* translate keycodes using keymap */
#define VC_MEDIUMRAW	1	/* medium raw (keycode) mode */
#define VC_RAW		2	/* raw (scancode) mode */
#define VC_UNICODE	3	/* Unicode mode */
#define VC_OFF		4	/* disabled mode */

I am quite sure that almost all keyboards are handled as Unicode, or at least that was the mode I used for debugging and also the one used by the keyboard made with uinput. However, it is quite useful to know the other modes.

VC_XLATE: it translates the key with the internal console mechanism. It is pretty much the same as VC_UNICODE, since both modes use the console keymaps, but VC_UNICODE has additional processing.
VC_RAW: it only allows sending values from 0 up to 0x7f, i.e. 127.
VC_MEDIUMRAW: almost like raw; it resembles VC_RAW when keycode <= 0x7f, otherwise it sends a three-byte sequence.
VC_UNICODE: it translates the keycode to Unicode. It does not expect well-formed Unicode as input; the keyboard machinery produces it, and the kernel never expects Unicode input. I think that it is uncommon to see a keyboard that sends Unicode unless it is a QMK keyboard, though I may be biased, since keyboards from other countries could send Unicode.
VC_OFF: I could not find any relevant information about this. It looks like under some circumstances keyboard will ignore some keys if this is activated.

Key code handlers

These symbols are defined as follows:

#define K_HANDLERS\
        k_self,		k_fn,		k_spec,		k_pad,\
        k_dead,		k_cons,		k_cur,		k_shift,\
        k_meta,		k_ascii,	k_lock,		k_lowercase,\
        k_slock,	k_dead2,	k_brl,		k_ignore

static k_handler_fn *k_handler[16] = { K_HANDLERS };

Each element is a function that will be used to put the symbols on console.

Each position has a macro to access it, defined as:

#define KT_LATIN	0	/* we depend on this being zero */
#define KT_FN		1
#define KT_SPEC		2
#define KT_PAD		3
#define KT_DEAD		4
#define KT_CONS		5
#define KT_CUR		6
#define KT_SHIFT	7
#define KT_META		8
#define KT_ASCII	9
#define KT_LOCK		10
#define KT_LETTER	11	/* symbol that can be acted upon by CapsLock */
#define KT_SLOCK	12
#define KT_DEAD2	13
#define KT_BRL		14
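The relation between the handler table and the KT_* positions can be sketched as a plain array of function pointers. This is a reduced userland model: the three handlers, their ids and the dispatch helper are hypothetical, and the signature is simplified (the handlers here only take a value and an up flag).

```c
/* Handler positions, same numbering as the KT_* macros above. */
#define KT_LATIN 0
#define KT_FN    1
#define KT_SPEC  2

/* Hypothetical ids so that a caller can observe which handler ran. */
enum handler_id { RAN_NONE = -1, RAN_SELF, RAN_FN, RAN_SPEC };

/* Simplified handler type for this model. */
typedef enum handler_id handler_fn(unsigned char value, char up_flag);

static enum handler_id h_self(unsigned char value, char up_flag)
{
        (void)value; (void)up_flag;
        return RAN_SELF;
}

static enum handler_id h_fn(unsigned char value, char up_flag)
{
        (void)value; (void)up_flag;
        return RAN_FN;
}

static enum handler_id h_spec(unsigned char value, char up_flag)
{
        (void)value; (void)up_flag;
        return RAN_SPEC;
}

/* The order of the table must match the KT_* numbering. */
static handler_fn *const handlers[] = { h_self, h_fn, h_spec };

static enum handler_id dispatch(unsigned int type, unsigned char value, char up_flag)
{
        if (type >= sizeof(handlers) / sizeof(handlers[0]))
                return RAN_NONE;        /* no handler at this position */
        return (*handlers[type])(value, up_flag);
}
```

The design point is that the type extracted from a key symbol is used directly as an array index, so adding a key type means adding both a macro and a table slot at the same position.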

Keymaps

The default keymaps are defined in drivers/tty/vt/defkeymap.c. Please notice that this file is compiled from a keymap file, so this is the Linux default, but you may have a different one. There is a map for every possible key state like shift, shift+ctrl, etc. Each map is an array of 256 (NR_KEYS) ushort values, and each ushort encodes two pieces of data: the most significant byte tells which keycode handler will be used, and the least significant byte holds the value of the keycode.

These keymaps are responsible for translating the keycode to an ASCII/Unicode key symbol. This is the definition of the default plain_map:

unsigned short plain_map[NR_KEYS] = {
        0xf200,	0xf01b,	0xf031,	0xf032,	0xf033,	0xf034,	0xf035,	0xf036,
        0xf037,	0xf038,	0xf039,	0xf030,	0xf02d,	0xf03d,	0xf07f,	0xf009,
        0xfb71,	0xfb77,	0xfb65,	0xfb72,	0xfb74,	0xfb79,	0xfb75,	0xfb69,
        0xfb6f,	0xfb70,	0xf05b,	0xf05d,	0xf201,	0xf702,	0xfb61,	0xfb73,
        0xfb64,	0xfb66,	0xfb67,	0xfb68,	0xfb6a,	0xfb6b,	0xfb6c,	0xf03b,
        0xf027,	0xf060,	0xf700,	0xf05c,	0xfb7a,	0xfb78,	0xfb63,	0xfb76,
        0xfb62,	0xfb6e,	0xfb6d,	0xf02c,	0xf02e,	0xf02f,	0xf700,	0xf30c,
        0xf703,	0xf020,	0xf207,	0xf100,	0xf101,	0xf102,	0xf103,	0xf104,
        0xf105,	0xf106,	0xf107,	0xf108,	0xf109,	0xf208,	0xf209,	0xf307,
        0xf308,	0xf309,	0xf30b,	0xf304,	0xf305,	0xf306,	0xf30a,	0xf301,
        0xf302,	0xf303,	0xf300,	0xf310,	0xf206,	0xf200,	0xf03c,	0xf10a,
        0xf10b,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,
        0xf30e,	0xf702,	0xf30d,	0xf01c,	0xf701,	0xf205,	0xf114,	0xf603,
        0xf118,	0xf601,	0xf602,	0xf117,	0xf600,	0xf119,	0xf115,	0xf116,
        0xf11a,	0xf10c,	0xf10d,	0xf11b,	0xf11c,	0xf110,	0xf311,	0xf11d,
        0xf200,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,	0xf200,
};

Using the previous keymap we can obtain the value of keycode 20, which is the key T as defined in include/dt-bindings/input/linux-event-codes.h. The value of T is therefore at position 20 of this map (if no shift mask is active), and that position holds the value 0xfb74. 0xfb - 0xf0 = 0x0b selects the keycode handler that will be used, KT_LETTER, and 0x74 is the key symbol, which is t.
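The lookup of keycode 20 can be reproduced with a few lines of C. The array below is only the first three rows of the plain_map listed above; struct decoded_keysym and decode_keysym are hypothetical names for this sketch, and it only covers entries whose high byte is 0xf0 or above.

```c
#include <assert.h>

/* Split a key symbol into its high and low byte. */
#define KTYP(x) ((x) >> 8)
#define KVAL(x) ((x) & 0xff)
#define KT_LETTER 11

/* First three rows of the plain_map shown above (positions 0 to 23). */
static const unsigned short plain_map_excerpt[24] = {
        0xf200, 0xf01b, 0xf031, 0xf032, 0xf033, 0xf034, 0xf035, 0xf036,
        0xf037, 0xf038, 0xf039, 0xf030, 0xf02d, 0xf03d, 0xf07f, 0xf009,
        0xfb71, 0xfb77, 0xfb65, 0xfb72, 0xfb74, 0xfb79, 0xfb75, 0xfb69,
};

struct decoded_keysym {
        unsigned int handler;   /* position in the handler table */
        unsigned char value;    /* the symbol itself */
};

static struct decoded_keysym decode_keysym(unsigned short keysym)
{
        struct decoded_keysym d;

        /* Assumption of this sketch: the high byte is 0xf0 or above. */
        assert(KTYP(keysym) >= 0xf0);
        d.handler = KTYP(keysym) - 0xf0;
        d.value = KVAL(keysym);
        return d;
}
```

Calling decode_keysym(plain_map_excerpt[20]) yields handler 11 (KT_LETTER) and value 0x74, the letter t.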

kbd_keycode

It is in charge of finding the true keycode value that will be delivered to the console; it also notifies all the subscribers of this subsystem [2] about the new input that has been received.

vc is the data structure of the current console, which is the device that will receive the keyboard input; it depends on fg_console, which is declared in drivers/tty/vt/vt.c.

param will be, after some possible transformations, the value that the subscribers will receive.

static void kbd_keycode(unsigned int keycode, int down, bool hw_raw)
{
  struct vc_data *vc = vc_cons[fg_console].d;
  unsigned short keysym, *key_map;
  unsigned char type;
  bool raw_mode;
  struct tty_struct *tty;
  int shift_final;
  struct keyboard_notifier_param param = { .vc = vc, .value = keycode, .down = down };
  int rc;

  tty = vc->port.tty;

  if (tty && (!tty->driver_data)) {
    /* No driver data? Strange. Okay we fix it then. */
    tty->driver_data = vc;
  }

kbd represents the current state of the console state machine.

kbd = &kbd_table[vc->vc_num];
/* [...] some SPARC-only config (which I do not need) */
rep = (down == 2);

As described in KBDMODES, VC_RAW only sends values up to 0x7f to the console. This is the implementation. emulate_raw checks if the value is correct and then puts the value in the console.

raw_mode = (kbd->kbdmode == VC_RAW);
if (raw_mode && !hw_raw)
        if (emulate_raw(vc, keycode, !down << 7))
                if (keycode < BTN_MISC && printk_ratelimit())
                        pr_warn("can't emulate rawmode for keycode %d\n",
                                keycode);

Then comes the implementation of VC_MEDIUMRAW, also explained in KBDMODES.

Notice that the most significant bit of each byte is used as a flag.

if (kbd->kbdmode == VC_MEDIUMRAW) {
        /*
         * This is extended medium raw mode, with keys above 127
         * encoded as 0, high 7 bits, low 7 bits, with the 0 bearing
         * the 'up' flag if needed. 0 is reserved, so this shouldn't
         * interfere with anything else. The two bytes after 0 will
         * always have the up flag set not to interfere with older
         * applications. This allows for 16384 different keycodes,
         * which should be enough.
         */
        if (keycode < 128) {
                put_queue(vc, keycode | (!down << 7));
        } else {
                put_queue(vc, !down << 7);
                put_queue(vc, (keycode >> 7) | BIT(7));
                put_queue(vc, keycode | BIT(7));
        }
        raw_mode = true;
}
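The byte layout described in the comment above can be tried out in userland. The following is a minimal sketch, assuming my own helper names mediumraw_encode and mediumraw_decode; it writes into a buffer instead of calling put_queue, and it masks every field to 7 bits explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one key event; returns the number of bytes written (1 or 3). */
static size_t mediumraw_encode(unsigned int keycode, int down, unsigned char out[3])
{
        unsigned char up = down ? 0 : 0x80;     /* bit 7 set means "key up" */

        assert(keycode < 16384);                /* 14 bits is all the format carries */
        if (keycode < 128) {
                out[0] = (unsigned char)(keycode | up);
                return 1;
        }
        out[0] = up;                                            /* reserved 0, plus up flag */
        out[1] = (unsigned char)(((keycode >> 7) & 0x7f) | 0x80); /* high 7 bits */
        out[2] = (unsigned char)((keycode & 0x7f) | 0x80);        /* low 7 bits */
        return 3;
}

/* Decode one event; returns the bytes consumed, or 0 if the input is too short. */
static size_t mediumraw_decode(const unsigned char *in, size_t len,
                               unsigned int *keycode, int *down)
{
        if (len < 1)
                return 0;
        *down = !(in[0] & 0x80);
        if ((in[0] & 0x7f) != 0) {
                *keycode = in[0] & 0x7f;
                return 1;
        }
        if (len < 3)
                return 0;
        *keycode = ((unsigned int)(in[1] & 0x7f) << 7) | (in[2] & 0x7f);
        return 3;
}
```

For example, pressing keycode 300 (0x12c) produces the bytes 0x00 0x82 0xac, and releasing keycode 30 produces the single byte 0x9e.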

key_down is a bitmap holding up to KEY_CNT bits (768). It is used to keep track of all the keys that are currently pressed down. Its definition is static DECLARE_BITMAP(key_down, KEY_CNT).

assign_bit(keycode, key_down, down);

The shift_final bitmask is created; it is used to designate which keymap is going to be used. The possible keymaps (key_maps) are defined in drivers/tty/vt/defkeymap.c.

param.shift = shift_final = (shift_state | kbd->slockstate) ^ kbd->lockstate;
param.ledstate = kbd->ledflagstate;
key_map = key_maps[shift_final];
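The expression that builds shift_final is easier to follow with numbers. A minimal sketch, assuming the modifier bit positions KG_SHIFT = 0, KG_ALTGR = 1, KG_CTRL = 2 and KG_ALT = 3 from the kernel's keyboard header; keymap_index is a hypothetical helper name.

```c
/* Modifier bit positions (assumed to match the kernel's keyboard header). */
#define KG_SHIFT 0
#define KG_ALTGR 1
#define KG_CTRL  2
#define KG_ALT   3
#define BIT(n) (1u << (n))

/*
 * Same expression as in kbd_keycode: held modifiers and sticky modifiers
 * are OR-ed together, and locked modifiers invert the result.
 */
static unsigned int keymap_index(unsigned int shift_state,
                                 unsigned int slockstate,
                                 unsigned int lockstate)
{
        return (shift_state | slockstate) ^ lockstate;
}
```

No modifier gives index 0 (the plain map), Shift alone gives 1, Shift+Ctrl gives 5, and holding Shift while Shift is locked cancels out to 0 again because of the XOR.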

There is a protocol for how the subscribers of this subsystem receive the inputs; it begins by sending the KBD_KEYCODE event, which is defined in include/linux/notifier.h.

rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                KBD_KEYCODE, &param);

If there is no appropriate keymap, i.e. when the shift_final mask is invalid, or if a subscriber stopped the chain, it will notify the subscribers that an unbound key was sent to keyboard; afterwards it cleans the sticky keys.

if (rc == NOTIFY_STOP || !key_map) {
        atomic_notifier_call_chain(&keyboard_notifier_list,
                                   KBD_UNBOUND_KEYCODE, &param);
        do_compute_shiftstate();
        kbd->slockstate = 0;
        return;
}

keysym gets its value from this expression. keycode has to be less than NR_KEYS, otherwise it would be beyond the upper bound of the keymaps; however, there are some values used for braille devices.

if (keycode < NR_KEYS)
        keysym = key_map[keycode];
else if (keycode >= KEY_BRL_DOT1 && keycode <= KEY_BRL_DOT8)
        keysym = U(K(KT_BRL, keycode - KEY_BRL_DOT1 + 1));
else
        return;

type holds the position of the keycode handler that will be used. keysym is a ushort, and KTYP just right-shifts it 8 bits to get the most significant byte, which holds the position.

type = KTYP(keysym);

When type (the handler position) is below 0xf0, the keysym is treated as Unicode. Even though I would call such a type "invalid", that is not quite the right word: it could be deliberately encoded this way to signal that Unicode should be used. In that case it notifies the subscribers with the KBD_UNICODE event.

if (type < 0xf0) {
        param.value = keysym;
        rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                        KBD_UNICODE, &param);
        if (rc != NOTIFY_STOP)
                if (down && !raw_mode)
                        k_unicode(vc, keysym, !down);
        return;
}

This will be the last transformation to obtain the position for the handler function.

type -= 0xf0;
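The two branches (Unicode versus handler) can be summarised in one small function. This is a sketch with hypothetical names (classify_keysym, struct classified); it mirrors only the comparison against 0xf0 and the subtraction shown above, nothing else from kbd_keycode.

```c
enum keysym_kind { KEYSYM_UNICODE, KEYSYM_HANDLER };

struct classified {
        enum keysym_kind kind;
        unsigned int handler;   /* handler position, only for KEYSYM_HANDLER */
        unsigned int value;     /* code point, or the low byte for a handler */
};

static struct classified classify_keysym(unsigned short keysym)
{
        struct classified c;
        unsigned int type = keysym >> 8;        /* what KTYP does */

        if (type < 0xf0) {
                /* The whole 16-bit value is taken as a Unicode code point. */
                c.kind = KEYSYM_UNICODE;
                c.handler = 0;
                c.value = keysym;
        } else {
                /* High byte minus 0xf0 is the handler, low byte is its argument. */
                c.kind = KEYSYM_HANDLER;
                c.handler = type - 0xf0;
                c.value = keysym & 0xff;
        }
        return c;
}
```

A keymap entry of 0x20ac (the euro sign) lands in the Unicode branch, while 0xfb74 is routed to handler 11 with value 0x74.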

Then, based on the actual type, the keycode is formatted accordingly and sent to the subscribers and to the console. Here it checks whether the code represents an ASCII letter; if so, it will use the k_self handler (KT_LATIN), which is used later to send the key as it is, without any processing. If Caps Lock is active, the keysym is looked up again in the keymap with the shift bit flipped.

if (type == KT_LETTER) {
        type = KT_LATIN;
        if (vc_kbd_led(kbd, VC_CAPSLOCK)) {
                key_map = key_maps[shift_final ^ BIT(KG_SHIFT)];
                if (key_map)
                        keysym = key_map[keycode];
        }
}

Three of the notifier events of this subsystem (include/linux/notifier.h) describe what the key resolved to: KBD_UNICODE was used before to put and send the Unicode key; KBD_UNBOUND_KEYCODE was used when the key was not bound to any map; lastly KBD_KEYSYM means that we are sending a key value, the representation of the key in ASCII.

param.value = keysym;
rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                KBD_KEYSYM, &param);

If a subscriber returns NOTIFY_STOP, the process simply ends here. The same happens in raw mode or VC_OFF, unless the key is a special key or a shift key.

if (rc == NOTIFY_STOP)
        return;

if ((raw_mode || kbd->kbdmode == VC_OFF) && type != KT_SPEC && type != KT_SHIFT)
        return;

This is the actual use of the keycode handler: this function puts the keycode in the console, and depending on the type there may be a last transformation of the key code. Check the Key code handlers section for the possible handlers; each of them is a function containing the code that decides how the input is put on the console.

(*k_handler[type])(vc, keysym & 0xff, !down);

Last but not least, it sends the closing event, KBD_POST_KEYSYM (only reached if the event was KBD_KEYSYM), and cleans the sticky keys.

param.ledstate = kbd->ledflagstate;
atomic_notifier_call_chain(&keyboard_notifier_list, KBD_POST_KEYSYM, &param);

if (type != KT_SLOCK)
        kbd->slockstate = 0;

Internal utilities

These are small functions used for basic things like predicate tests, e.g. checking if something is an event or not. These functions are "internal" because they are found within the subsystem and are not provided by any external module/subsystem or by the utility functions of the kernel itself.

is_event_supported

It checks if an event is supported; however, it does not check the event itself but its values. There is a finite set of events, and each has bounds on the value ranges it accepts. It first checks that the value is within the bounds, and then that the corresponding bit is enabled in bm. Most of the time it is used to check whether a key or sub-event (like the *bit fields in input_dev) is enabled or expected in a device's input_dev struct.

static inline int is_event_supported(unsigned int code,
                                     unsigned long *bm, unsigned int max)
{
        return code <= max && test_bit(code, bm);
}

`Notifier` and subscribers

Do you remember that I talked about this API before? It is an internal kernel mechanism that allows drivers or subsystems to receive events from another driver or subsystem; that is pretty much what it is.

Well… input and keyboard have two specific uses for this API: it is only used to notify Braille and text-to-speech devices, providing feedback about the keys and/or symbols that were sent to the console. Sadly, neither of those devices supports the internal key-to-UTF-8 mechanism that is in the kernel.
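The idea behind the notifier API can be modelled in userland as a linked list of callbacks that is walked until one of them asks to stop. Everything in this sketch is hypothetical (struct subscriber, call_chain, the MODEL_* event ids), and the return codes are simplified: the kernel's real constants have different numeric values.

```c
#include <stddef.h>

/* Simplified return codes; the kernel's numeric values differ. */
enum { NOTIFY_DONE, NOTIFY_OK, NOTIFY_STOP };

/* Hypothetical event ids for this model only. */
enum { MODEL_KEYCODE, MODEL_KEYSYM };

struct subscriber {
        int (*call)(unsigned long event, void *data);
        struct subscriber *next;
};

/* Walk the chain in order; a subscriber returning NOTIFY_STOP ends the walk. */
static int call_chain(struct subscriber *head, unsigned long event, void *data)
{
        int rc = NOTIFY_DONE;
        struct subscriber *s;

        for (s = head; s != NULL; s = s->next) {
                rc = s->call(event, data);
                if (rc == NOTIFY_STOP)
                        break;
        }
        return rc;
}

/* Example subscriber: counts every event it sees in *data. */
static int count_events(unsigned long event, void *data)
{
        (void)event;
        ++*(int *)data;
        return NOTIFY_OK;
}

/* Example subscriber: swallows key symbols, lets everything else through. */
static int swallow_keysym(unsigned long event, void *data)
{
        (void)data;
        return event == MODEL_KEYSYM ? NOTIFY_STOP : NOTIFY_OK;
}
```

This is why kbd_keycode checks rc after each call: a subscriber placed early in the chain can stop a key from reaching both the later subscribers and the console.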

Some thought on how to solve my problem

Cannot be solved only in `keyboard`

keyboard uses uint as the type to send and receive event codes, which is enough for Unicode. Nonetheless, it only receives 1 byte per event from input, which is where the problem lies. Thus I have to modify input to receive up to 4 bytes.

There is a Unicode mechanism built into the console, but it is only for keymaps, not for inputs. That is why I also rule out using this existing mechanism.
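The "up to 4 bytes" figure comes from UTF-8 itself: one code point is encoded in 1 to 4 bytes. The following standalone encoder is a sketch to show the sizes involved; utf8_encode is my own helper name and it is independent of any kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8; returns the byte count, or 0 if invalid. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
        if (cp < 0x80) {
                out[0] = (unsigned char)cp;
                return 1;
        }
        if (cp < 0x800) {
                out[0] = (unsigned char)(0xc0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3f));
                return 2;
        }
        if (cp < 0x10000) {
                if (cp >= 0xd800 && cp <= 0xdfff)
                        return 0;       /* surrogates are not valid code points */
                out[0] = (unsigned char)(0xe0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
                out[2] = (unsigned char)(0x80 | (cp & 0x3f));
                return 3;
        }
        if (cp <= 0x10ffff) {
                out[0] = (unsigned char)(0xf0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
                out[3] = (unsigned char)(0x80 | (cp & 0x3f));
                return 4;
        }
        return 0;                       /* beyond the Unicode range */
}
```

For instance U+00E9 (é) takes 2 bytes, U+20AC (€) takes 3 and U+1F600 takes 4, so a 1-byte event value cannot carry any of them.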

From the lowest level: `input`

I think that this would be the best solution, since this is where the primitive events are received and handled. After checking the source code I am somewhat sure that this would not break anything.

This approach would be quite easy to use with uinput; it would only need to enable a new event type bit. What is concerning is how this is going to work with real keyboards: as far as I know there is no way to enable that bit from physical keyboards. Drivers should do that, but it would imply modifying them.

mod_devicetable.h
input-event-codes.h

Status

I have a solution, but it only works when evdev is not used, i.e. when keyboard does not rely on evdev as the main event manager. Thus, I am debugging and playing around with evdev to make it work.