Adding UTF-8 support to Linux console
Disclaimer
This is an ongoing project, so do not expect a practical solution yet. I may still be working on it, or it may even require much more time. Check the status to know how it is going.
Lecture about the source code of the Linux input subsystem
This is a code analysis of the Linux code and also of my own code. The purpose is to help and teach new Linux kernel developers how all of this is done; the help comes from me trying to demystify code that already exists in the kernel, as well as my changes. I am by no means a full-blown kernel developer, but I believe that I have enough experience to implement UTF-8 support.
Lecture purpose: to implement a UTF-8 compatible subsystem
I use keyd (a key mapping program), a program that creates virtual
keyboards whose purpose is to inhibit and substitute actual keyboards. It
receives the inputs of the designated keyboards and then sends the
user-mapped keys to console.
keyd does not work well with UTF-8 keys, since Linux is not
UTF-8 aware. In Linux, either each program has to be aware of UTF-8 or
another program has to do it for them: console, which has an
internal mechanism to understand UTF-8, is an example of the former
scenario, and X11 of the latter. Console can only display UTF-8; it
cannot convert input from input to UTF-8, at least not in the way
that keyd does. Notice that it can do it through the keymap
feature, but that requires compiling a new keymap each time we want to
change a mapped key, thus we cannot change the keys dynamically.
What is needed
- Modify the input and keyboard subsystem and driver respectively to allow keycodes bigger than 1 byte.
- Modify keyd to make UTF-8 keyboards.
- Test
  - If a new key input type is added, check that the notifiers get a consistent value.
- [ ] Bonus: Test or implement the changes with real keyboards (ones with QMK firmware).
Methodology
I will describe only the code that I think is important, that is, any code that does not immediately make sense without knowledge of the kernel.
I wrote this in org mode format, and it is converted to HTML. Please do not assume that the org file can be treated as literate programming: the code within the org file can be converted to a C file/header with Org's literate programming functionality, but the generated code will not be complete or useful by itself. The Linux kernel is free software; you can find it at https://kernel.org/, where you will find the complete implementation of the code that I am about to analyze.
I will omit some parts of the code for the sake of this page's size, but also because I am not interested in some branches of the code.
Holistic workflow
The workflow is the following:
- The input subsystem is loaded and waits for other drivers to use it.
- The drivers wait for any device in the load table (mod_devicetable.h).
- Then the input subsystem is loaded with all the information (handlers, filters, and so on) for the device matched by a driver that uses the input subsystem.
- Input polls the devices' char devices and sends the input to the respective userland.
- The inputs are queued to console's queue.
- The queue is processed by console, which reads and processes the inputs. If console is running a program, something like X11, it will send the input to it.
Files involved
- drivers/input/* (especially input.c): input itself.
- drivers/input/misc/uinput.c: logic for userspace virtual inputs.
- drivers/tty/vt/keyboard.c: driver that uses input, which is part of console.
- include/dt-bindings/input/linux-event-codes.h: location of the keycodes.
input_handle_event
  -> input_event_dispose
  -> input_pass_values
  -> input_to_handler
  -> handler->event (kbd_event)
  -> kbd_keycode
  -> a key type handler

(alternative branch: -> or uinput_dev_event)
The previous snippet (which is just a plain-text graph) describes a rough approximation of the input subsystem workflow. I am ignoring the plumbing mechanisms used by the kernel; this is just a view from the subsystem itself.
Code Analysis
Overview
This was the last paragraph I wrote; I would not have been able to provide a whole overview without reading and understanding the code first. It is likely that I will use concepts or terms in this overview that are better explained later.
The input subsystem works based on the abstraction of a keyboard
(kbd), which means that it translates events to keys and not to the
respective values of the keys. Let me explain: linux-event-codes.h
holds the values of the keys, which are not represented by the
ASCII values of the keys (Key A != ASCII A). This makes sense, at
least for analog inputs like PS/2. I believed this abstraction would
not be needed for USB keyboards; however, implementing USB for an OS can be a
daunting task. There is a physical limitation in the electrical
signaling of the keyboards, thus there is not (perhaps due to
historical reasons) a match between the eponymous key and its ASCII
equivalent.
At this point I have already found a solution. I have tested it only in
console (the Linux console implementation), and I still have to clean
up the code and send the patch (maybe). There is a catch: it is an
internal subsystem, used by a lot of drivers and userland programs
(through the header), and I am not quite sure whether something could
break, like X11, which I guess uses this subsystem quite
extensively.
Another way would be to leave the main subsystem unmodified, which
means making a new subsystem as a module. I dislike this, since it
requires duplicating the input subsystem code; nonetheless I
could make it work with only the bare minimum amount of code. I
will see how to do it :).
I stumbled upon this in the official QMK documentation:
UNICODE_KEY_LNX | LCTL(LSFT(KC_U)) | The key to tap when beginning a Unicode sequence with the Linux input mode
This made me think that there may be a way to input Unicode
characters at kernel level. After checking the source code somewhat
exhaustively, I noticed that it only works at userland level, with programs
known as IMFs (input method frameworks) that use, or are configured to
use, the left control + left shift + u key combination to activate
themselves.
input.c
input_handle_event
This is the main handler of this subsystem; almost all the other
drivers call this function indirectly through input_event. It is
kind of the first filter: it checks, through
input_get_disposition, whether the input is valid.
void input_handle_event(struct input_dev *dev,
                        unsigned int type, unsigned int code, int value)
{
    int disposition;

    lockdep_assert_held(&dev->event_lock);

    disposition = input_get_disposition(dev, type, code, &value);
    if (disposition != INPUT_IGNORE_EVENT) {
        if (type != EV_SYN)
            add_input_randomness(type, code, value);

        input_event_dispose(dev, disposition, type, code, value);
    }
}
input_get_disposition
This is the main check: it verifies that the values are within bounds,
otherwise it returns a value that denotes the event should be
dropped (never queued). I only included the code that I am focusing on,
therefore I am reviewing just the EV_KEY case.
static int input_get_disposition(struct input_dev *dev,
                                 unsigned int type, unsigned int code, int *pval)
{
    int disposition = INPUT_IGNORE_EVENT;
    int value = *pval;

    /* filter-out events from inhibited devices */
    if (dev->inhibited)
        return INPUT_IGNORE_EVENT;

    switch (type) {
    /* [...] */
    case EV_KEY:
is_event_supported just checks if the key code is allowed to be
transmitted from the device; the upper bound in this case is
KEY_MAX (0x2ff). dev is a struct input_dev and keybit is a
bitmap in which each bit represents a key. The keybit field is defined as
unsigned long keybit[BITS_TO_LONGS(KEY_CNT)], thus it is an array of
unsigned longs occupying 96 bytes (12 longs on a 64-bit machine),
holding up to 768 keys/bits.
It looks like that is the current number of keys. It could get bigger, I guess, but I do not think it will happen; a lot of keys could be reused, at least that is what I think.
        if (is_event_supported(code, dev->keybit, KEY_MAX)) {

            /* auto-repeat bypasses state updates */
            if (value == 2) {
                disposition = INPUT_PASS_TO_HANDLERS;
                break;
            }

            if (!!test_bit(code, dev->key) != !!value) {
                __change_bit(code, dev->key);
                disposition = INPUT_PASS_TO_HANDLERS;
            }
        }
        break;
    /* [...] */
    }

    *pval = value;
    return disposition;
}
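The EV_KEY branch can be tried outside the kernel. The following is a minimal userspace sketch, assuming hand-written bitmap helpers; every `sketch_*` / `SKETCH_*` name is hypothetical and only mirrors the kernel logic shown above.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_KEY_MAX 0x2ff
#define SKETCH_KEY_CNT (SKETCH_KEY_MAX + 1)
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_LONGS \
    ((SKETCH_KEY_CNT + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

enum sketch_disposition { SKETCH_IGNORE = 0, SKETCH_PASS = 1 };

/* Hypothetical stand-in for the two bitmaps of struct input_dev. */
struct sketch_dev {
    unsigned long keybit[SKETCH_LONGS]; /* keys the device may emit */
    unsigned long key[SKETCH_LONGS];    /* keys currently held down */
};

static bool sketch_test_bit(unsigned int nr, const unsigned long *map)
{
    return (map[nr / SKETCH_BITS_PER_LONG] >> (nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

static void sketch_set_bit(unsigned int nr, unsigned long *map)
{
    map[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

static void sketch_change_bit(unsigned int nr, unsigned long *map)
{
    map[nr / SKETCH_BITS_PER_LONG] ^= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

/* value: 0 = release, 1 = press, 2 = auto-repeat */
static enum sketch_disposition
sketch_key_disposition(struct sketch_dev *dev, unsigned int code, int value)
{
    /* bounds check first, so the bitmap is never read out of range */
    if (code > SKETCH_KEY_MAX || !sketch_test_bit(code, dev->keybit))
        return SKETCH_IGNORE;

    if (value == 2)                     /* auto-repeat bypasses state updates */
        return SKETCH_PASS;

    if (sketch_test_bit(code, dev->key) != (value != 0)) {
        sketch_change_bit(code, dev->key);  /* record the new key state */
        return SKETCH_PASS;
    }
    return SKETCH_IGNORE;               /* same state as before: dropped */
}
```

With this sketch, a second "press" of a key that is already down is ignored, while a repeat (value 2) always passes.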
keyboard.c
KBDMODES
There are different modes for console keyboards; the mode determines whether and how the keycode is translated for console usage. The following are the modes:
#define VC_XLATE        0       /* translate keycodes using keymap */
#define VC_MEDIUMRAW    1       /* medium raw (keycode) mode */
#define VC_RAW          2       /* raw (scancode) mode */
#define VC_UNICODE      3       /* Unicode mode */
#define VC_OFF          4       /* disabled mode */
I am quite sure that almost all keyboards are handled as Unicode,
or at least that was the mode that I used for debugging and also the one
used by the keyboard made with uinput. However, it is quite useful to know the other
modes.
- VC_XLATE: it translates the key with the internal console mechanism. It is pretty much the same as VC_UNICODE, since both modes use the console keymaps, but UNICODE has additional processing.
- VC_RAW: it only allows sending values from 0 up to 0x7f, i.e. 127.
- VC_MEDIUMRAW: almost like raw; it resembles RAW when keycode <= 0x7f, otherwise it extends raw by sending three bytes instead of one.
- VC_UNICODE: it translates the keycode to Unicode. It does not expect well-formed Unicode as input; the keyboard machinery produces it, and the kernel never expects Unicode input. I think that it is uncommon to see a keyboard that sends Unicode unless it is a QMK keyboard; I may be biased, since keyboards from other countries could send Unicode.
- VC_OFF: I could not find any relevant information about this. It looks like, under some circumstances, keyboard will ignore some keys if this is activated.
Key code handlers
These symbols are defined as follows:
#define K_HANDLERS\
    k_self,     k_fn,       k_spec,     k_pad,\
    k_dead,     k_cons,     k_cur,      k_shift,\
    k_meta,     k_ascii,    k_lock,     k_lowercase,\
    k_slock,    k_dead2,    k_brl,      k_ignore

static k_handler_fn *k_handler[16] = { K_HANDLERS };
Each element is a function that is used to put the symbols on console.
Each position has a macro to access it, defined as:
#define KT_LATIN    0   /* we depend on this being zero */
#define KT_FN       1
#define KT_SPEC     2
#define KT_PAD      3
#define KT_DEAD     4
#define KT_CONS     5
#define KT_CUR      6
#define KT_SHIFT    7
#define KT_META     8
#define KT_ASCII    9
#define KT_LOCK     10
#define KT_LETTER   11  /* symbol that can be acted upon by CapsLock */
#define KT_SLOCK    12
#define KT_DEAD2    13
#define KT_BRL      14
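The table-of-handlers idea can be illustrated with a much smaller userspace sketch. It assumes only three made-up handler types and replaces the console with a one-character output buffer; all `sk_*` / `SK_*` names are hypothetical and are not the kernel's.

```c
/* Hypothetical, reduced set of handler types; the kernel has 16. */
enum { SK_LATIN = 0, SK_FN = 1, SK_SPEC = 2, SK_NR_TYPES = 3 };

/* Same shape as the kernel handlers, minus the console pointer. */
typedef void sk_handler_fn(unsigned char value, char up_flag, char *out);

static void sk_self(unsigned char value, char up_flag, char *out)
{
    if (!up_flag)               /* only key presses produce a character */
        *out = (char)value;
}

static void sk_fn(unsigned char value, char up_flag, char *out)
{
    (void)value;
    if (!up_flag)
        *out = '~';             /* placeholder for a function-key string */
}

static void sk_spec(unsigned char value, char up_flag, char *out)
{
    (void)value;
    (void)up_flag;
    (void)out;                  /* special keys print nothing in this sketch */
}

/* The position in this array plays the role of the KT_* macros. */
static sk_handler_fn *sk_handler[SK_NR_TYPES] = { sk_self, sk_fn, sk_spec };

/* Look up the handler by type and call it; returns 0 if nothing was written. */
static char sk_dispatch(unsigned int type, unsigned char value, char up_flag)
{
    char out = 0;

    if (type < SK_NR_TYPES)
        sk_handler[type](value, up_flag, &out);
    return out;
}
```

Calling `sk_dispatch(SK_LATIN, 't', 0)` yields `'t'`, while the same call with `up_flag` set to 1 yields 0, since a key release writes nothing here.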
Keymaps
The default keymaps are defined in drivers/tty/vt/defkeymap.c;
please notice that this file is compiled from a keymap file, thus
this is the Linux default, but you may have a different one. There is a map
for every possible key state like shift, shift+ctrl, etc. Each map is an
array of 256 (NR_KEYS) ushort values, because each ushort encodes
two data: the first one is found in the most significant byte and
tells which keycode handler will be used; the second one is found
in the least significant byte and holds the value of the keycode.
These keymaps are responsible for translating the keycode to an
ASCII/Unicode key symbol. This is the definition of the default
plain_map:
unsigned short plain_map[NR_KEYS] = { 0xf200, 0xf01b, 0xf031, 0xf032, 0xf033, 0xf034, 0xf035, 0xf036, 0xf037, 0xf038, 0xf039, 0xf030, 0xf02d, 0xf03d, 0xf07f, 0xf009, 0xfb71, 0xfb77, 0xfb65, 0xfb72, 0xfb74, 0xfb79, 0xfb75, 0xfb69, 0xfb6f, 0xfb70, 0xf05b, 0xf05d, 0xf201, 0xf702, 0xfb61, 0xfb73, 0xfb64, 0xfb66, 0xfb67, 0xfb68, 0xfb6a, 0xfb6b, 0xfb6c, 0xf03b, 0xf027, 0xf060, 0xf700, 0xf05c, 0xfb7a, 0xfb78, 0xfb63, 0xfb76, 0xfb62, 0xfb6e, 0xfb6d, 0xf02c, 0xf02e, 0xf02f, 0xf700, 0xf30c, 0xf703, 0xf020, 0xf207, 0xf100, 0xf101, 0xf102, 0xf103, 0xf104, 0xf105, 0xf106, 0xf107, 0xf108, 0xf109, 0xf208, 0xf209, 0xf307, 0xf308, 0xf309, 0xf30b, 0xf304, 0xf305, 0xf306, 0xf30a, 0xf301, 0xf302, 0xf303, 0xf300, 0xf310, 0xf206, 0xf200, 0xf03c, 0xf10a, 0xf10b, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf30e, 0xf702, 0xf30d, 0xf01c, 0xf701, 0xf205, 0xf114, 0xf603, 0xf118, 0xf601, 0xf602, 0xf117, 0xf600, 0xf119, 0xf115, 0xf116, 0xf11a, 0xf10c, 0xf10d, 0xf11b, 0xf11c, 0xf110, 0xf311, 0xf11d, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, 0xf200, };
Using the previous keymap we can obtain the value of keycode 20,
which is the key T as defined in
include/dt-bindings/input/linux-event-codes.h. The value of T is
therefore at position 20 of this map (provided that no shift mask
is active); that position holds the value 0xfb74. 0xfb - 0xf0 = 0x0b (11)
is the keycode handler that will be used, KT_LETTER, and 0x74 is the key symbol,
which is t.
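The lookup for keycode 20 can be reproduced with a small sketch. It assumes a sparse copy of plain_map holding only three of the entries listed above; the `sk_*` names are hypothetical helpers that do the same shift and mask as the arithmetic in the text.

```c
#define SK_NR_KEYS 256

/* Sparse excerpt of plain_map; entries not listed here are 0 in this sketch. */
static const unsigned short sk_plain_map[SK_NR_KEYS] = {
    [16] = 0xfb71,  /* keycode 16 (Q)     -> KT_LETTER, 'q' */
    [20] = 0xfb74,  /* keycode 20 (T)     -> KT_LETTER, 't' */
    [57] = 0xf020,  /* keycode 57 (space) -> KT_LATIN,  ' ' */
};

/* High byte: handler type, still offset by 0xf0. */
static unsigned int sk_ktyp(unsigned short keysym)
{
    return keysym >> 8;
}

/* Low byte: the value handed to the handler. */
static unsigned int sk_kval(unsigned short keysym)
{
    return keysym & 0xff;
}
```

For keycode 20, `sk_ktyp(sk_plain_map[20]) - 0xf0` gives 11 (KT_LETTER) and `sk_kval(sk_plain_map[20])` gives 0x74, the letter t.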
kbd_keycode
It is in charge of finding the true keycode value that will be
delivered to console; it also notifies all the subscribers of this
subsystem [2] about the new input that has been received.
vc is the data structure of the current console, which
is the device that will receive the keyboard input; it depends on
fg_console, which is declared in drivers/tty/vt/vt.c.
param will be, after some possible transformation, the value that
the subscribers will receive.
static void kbd_keycode(unsigned int keycode, int down, bool hw_raw)
{
    struct vc_data *vc = vc_cons[fg_console].d;
    unsigned short keysym, *key_map;
    unsigned char type;
    bool raw_mode;
    struct tty_struct *tty;
    int shift_final;
    struct keyboard_notifier_param param = {
        .vc = vc, .value = keycode, .down = down
    };
    int rc;

    tty = vc->port.tty;

    if (tty && (!tty->driver_data)) {
        /* No driver data? Strange. Okay we fix it then. */
        tty->driver_data = vc;
    }
kbd represents the current state of the console state machine.
kbd = &kbd_table[vc->vc_num];

/* [...] some SPARC-only config (which I do not need) */

rep = (down == 2);
As described in KBDMODES, VC_RAW only sends values up to 0x7f to
console. This is the implementation. emulate_raw checks if the value
is correct and then puts the value in console.
raw_mode = (kbd->kbdmode == VC_RAW);
if (raw_mode && !hw_raw)
    if (emulate_raw(vc, keycode, !down << 7))
        if (keycode < BTN_MISC && printk_ratelimit())
            pr_warn("can't emulate rawmode for keycode %d\n",
                    keycode);
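Then comes the implementation of VC_MEDIUMRAW, again explained in KBDMODES.
Notice that bit 7 (the most significant bit) of each byte is used as a flag: in the first byte it marks a key release, and in the two bytes that follow it is always set.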
if (kbd->kbdmode == VC_MEDIUMRAW) {
    /*
     * This is extended medium raw mode, with keys above 127
     * encoded as 0, high 7 bits, low 7 bits, with the 0 bearing
     * the 'up' flag if needed. 0 is reserved, so this shouldn't
     * interfere with anything else. The two bytes after 0 will
     * always have the up flag set not to interfere with older
     * applications. This allows for 16384 different keycodes,
     * which should be enough.
     */
    if (keycode < 128) {
        put_queue(vc, keycode | (!down << 7));
    } else {
        put_queue(vc, !down << 7);
        put_queue(vc, (keycode >> 7) | BIT(7));
        put_queue(vc, keycode | BIT(7));
    }
    raw_mode = true;
}
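To see which bytes this produces, here is a minimal userspace sketch of the same encoding. It assumes the bytes are written into a caller-provided buffer instead of being queued with put_queue; `sk_mediumraw_encode` is a hypothetical name, and the explicit 7-bit masks stand in for the truncation to one byte that happens when a value is queued.

```c
#include <stddef.h>

/*
 * Encode one key event the way the VC_MEDIUMRAW branch above does.
 * Returns the number of bytes written to out (1 or 3).
 */
static size_t sk_mediumraw_encode(unsigned int keycode, int down,
                                  unsigned char out[3])
{
    unsigned char up = down ? 0x00 : 0x80;  /* bit 7 set means key release */

    if (keycode < 128) {
        out[0] = (unsigned char)(keycode | up);
        return 1;
    }

    out[0] = up;                                              /* reserved 0, plus up flag */
    out[1] = (unsigned char)(((keycode >> 7) & 0x7f) | 0x80); /* high 7 bits */
    out[2] = (unsigned char)((keycode & 0x7f) | 0x80);        /* low 7 bits */
    return 3;
}
```

Under these assumptions, pressing keycode 300 (0x12c) yields the bytes 0x00 0x82 0xac, and releasing keycode 30 yields the single byte 0x9e.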
key_down is a bitmap holding KEY_CNT bits (768 bits, i.e. 96 bytes). It is
used to track all the keys that are currently pressed. Its
definition is static DECLARE_BITMAP(key_down, KEY_CNT).
assign_bit(keycode, key_down, down);
The shift_final bitmask is created; it designates which
keymap is going to be used. The possible keymaps (key_maps) are defined in
drivers/tty/vt/defkeymap.c.
param.shift = shift_final = (shift_state | kbd->slockstate) ^ kbd->lockstate;
param.ledstate = kbd->ledflagstate;
key_map = key_maps[shift_final];
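The mask arithmetic is easy to check in isolation. The sketch below assumes the modifier bit positions of the kernel's KG_* constants (shift 0, altgr 1, ctrl 2, alt 3); the `SK_*` names and `sk_shift_final` are hypothetical.

```c
/* Bit positions, assumed to mirror KG_SHIFT, KG_ALTGR, KG_CTRL and KG_ALT. */
enum { SK_KG_SHIFT = 0, SK_KG_ALTGR = 1, SK_KG_CTRL = 2, SK_KG_ALT = 3 };

#define SK_BIT(n) (1u << (n))

/*
 * Same expression as in kbd_keycode: held and sticky modifiers are OR-ed
 * together, then locked modifiers invert the corresponding bits.
 */
static unsigned int sk_shift_final(unsigned int shift_state,
                                   unsigned int slockstate,
                                   unsigned int lockstate)
{
    return (shift_state | slockstate) ^ lockstate;
}
```

The result is the index into key_maps: 0 when no modifier is active, `SK_BIT(SK_KG_SHIFT)` (1) with shift held, and 0 again when shift is held while a shift lock is active, because the XOR cancels it.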
There is a protocol for how the subscribers of this subsystem
receive the inputs; it begins by sending the KBD_KEYCODE event,
which is defined in include/linux/notifier.h.
rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                KBD_KEYCODE, &param);
If a subscriber stops the chain or there is no appropriate keymap, i.e. when the
shift_final mask is invalid, it notifies the subscribers that an unbound key
was sent to keyboard; afterwards it clears the sticky keys.
if (rc == NOTIFY_STOP || !key_map) {
    atomic_notifier_call_chain(&keyboard_notifier_list,
                               KBD_UNBOUND_KEYCODE, &param);
    do_compute_shiftstate();
    kbd->slockstate = 0;
    return;
}
keysym gets its value from the following expression. keycode has to be
less than NR_KEYS, otherwise it would be beyond the upper bound of the keymaps;
however, there are some values reserved for braille devices.
if (keycode < NR_KEYS)
    keysym = key_map[keycode];
else if (keycode >= KEY_BRL_DOT1 && keycode <= KEY_BRL_DOT8)
    keysym = U(K(KT_BRL, keycode - KEY_BRL_DOT1 + 1));
else
    return;
type holds the position of the keycode handler that will be used.
keysym is a ushort and KTYP just right-shifts it 8 bits to get the most
significant byte, which holds the position.
type = KTYP(keysym);
When type (which is the handler position datum) is below 0xf0, the
keysym is treated as a Unicode value. I first thought of this case as
"invalid", but that term is quite incorrect: it could be deliberately
encoded as such to know when Unicode should be used. In this case it
notifies the subscribers about the event, which is KBD_UNICODE.
if (type < 0xf0) {
    param.value = keysym;
    rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                    KBD_UNICODE, &param);
    if (rc != NOTIFY_STOP)
        if (down && !raw_mode)
            k_unicode(vc, keysym, !down);
    return;
}
This will be the last transformation to obtain the position for the handler function.
type -= 0xf0;
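The two outcomes of this check can be sketched as a small classifier. The struct and function below are hypothetical names; only the 0xf0 threshold, the shift and the subtraction come from the code above.

```c
/* Hypothetical result type: either a Unicode value or a handler index. */
struct sk_decoded {
    int is_unicode;        /* 1 when the whole keysym is a Unicode value */
    unsigned int handler;  /* index into the handler table otherwise */
    unsigned int value;    /* Unicode value, or the low byte for the handler */
};

static struct sk_decoded sk_decode(unsigned short keysym)
{
    struct sk_decoded d = { 0, 0, 0 };
    unsigned int type = keysym >> 8;    /* what KTYP extracts */

    if (type < 0xf0) {
        /* Not a typed keysym: the 16-bit value itself is the character. */
        d.is_unicode = 1;
        d.value = keysym;
        return d;
    }

    d.handler = type - 0xf0;            /* the last transformation */
    d.value = keysym & 0xff;
    return d;
}
```

For example, 0xfb74 decodes to handler 11 with value 0x74, while 0x20ac (the euro sign) is classified as Unicode and kept whole.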
Then, based on the actual type, the keycode datum is formatted
accordingly and sent to the subscribers and to console. Here it is
checked whether the code represents an ASCII letter; if so, the type
becomes KT_LATIN, so the k_self handler will be used later to send
the key as it is, without any processing. If CapsLock is active, the
keysym is looked up again in the keymap with the shift bit flipped.
if (type == KT_LETTER) {
    type = KT_LATIN;
    if (vc_kbd_led(kbd, VC_CAPSLOCK)) {
        key_map = key_maps[shift_final ^ BIT(KG_SHIFT)];
        if (key_map)
            keysym = key_map[keycode];
    }
}
Three further notifier events of this subsystem
(include/linux/notifier.h) matter here: KBD_UNICODE, used before to put and
send the Unicode key; KBD_UNBOUND_KEYCODE, also used before, when the key was not
bound to any map; and lastly KBD_KEYSYM, which means that we are sending a
key value, the representation of the key in ASCII.
param.value = keysym;
rc = atomic_notifier_call_chain(&keyboard_notifier_list,
                                KBD_KEYSYM, &param);
If a subscriber returns NOTIFY_STOP, the process ends here. In raw mode or VC_OFF, only the KT_SPEC and KT_SHIFT types continue; anything else also ends the process.
if (rc == NOTIFY_STOP)
    return;

if ((raw_mode || kbd->kbdmode == VC_OFF) && type != KT_SPEC && type != KT_SHIFT)
    return;
This is the actual use of the keycode handler: this function puts the keycode in the console and, depending on the type, applies some last transformation to the key code. Check the Key code handlers section for the possible handlers; each of them is a function, and its code shows how the input is put on the console.
(*k_handler[type])(vc, keysym & 0xff, !down);
Last but not least, the sticky keys are cleared and the final event of
the sequence that began with KBD_KEYCODE is sent: KBD_POST_KEYSYM
(only reached if the event was KBD_KEYSYM).
param.ledstate = kbd->ledflagstate;
atomic_notifier_call_chain(&keyboard_notifier_list, KBD_POST_KEYSYM, &param);

if (type != KT_SLOCK)
    kbd->slockstate = 0;
Internal utilities
These are small functions used for basic things like predicate tests, e.g. checking if something is an event or not. These functions are "internal" because they are found within the subsystem and are not provided by any external module/subsystem or utility functions of the kernel itself.
is_event_supported
It checks if an event is supported; however, it does not check the
event itself but its values. There is a finite set of events, and each
has bounds on the value ranges it accepts. It first checks that the
value is within the bounds, and then that the corresponding bit is enabled in
bm. Most of the time it is used to check if a key or sub-event (like
the *bit fields in input_dev) is enabled or expected
in a device's input_dev struct.
static inline int is_event_supported(unsigned int code,
                                     unsigned long *bm, unsigned int max)
{
    return code <= max && test_bit(code, bm);
}
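The bounds check and the bitmap sizes mentioned earlier can be verified with a userspace sketch. It assumes hand-written replacements for BITS_TO_LONGS and test_bit; every `SK_*` / `sk_*` name is hypothetical.

```c
#include <limits.h>

#define SK_KEY_MAX 0x2ff
#define SK_KEY_CNT (SK_KEY_MAX + 1)                 /* 768 possible keys */
#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SK_BITS_TO_LONGS(n) (((n) + SK_BITS_PER_LONG - 1) / SK_BITS_PER_LONG)

/* One bit per key code, the same layout as the keybit field. */
static unsigned long sk_keybit[SK_BITS_TO_LONGS(SK_KEY_CNT)];

static void sk_set_bit(unsigned int nr, unsigned long *bm)
{
    bm[nr / SK_BITS_PER_LONG] |= 1UL << (nr % SK_BITS_PER_LONG);
}

static int sk_test_bit(unsigned int nr, const unsigned long *bm)
{
    return (int)((bm[nr / SK_BITS_PER_LONG] >> (nr % SK_BITS_PER_LONG)) & 1UL);
}

/*
 * The bounds check comes first, so && short-circuits and the bitmap is
 * never read past its end for an out-of-range code.
 */
static int sk_is_event_supported(unsigned int code, const unsigned long *bm,
                                 unsigned int max)
{
    return code <= max && sk_test_bit(code, bm);
}
```

The array occupies 96 bytes regardless of the word size: 12 longs of 8 bytes on a 64-bit machine, or 24 longs of 4 bytes on a 32-bit one.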
Notifier and subscribers
Do you remember that I talked about this API before? It is an internal kernel mechanism that allows drivers or subsystems to receive events from another driver or subsystem; that is pretty much what it is.
Well… input and keyboard have two specific usages for this
API: it is only used to notify Braille and text-to-speech devices,
providing feedback about the keys and/or symbols that were sent to
console. Sadly, neither of those devices has support for the
internal key to UTF-8 mechanism that is in the kernel.
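The subscriber idea can be illustrated with a userspace analogue. This is not the kernel notifier API, only a sketch of the pattern under simplified assumptions: a fixed-size list of callbacks called in order, where any callback can stop the chain. All `sk_*` / `SK_*` names and values are hypothetical.

```c
#include <stddef.h>

#define SK_NOTIFY_DONE 0    /* keep calling the remaining subscribers */
#define SK_NOTIFY_STOP 1    /* stop the chain here */
#define SK_MAX_SUBSCRIBERS 4

typedef int sk_notifier_fn(unsigned long event, void *data);

struct sk_chain {
    sk_notifier_fn *cb[SK_MAX_SUBSCRIBERS];
    size_t count;
};

/* Returns 0 on success, -1 when the chain is full. */
static int sk_chain_register(struct sk_chain *c, sk_notifier_fn *fn)
{
    if (c->count == SK_MAX_SUBSCRIBERS)
        return -1;
    c->cb[c->count++] = fn;
    return 0;
}

/* Calls the subscribers in registration order until one asks to stop. */
static int sk_chain_call(const struct sk_chain *c, unsigned long event,
                         void *data)
{
    int rc = SK_NOTIFY_DONE;
    size_t i;

    for (i = 0; i < c->count; i++) {
        rc = c->cb[i](event, data);
        if (rc == SK_NOTIFY_STOP)
            break;
    }
    return rc;
}

/* Example subscriber: counts the events it sees through data. */
static int sk_count_cb(unsigned long event, void *data)
{
    (void)event;
    ++*(int *)data;
    return SK_NOTIFY_DONE;
}

/* Example subscriber: consumes the event so later subscribers never see it. */
static int sk_stop_cb(unsigned long event, void *data)
{
    (void)event;
    (void)data;
    return SK_NOTIFY_STOP;
}
```

This matches the way kbd_keycode reacts to the return code: once a subscriber stops the chain, the caller sees the stop value and abandons the rest of the processing.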
Some thoughts on how to solve my problem
Cannot be solved only in keyboard
keyboard uses uint as the type to send and receive event codes,
which is enough for Unicode. Nonetheless, it only receives 1 byte per
event from input, which is where the problem lies. Thus I have to
modify input to receive up to 4 bytes.
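The "up to 4 bytes" figure comes from UTF-8 itself, which encodes a code point in one to four bytes. The following is a minimal encoder sketch in plain C to make that concrete; `sk_utf8_encode` is a hypothetical helper and is not part of my patch or of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 when the value is
 * not encodable (a surrogate, or above U+10FFFF).
 */
static size_t sk_utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}
```

For instance, U+20AC (the euro sign) becomes the three bytes 0xe2 0x82 0xac, and U+1F600 needs all four bytes: 0xf0 0x9f 0x98 0x80.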
There is a Unicode mechanism built into console, but it is only
for keymaps, not for inputs. That is why I also rule out using
this existing mechanism.
From the lowest level: input
I think that this would be the best solution, since this is where the primitive events are handled and received. After checking the source code I am somewhat sure that this would not break anything.
This approach would be quite easy to use with uinput; it would only
need to enable a new event type bit. What concerns me is how this
is going to work with real keyboards: as far as I know there is no way
to enable that bit in physical keyboards. Drivers should do that, but
it would imply modifying them.
- mod_devicetable.h
- input-event-codes.h
Status
I have a solution, but it only works when evdev is not used,
i.e. when keyboard does not rely on evdev as the main event
manager. Thus, I am debugging and playing around with evdev to make
it work.