Does unicode support in the kernel needed?

irvanherz
Member
Posts: 27
Joined: Mon Sep 19, 2016 5:34 am

Does unicode support in the kernel needed?

Post by irvanherz »

I am hesitating over some functions for my kernel because, as far as I have learned, my kernel and the other examples I have seen support ASCII only.
I have now started developing my kernel in C++ again, after a long break and a lot of reading, and I am confused.
Do you think the kernel should have functions that support more than one character encoding (ASCII and UTF), or should it focus on Unicode only?
My confusion comes down to this choice:
- Implement kprintf(AsciiString *string_obj) and something like kprintf_w(Utf8String *string_obj)
OR
- Just implement kprintf(String *string_obj) and say that the default character encoding in the kernel is UTF-8
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Does unicode support in the kernel needed?

Post by Korona »

That depends on what functionality you put into the kernel. For microkernels, the answer would generally be no. Even for monolithic kernels I find it hard to imagine a situation where the kernel needs to interpret incoming Unicode data correctly: file names and similar identifiers are usually treated as opaque byte sequences. They might be encoded in UTF-8, but the kernel does not perform any normalization on them.

I do not think the kernel should ever generate non-ASCII data. Kernel log messages can be in English only. If the user ever gets to see a kernel panic, the language barrier will not be the reason they are unable to fix the problem.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Does unicode support in the kernel needed?

Post by Solar »

Seconded. Any valid (7-bit) ASCII string is also a valid UTF-8 string. As the kernel should never have to bother with what a string actually contains (counting words and the like), you should be safe.

You already have the important thing covered: you know about Unicode, you know that the number of bytes does not equal the number of characters, and so on. So you should be able to avoid the related pitfalls if they come up during the design process.
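
The two points above (ASCII being a subset of UTF-8, and byte count differing from character count) can be illustrated with a minimal sketch. The function names `count_code_points` and `is_ascii` are made up for this example, and the counter assumes its input is already valid UTF-8. Note that it counts code points, which still is not the same thing as user-perceived characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Counts Unicode code points in a UTF-8 string by counting every byte that is
// NOT a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is already valid UTF-8.
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// True if every byte is 7-bit ASCII. Such a string is also valid UTF-8,
// and for it the byte count and the code point count are identical.
constexpr bool is_ascii(std::string_view text) {
    for (unsigned char byte : text) {
        if (byte > 0x7F) {
            return false;
        }
    }
    return true;
}
```

For the string "naïve" encoded as UTF-8 (where ï is the two bytes 0xC3 0xAF), `size()` reports 6 bytes while `count_code_points` reports 5.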
Every good solution is obvious once you've found it.
MichaelFarthing
Member
Posts: 167
Joined: Thu Mar 10, 2016 7:35 am
Location: Lancaster, England, Disunited Kingdom

Re: Does unicode support in the kernel needed?

Post by MichaelFarthing »

A file system might want to normalise file names to be case insensitive and might reasonably also allow multi-script file names. This would require something beyond ASCII. Not all byte sequences are valid UTF-8, and I think invalid ones ought to be checked for. UTF-8 also has some problems, beyond case sensitivity, with alternative representations of some characters (e.g. letters with diacritics). Being a micro-man I don't regard file systems as a legit part of a kernel, but the point remains for the monolithic.

It could be made a user responsibility to do all this (probably with a standard library function), but my inclination would be to make it the file system's responsibility.
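
As a rough illustration of the "not all byte sequences are valid" point, here is a minimal validity check a file system layer might run on incoming names. The function name `is_valid_utf8` is an assumption of this sketch, and it only checks well-formedness; it does no normalisation and no case handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Returns true if `bytes` is well-formed UTF-8: no stray continuation bytes,
// no truncated sequences, no overlong encodings, no UTF-16 surrogates, and
// nothing above U+10FFFF.
bool is_valid_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected after the lead
        std::uint32_t cp = 0;      // code point being assembled
        std::uint32_t min_cp = 0;  // smallest value this length may encode

        if (lead < 0x80)                { i += 1; continue; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or 0xF8..0xFF used as a lead

        if (bytes.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const auto cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}
```

The overlong check matters for file names in particular: the two bytes 0xC0 0xAF would otherwise decode to '/', which a path parser working on raw bytes would never have seen.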
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Does unicode support in the kernel needed?

Post by Solar »

MichaelFarthing wrote: A file system might want to normalise file names...
That is a matter for the file system driver, which I indeed did not consider to be part of the kernel.
MichaelFarthing wrote: This would require something beyond ASCII. Not all byte sequences are valid UTF-8, and I think invalid ones ought to be checked for. UTF-8 also has some problems, beyond case sensitivity, with alternative representations of some characters (e.g. letters with diacritics).
That is not so much a "problem" as simply the issue of normalized versus unnormalized UTF-8.

Personally I would require file names to be presented to kernel system calls as normalized UTF-8, with the normalization done in the (userspace) wrapper for said system call. That would enable you to rely on ready-made third-party software (like ICU) for that task, without having to drag it into kernel space.
zaval
Member
Posts: 660
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut

Re: Does unicode support in the kernel needed?

Post by zaval »

This is a very painful question. For me it's yes and no. No, because I don't see a need to use anything other than Latin letters for the kernel and the system interface for developers/administrators. Anything named (for example registry keys, OS components' file names, those few named internal objects) should be ANSI only. In an international community it's enough to use just one language as the international interface, and one ASCII encoding for it. My system isn't going to name drivers or device objects in Ukrainian or Chinese. It's just overkill.
On the other hand, there is no strict boundary where this administrator/developer area ends and the normal user area begins. A normal user might want to see text in their native language.
I've not decided yet, but I am inclined to have two variants, ANSI and UTF-16. Say, OpenFileA() and OpenFileW().
But it's only that "easy" on paper. It might be that nothing other than the approach taken in Windows (everything inside is represented as UTF-16) is possible.
For example, how do I combine the internal Object Manager's ANSI-encoded namespace with the Unicode file system part? I could put a restriction on the Registry (ANSI only; that could be met, but only for key names!), but I couldn't enforce this at the FS level. Anyway, I am going to limit text usage in the kernel to a minimum, and am thinking about a binary object namespace (GUIDs). So if this combination succeeds, I would only have to deal with Unicode at the FS level. Of course the GUI should be Unicode only.
But any debugging/developer-oriented output from the kernel is ANSI only.
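
One way the two-variant idea could look is sketched below, assuming that the narrow variant accepts 7-bit names only and simply forwards to the wide one. Only the names `OpenFileA` and `OpenFileW` come from the post; `FileHandle`, `g_last_opened`, the dummy return value and the rejection behaviour are all hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical handle type; not taken from any real API.
using FileHandle = int;

// Stand-in for the real open routine: it only records the requested name and
// returns a dummy handle, so the forwarding below can be observed.
inline std::u16string g_last_opened;

FileHandle OpenFileW(std::u16string_view name) {
    g_last_opened.assign(name);
    return 3;  // dummy handle value
}

// The narrow variant accepts 7-bit names only, widens each byte to one UTF-16
// code unit, and forwards to OpenFileW so there is a single implementation.
std::optional<FileHandle> OpenFileA(std::string_view name) {
    std::u16string wide;
    wide.reserve(name.size());
    for (unsigned char byte : name) {
        if (byte > 0x7F) {
            return std::nullopt;  // not 7-bit: refuse instead of guessing a code page
        }
        wide.push_back(static_cast<char16_t>(byte));
    }
    return OpenFileW(wide);
}
```

Widening is trivial only because the input is restricted to 7 bits; any 8-bit code page would need a conversion table.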
ANT - NT-like OS for x64 and arm64.
efify - UEFI for a couple of boards (mips and arm). suspended due to lost of all the target park boards (russians destroyed our town).
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Does unicode support in the kernel needed?

Post by Solar »

You already made the mistake of using "ANSI" and "ASCII" as if the two meant the same thing... ;-)

Anyway. UTF-16 has the issue of endianness to contend with, plus embedded zeroes, and a good deal of storage wasted for the majority of texts. Plus, Microsoft has severely muddied the waters in the past by using the terms "Unicode" and "UTF-16" even for software that really only did UCS-2 (not to mention the chimera that is TSTRING...).

I'd strongly recommend going for UTF-8 throughout, as it's not burdened with endianness, can be handled somewhat comfortably by string classes / functions not even aware of its existence, and has never been entangled in questionable phrasing.

You also don't get into the ugly details of 2-byte vs. 4-byte wchar_t...
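
The endianness and embedded-zero points can be made concrete with a small sketch that encodes a single code point both ways. The helper names `encode_utf8` and `encode_utf16` are invented for this example, and both assume the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Encodes one code point as UTF-8. There is only one possible byte order.
Bytes encode_utf8(std::uint32_t cp) {
    if (cp < 0x80)    return {static_cast<std::uint8_t>(cp)};
    if (cp < 0x800)   return {static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                              static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    if (cp < 0x10000) return {static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                              static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                              static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    return {static_cast<std::uint8_t>(0xF0 | (cp >> 18)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
}

// Encodes one code point as UTF-16 in the requested byte order.
Bytes encode_utf16(std::uint32_t cp, bool big_endian) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Code points above the BMP need a surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    Bytes out;
    for (std::uint16_t unit : units) {
        const auto hi = static_cast<std::uint8_t>(unit >> 8);
        const auto lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (big_endian) { out.push_back(hi); out.push_back(lo); }
        else            { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}
```

For 'A', UTF-8 gives the single byte 0x41, while UTF-16 gives 0x41 0x00 or 0x00 0x41 depending on byte order. Either way the UTF-16 form contains a zero byte, which any code that treats zero as a terminator will trip over.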

UTF-8 Everywhere (Seriously, read it. It's not a rant but a well-sourced discussion on the various Unicode encodings.)
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Does unicode support in the kernel needed?

Post by Solar »

By the way, I have to retract one of my earlier statements. Merely normalizing a Unicode string doesn't help. While re-reading the UTF-8 Everywhere Manifesto, I was reminded of one detail I had forgotten... that different normalized code points can still be semantically identical. The examples from the manifesto are { U+03A9 greek capital letter omega } and { U+2126 ohm sign }...
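
To see what this means for a kernel that treats names as opaque bytes, here is a tiny sketch using the UTF-8 encodings of the two code points named above. The constant and function names are made up for the example.

```cpp
#include <cassert>
#include <string_view>

// UTF-8 encodings of the two code points named in the post.
constexpr std::string_view kGreekOmega = "\xCE\xA9";      // U+03A9
constexpr std::string_view kOhmSign    = "\xE2\x84\xA6";  // U+2126

// A plain byte-wise comparison, which is all that code treating names as
// opaque byte sequences would perform.
constexpr bool same_name_bytes(std::string_view a, std::string_view b) {
    return a == b;
}
```

`same_name_bytes(kGreekOmega, kOhmSign)` is false, so two file names that render identically on screen would be stored as two distinct entries.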
zaval
Member
Posts: 660
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut

Re: Does unicode support in the kernel needed?

Post by zaval »

Yes, I made a mistake using ANSI and ASCII as the same thing; ANSI is an organization, after all, doh. I basically meant the encoding that uses 1-byte numbers up to 127 for the most commonly used symbols. It's called Latin-1 or something, I don't care yet.

As for UTF-16 versus UTF-8: UEFI and Windows use UTF-16, so UTF-8 definitely isn't "everywhere". The endianness problem is a problem not of misinterpretation but of the additional work that might be needed. For example, it might come up on a hypothetical big-endian PPC port of my OS when it reads FS file names. But UTF-8, for anything that doesn't use plain Latin letters, turns into something like video decoding. :lol: It's more work than UTF-16.
The best approach would be picking the best encoding for the particular case and storing info about it. But that's impossible for anything outside of your system, like FSs and the many different formats.

So far, the only thing I am sure of is that, when developing a system, anything for internal use that is not intended to end up in a UI of any kind will be in the good old 1-byte ANSI/ASCII/ISO-Whatever encoding. :)
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Does unicode support in the kernel needed?

Post by OSwhatever »

I made the decision to use UTF-8 everywhere, both for user space services and for some kernel calls but, as it is a microkernel, these are many. For the kernel calls ASCII might have been enough, as strings are only used there for resource handling, but I went with UTF-8 anyway since I had the infrastructure for it. This hasn't been particularly hard for me, I think.

I would turn the question around: can you make an OS that only supports ASCII today? I would say no. For a hobbyist OS maybe, but for a commercial OS, no way.

Another thing I have removed is zero-terminated strings. System calls require a size together with the string data. Zero-terminated strings are one of those historical mistakes that are more persistent than herpes.
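
A minimal sketch of what "a size together with the string data" could look like at a system call boundary follows. The type `SizedString`, the limit `kMaxNameBytes` and the policy of rejecting embedded zero bytes are assumptions of this example, not details from the post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical argument type for a system call: pointer plus explicit length.
struct SizedString {
    const char* data;
    std::size_t size;
};

// Hypothetical upper bound on a resource name, in bytes.
constexpr std::size_t kMaxNameBytes = 255;

// Hypothetical kernel-side check. The kernel never scans for a terminator;
// it looks at exactly `size` bytes and enforces the bound up front.
bool name_is_acceptable(SizedString name) {
    if (name.data == nullptr || name.size == 0 || name.size > kMaxNameBytes) {
        return false;
    }
    // With an explicit size, an embedded zero byte is visible and can be
    // rejected instead of silently truncating the name.
    return std::memchr(name.data, '\0', name.size) == nullptr;
}
```

With a zero-terminated interface the three bytes 'a', 0, 'b' would be read as the one-byte name "a"; with an explicit size the callee sees all three bytes and can decide what to do with them.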

UTF-8 seems to be the way of the future, though. Rust has UTF-8 string handling by default, and in other languages native UTF-8 string classes/implementations are becoming more common.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Does unicode support in the kernel needed?

Post by Brendan »

Hi,
irvanherz wrote:Do you think the kernel should have a function to support more than one type of character encoding (ASCII and UTF), or focus on unicode only?
I think the majority of the OS (VFS, file systems, help system, any logging, all programming languages, all APIs for GUI or command line, ...) should support UTF-8 and nothing else. The only exceptions to this are that a few applications (text editor, web browser) may convert data from other encodings into UTF-8 for compatibility purposes (e.g. in case user opens a file encoded as UTF-16); and code that converts strings into pixel data ("font renderer") may internally convert to UTF-32 if it makes things easier and file formats used for font data may be designed around "UTF-32 indexing".
irvanherz wrote:What I'm confused about seems like this:
- Implement kprintf (AsciiString * string_obj) and something like kprintf_w (Utf8String * string_obj)
OR
- Just implement kprintf (String * string_obj); let's just say the default character encoding in our kernel is UTF-8
As a micro-kernel fan; I'd only ever have an "append string to kernel log"; where anything (in user space) can ask to be notified when kernel log changes, including (e.g.) "kernel log viewer" applications, and including the VFS process (which may write kernel log to disk).

Functions like "printf()" and "kprintf()" are inefficient (require run-time parsing of the format string) and are considerably complex, and are inferior to "string builder" approaches (e.g. "cout" in C++, where you end up with small/simple functions/methods to convert pieces into sub-strings that are concatenated, and where there's no runtime parsing of a format string).

Note that part of the reason for this is atomicity - the ability to build a temporary string from many pieces; and then do "atomic append" or "atomic write" of all the pieces. In some circumstances this is very important. For example, for kernel log (where many CPUs might be adding to the kernel log at the same time) you don't want the log to become a jumbled mess (e.g. one CPU writes "foo" while another writes "bar" and you end up with "fboaor") and you don't want the hassle of explicitly managing a "kernel log lock" (e.g. acquire the lock, then print many lines of "memory map" with many newline characters, then release the lock; to make sure that nothing else adds unrelated lines of stuff in the middle of the memory map), and don't want excessive "kernel log lock contention" (because CPUs are doing extra work converting many pieces while the lock is held instead of doing that work before the lock is acquired).
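
The "build first, then append atomically" idea can be sketched roughly as follows. `KernelLog` and `LogBuilder` are hypothetical names, and `std::mutex` and `std::string` merely stand in for whatever lock and buffer types a real kernel would provide, so this is a user-space illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <string_view>

// Hypothetical kernel log: one lock, held only while copying a finished string.
class KernelLog {
public:
    void append(std::string_view entry) {
        std::lock_guard<std::mutex> guard(lock_);
        text_.append(entry);
    }
    std::string snapshot() const {
        std::lock_guard<std::mutex> guard(lock_);
        return text_;
    }
private:
    mutable std::mutex lock_;
    std::string text_;
};

// Hypothetical builder: each piece is converted by a small function and
// concatenated into a private buffer; no format string is parsed at run time.
class LogBuilder {
public:
    LogBuilder& text(std::string_view s) {
        buffer_.append(s);
        return *this;
    }
    LogBuilder& hex(std::uint64_t value) {
        static constexpr char digits[] = "0123456789abcdef";
        char tmp[16];  // a 64-bit value has at most 16 hex digits
        int n = 0;
        do {
            tmp[n++] = digits[value & 0xF];
            value >>= 4;
        } while (value != 0);
        buffer_.append("0x");
        while (n > 0) buffer_.push_back(tmp[--n]);  // digits were produced in reverse
        return *this;
    }
    // All conversion work is already done; the lock covers a single append.
    void commit(KernelLog& log) {
        log.append(buffer_);
        buffer_.clear();
    }
private:
    std::string buffer_;
};
```

A call such as `LogBuilder().text("region ").hex(0x1000).text("-").hex(0x9fbff).text("\n").commit(log);` does all number formatting outside the lock, and the whole line lands in the log as one piece.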

More notes:
  • For security purposes, you want to ensure that it's impossible for processes to create file names that can't be typed and/or can't be displayed. This means that you can't do things like UTF-8 normalisation (or "UTF-8 canonicalisation") in user-space and then assume that user-space isn't malicious (e.g. and didn't deliberately do it wrong so that all software that does it correctly isn't able to construct a matching file name; and didn't deliberately provide a file name consisting of zero-width spaces or control characters or invalid UTF-8 bytes to prevent the file name from being displayed). For this reason I'd suggest that the VFS layer (which naturally must be "trusted" anyway) is the best place to do sanity checks and things like UTF-8 normalisation/canonicalisation.
  • For compatibility purposes, different ("non-native") file systems have different requirements (case sensitivity, allowed/disallowed characters, character encodings, name lengths, ...). This means that for a good/modular approach (where most of a file system's details are abstracted) there needs to be some cooperation between VFS and file system modules, where the file system code hides differences where possible (and does any conversion from UTF-8 to whatever encoding the file system expects) but the VFS has to be informed of differences that the file system code can't reasonably hide. This cooperation is not easy to design.
  • Case insensitivity is nasty. For example (for compatibility purposes), a different OS that is case sensitive might create files where the only difference between the file names is case (e.g. three files called "FOO", "Foo" and "foo" all in the same directory) and a case insensitive OS will be unable to handle that correctly (will never be able to access some of the files by name). Also note case conversion (converting everything to the same case for case insensitive comparison) is complex and locale dependent (for one example, the result of converting 'i' to upper case depends on whether it's Turkish or not) and is something I'd rather avoid dealing with.
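
The case-collision scenario from the last note can be reproduced with a short sketch. The function names are invented here, and the folding is deliberately ASCII-only, which sidesteps exactly the locale-dependent part the note warns about.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only case folding. Non-ASCII bytes are left untouched; handling them
// properly is where locale dependence (e.g. the Turkish 'i') comes in.
std::string fold_ascii_case(std::string_view name) {
    std::string folded(name);
    for (char& c : folded) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return folded;
}

// Groups directory entries by folded name. Any group with more than one entry
// holds files that a case-insensitive lookup cannot tell apart.
std::map<std::string, std::vector<std::string>>
group_by_folded_name(const std::vector<std::string>& entries) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const std::string& entry : entries) {
        groups[fold_ascii_case(entry)].push_back(entry);
    }
    return groups;
}
```

For a directory containing "FOO", "Foo", "foo" and "bar", the group keyed "foo" ends up with three entries, so a case-insensitive OS could reach at most one of those files by name.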

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
irvanherz
Member
Posts: 27
Joined: Mon Sep 19, 2016 5:34 am

Re: Does unicode support in the kernel needed?

Post by irvanherz »

Korona wrote:I do not think the kernel should ever generate non-ASCII data. Kernel log messages can be in English only. If the user ever gets to see a kernel panic, the language barrier will not be the reason they are unable to fix the problem.
Do not forget about the VFS. In UN*X-like systems, TTYs are built on a VFS foundation. So, is it possible to implement a TTY without Unicode support?
Last edited by irvanherz on Sat Mar 10, 2018 8:24 am, edited 1 time in total.
irvanherz
Member
Posts: 27
Joined: Mon Sep 19, 2016 5:34 am

Re: Does unicode support in the kernel needed?

Post by irvanherz »

Brendan wrote:Hi,
I think the majority of the OS (VFS, file systems, help system, any logging, all programming languages, all APIs for GUI or command line, ...) should support UTF-8 and nothing else. The only exceptions to this are that a few applications (text editor, web browser) may convert data from other encodings into UTF-8 for compatibility purposes (e.g. in case user opens a file encoded as UTF-16); and code that converts strings into pixel data ("font renderer") may internally convert to UTF-32 if it makes things easier and file formats used for font data may be designed around "UTF-32 indexing".
Brendan
OK, that opinion has cleared things up for me.
I plan to manipulate all strings in the kernel with a String object.
So, do you think creating a String class based on UTF-8 is the best way?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Does unicode support in the kernel needed?

Post by Brendan »

Hi,
irvanherz wrote:
Korona wrote:I do not think the kernel should ever generate non-ASCII data. Kernel log messages can be in English only. If the user ever gets to see a kernel panic, the language barrier will not be the reason they are unable to fix the problem.
Do not forget about the VFS. In UN*X-like systems, TTYs are built on a VFS foundation. So, is it possible to implement a TTY without Unicode support?
TTY consumes characters. To be able to consume UTF-8 characters (e.g. generated by applications) a TTY would have to support UTF-8. To be able to consume ASCII characters (e.g. generated from kernel) a TTY wouldn't need to support UTF-8.
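
What "supporting UTF-8" might mean for a TTY input stage is sketched below: bytes arrive in arbitrary chunks, so the decoder has to remember a partially received sequence between calls. The class name `Utf8StreamDecoder` is hypothetical, and the sketch emits U+FFFD for malformed input without checking for overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical TTY input stage. Plain ASCII bytes (such as kernel output)
// pass straight through; multi-byte sequences are assembled across chunks.
class Utf8StreamDecoder {
public:
    // Feeds one chunk; every completed code point is appended to `out`.
    void feed(std::string_view chunk, std::vector<std::uint32_t>& out) {
        for (unsigned char byte : chunk) {
            if (pending_ > 0) {
                if ((byte & 0xC0) == 0x80) {
                    value_ = (value_ << 6) | (byte & 0x3F);
                    if (--pending_ == 0) out.push_back(value_);
                    continue;
                }
                // The sequence was cut short: report it, then fall through
                // and treat the current byte as the start of something new.
                out.push_back(kReplacement);
                pending_ = 0;
            }
            if (byte < 0x80)                { out.push_back(byte); }
            else if ((byte & 0xE0) == 0xC0) { value_ = byte & 0x1F; pending_ = 1; }
            else if ((byte & 0xF0) == 0xE0) { value_ = byte & 0x0F; pending_ = 2; }
            else if ((byte & 0xF8) == 0xF0) { value_ = byte & 0x07; pending_ = 3; }
            else                            { out.push_back(kReplacement); }
        }
    }
private:
    static constexpr std::uint32_t kReplacement = 0xFFFD;
    std::uint32_t value_ = 0;  // code point assembled so far
    int pending_ = 0;          // continuation bytes still expected
};
```

If one write ends with the first two bytes of "€" (0xE2 0x82) and the next begins with 0xAC, the decoder emits U+20AC only after the second call.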

Note that while I mostly agree with Korona; I'd go further (all software that generates text intended for developers or administrators should use English; and all text that is intended for normal users should be "internationalised"). However; "English" doesn't necessarily mean ASCII and can include some "non-ASCII" where appropriate - e.g. © and ™, and things like é where they should occur in English (but often don't); and µs rather than us; and various mathematical signs (× and ÷ rather than * and /), etc.
irvanherz wrote:I plan to manipulate all strings in the kernel with a String object
So, do you think creating a String class based on UTF-8 is the best way?
I'm really the wrong person to answer that; but if you're using C++ anyway (and if it doesn't provide a useful string class in its standard library) then writing your own (or downloading someone else's) would seem to make sense.


Cheers,

Brendan