Non null terminated strings.

sandras · Post by **sandras** » Fri May 04, 2012 5:33 am

Some time ago I was reading about null terminated strings and how they are inefficient (think str_len()). The suggested alternative approach was to store the length of a string in the first 16 bits of it. I did something similar in my kernel, i used 32-bit unsigned integer as the first argument to my prnt_str().

The articles (it was long ago enough for me now to not remember what articles) were entirely negative towards null terminated strings and positive towards their suggested alternative. Do you know of flaws in the suggested alternative or my way of doing it (I'm not interested in standards compliance)? What are they?

bluemoon · Post by **bluemoon** » Fri May 04, 2012 5:40 am

That's pascal string, disadvantage is:
1. string length is restricted by to prefix size. ie 1 byte prefix you have string < 256 (wide) characters; 2 bytes prefix would occupy more space than null-terminated string
2. fixed character width, can't be very flexible to implement utf8 on top of it.

advantage is quick, but watch out, you still need to count the character when you create/modify the string; so the total consumed time may not differ much. (it just a way of pre-calulate the strlen)

Solar · Post by **Solar** » Fri May 04, 2012 5:46 am

You also have to be careful when crossing API boundaries. If the kernel is using Pascal strings, a C userspace string (which is inherently null-terminated) would have to be converted at some point - or userspace would be faced with two different kinds of strings (the native C ones, and the kernel ones). And we all know how great that is for a feature (CString, anyone?).

If you manage to keep that abstraction in the kernel only, you won't have the problem - but you won't have much of an advantage either.

iansjack · Post by **iansjack** » Fri May 04, 2012 6:07 am

Null-terminated strings are great when you want to scan a string in some way - no need to check the length of the string to know when you have reached the end, just keep going till you hit a zero. And when it comes to dividing a string in two, say at some separator value, it's great. Just scan the string till you find the value and pop a zero in there instead. Now think of the processes to be followed with Pascal type strings.

Null-terminated strings are ideal when you are looking at relatively low-level operations.

sandras · Post by **sandras** » Fri May 04, 2012 7:20 am

bluemoon wrote:That's pascal string, disadvantage is:
1. string length is restricted by to prefix size. ie 1 byte prefix you have string < 256 (wide) characters; 2 bytes prefix would occupy more space than null-terminated string

That's a valid point, but I'm not quite worried about additional three bytes per string, given the speed advantage. And with 32-bit prefix, you can have a string as long as 4gb. I use 32-bit prefix on 32-bit machines (on 64-bit machines I'd use 64-bit prefix, etc. etc.), since I heard address bus sized integers are fastest on a particular machine. Not that speed is an issue now, but it may be later. After all, I picked this sort of strings implementation because of speed advantages.

bluemoon wrote:2. fixed character width, can't be very flexible to implement utf8 on top of it.

Valid too, and about this one I'm worried. I'd like to implement UTF-8 in the future, not only I think it's a good standard, but also widely used one (I'd have to correct myself - I don't care about standards compliance, where it gets in the way). But again, the strings can be as big as the entire memory, that is literally more than enough. Though I'm not sure if the prefix would indicate how many bytes or how many characters the string has in UTF-8. Maybe a solution would suggest itself, when implementing the standard.

bluemoon wrote:advantage is quick, but watch out, you still need to count the character when you create/modify the string; so the total consumed time may not differ much. (it just a way of pre-calulate the strlen)

Maybe, if I create a string byte by byte, but adding two strings, It'd be simpler and faster then with null terminated strings to count the length:

Code: Select all

void str_cat(unsigned int str_1_len, char* str_1, unsigned int str_2_len, char* str_2, unsigned int* ret_str_len, char* buf)
{
	str_prefix = str_1_len + str_2_len;

	...
}

Also it's a bit inconvenient to count the bytes in a string, when I do:

Code: Select all

prnt_str(what?, "a veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery long string : p")

Or am I being a noob and don't know an automated way to do it?

Solar wrote:a C userspace string (which is inherently null-terminated)

How come is it inherently null-terminated? As far as I understand, I can build a library, which uses strings with prefix, can't I? Just like I build a kernel.

Solar wrote:You also have to be careful when crossing API boundaries. If the kernel is using Pascal strings, a C userspace string (which is inherently null-terminated) would have to be converted at some point - or userspace would be faced with two different kinds of strings (the native C ones, and the kernel ones). And we all know how great that is for a feature (CString, anyone?).

If you manage to keep that abstraction in the kernel only, you won't have the problem - but you won't have much of an advantage either.

Having kernel have one kind of strings and user space have either different kind of strings and convert to kernel type of strings, or have user space have two kind of strings is nasty, to say the least. But how come C user space strings are inherently null terminated? Are you talking about existing C libraries? I intend to build my own, so if that's the problem, there's no problem. : ) I want to find out if I have an advantage with non null terminated strings and if so, extend it to user space, not only kernel space.

iansjack wrote:And when it comes to dividing a string in two, say at some separator value, it's great. Just scan the string till you find the value and pop a zero in there instead. Now think of the processes to be followed with Pascal type strings.

Null-terminated strings are ideal when you are looking at relatively low-level operations.

True, but in other places I believe, there are advantages for the alternative way. Like what I demonstrated with str_cat() example, and ofc str_len().

Also, wikipedia says null terminated strings have speed issues, and the are more sources. Besides, non null terminated strings tend to be less error prone, they say.

My guess now is that in some ways, one is better, and in some, the other. It would be best to optimize for what is most oftenly used. I think I'll try to implement all the string operations i can think of (I haven't yet, cus I didn't need them), and see where on type of string or the other would be advantageous. But that does not mean it will be best for performance, as some functions may be used more than the others. For example I've read, that str_len() is very often used. I should find the original article, that spawned this devil seed in me and re-read it, but I'm kinda in a hurry now.

bluemoon · Post by **bluemoon** » Fri May 04, 2012 7:27 am

Sandras wrote:That's a valid point, but I'm not quite worried about additional three bytes per string, given the speed advantage.

Did you measure the actual "speed advantage"?
Pascal string do not eliminate a strlen() process, it shift the process to the point when you create/modify strings.
However, depends on the usage, string creation/modification may not matter at all.
Furthermore, it is slower on some sort of string operation, which you may actually use quite frequently.

bluemoon · Post by **bluemoon** » Fri May 04, 2012 7:39 am

By the way, if you care speed, strlen() is not the main cause, it is very fast.
strcmp is the slow one, so if you accept pascal string which break compatibility anyway, I suggest to go further:

Code: Select all

[len][hash][payload]

you may use elf hash or just add-checksum, and strcmp() can be very fast. I do this on my dictionary class for very large data set.

On the other hand, you may also optimize c-string without breaking compatibility, for example, system generated string are padded with 8 zeros, so you can have faster string function when processing strings created by your kernel, while application can view the string without problem.

Combuster · Post by **Combuster** » Fri May 04, 2012 7:47 am

If you come to the point where you do quite a bit of stringbuilding, changing the string container might be quite effective. Using simple length-prefixed strings allows you to turn the datatype into actual byte strings and have them capable of storing null bytes, which is a handy thing when it comes to failsafing text-oriented network protocols. Concatenation would not need strlen but only a memory allocation. Reverse scans would increase performance.

You can even take it further and enter the string descriptor which contains length and a pointer to a byte array. With that you can splice read-only strings without a single copy or even memory allocation by keeping the descriptors on the stack. Especially useful for parsing or popping out chunks from a TCP connection.

OSwhatever · Post by **OSwhatever** » Fri May 04, 2012 7:59 am

Hunting speed in null terminated strings in the kernel is really more job than it pays off. Strings aren't used that much in a kernel and by far not a bottleneck. You can choose to implement your strings in the way you want in the kernel, however usually you want to use null terminated UTF-8 strings in the API as that is what is the easiest to pass from user to kernel space as it is only one pointer.

qw · Post by qw » Fri May 04, 2012 8:58 am

Considering UTF-8, do you store the number of characters or the number of bytes?

AndrewAPrice · Post by **AndrewAPrice** » Fri May 04, 2012 9:35 am

Hobbes wrote:Considering UTF-8, do you store the number of characters or the number of bytes?

I would go for the number of bytes - it lets you know where the array ends and makes copying, comparison, concatenation, faster.

AndrewAPrice · Post by **AndrewAPrice** » Fri May 04, 2012 1:17 pm

berkus wrote:This way you retain the speed (and it DOES slow down a lot when you have to do a lot of comparisons with a lot of long long null-terminated strings, unless you use something like KMP), and also keep the API flexible (just be aware that going from "rich" strings to C strings is easy, other way around may be not.

Well it depends on the language. For example:

Code: Select all

std::string a;
char *b;

// std::string to c:
b = a.c_str();

// c to std::string
b = a;

std::string, as far as I the implementations I've seen go, contains both a pointer to a null terminated array of characters and a size integer (some even have a maximum array capacity integer).

brain · Post by **brain** » Fri May 04, 2012 1:50 pm

If I remember correctly it is standard for std::string::length() to reference the integer length value you mentioned making it O(1) compared to O(n) for strlen on a standard char*.

OSwhatever · Post by **OSwhatever** » Fri May 04, 2012 2:51 pm

I have a string class in my kernel that stores the string together with a size field, not a big deal. Getting the length is O(1). Speed was not the reason I did though, it was copy on write that made me want to use a string class.

Solar · Post by **Solar** » Fri May 04, 2012 11:54 pm

Sandras wrote:
Solar wrote:a C userspace string (which is inherently null-terminated)
How come is it inherently null-terminated? As far as I understand, I can build a library, which uses strings with prefix, can't I? Just like I build a kernel.

Because the language standard defines a string as such:

ISO/IEC 9899:1999 wrote:A string is a contiguous sequence of characters terminated by and including the first null character.

I.e., you have the zero-terminated user-space string anyway. Any "custom" string type (or class) will either be on top of that (giving you the problem of different string types), or replacing it (making your library inherently non-standard-compliant, with all the issues that implies).

In a C/C++ environment, I simply wouldn't bother, and use the zero-terminated variant. Perhaps add a couple of functions that make handling those easier - like, strcat() returning a pointer to the end of the string. But a "new" string type would be something for a different language altogether.

OSDev.org

Non null terminated strings.

Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.

Re: Non null terminated strings.