Software Testing Blog

Why does C# use UTF-16 for strings?

Today on ATBG, a language design question from reader Filipe, who asks:

Why does C# use UTF-16 as the default encoding for strings instead of the more compact UTF-8 or the fixed-width UTF-32?


Good question. First off I need to make sure that all readers understand what these different string formats are. Start by reading Joel’s article about character sets if you’re not clear on why there are different string encodings in the first place. I’ll wait.

.
.
.
.

Welcome back.

Now you have some context to understand Filipe’s question. Some Unicode formats are very compact: UTF-8 uses one byte per character for the sorts of strings you run into in American programs, and most strings are pretty short even if they contain characters more commonly seen in European or Asian locales. However, the downside is that it is difficult to index into a string to find an individual character, because the character width is not a fixed number of bytes. Some formats waste a lot of space: UTF-32 uses four bytes per character regardless, so a UTF-32 string can be four times larger than the equivalent UTF-8 string, but the character width is constant.
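To make the size tradeoff concrete, here is a small sketch (my illustration, with arbitrary sample text) that uses .NET’s System.Text.Encoding to count how many bytes the same string occupies in each encoding:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // Mostly-ASCII English text: UTF-8 is the most compact.
        string english = "Hello, world";
        // Text outside the ASCII range costs more in UTF-8.
        string greek = "καλημέρα κόσμε";

        foreach (string s in new[] { english, greek })
        {
            Console.WriteLine(
                $"\"{s}\": UTF-8 = {Encoding.UTF8.GetByteCount(s)} bytes, " +
                $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} bytes, " +
                $"UTF-32 = {Encoding.UTF32.GetByteCount(s)} bytes");
        }
        // "Hello, world":   UTF-8 = 12, UTF-16 = 24, UTF-32 = 48
        // "καλημέρα κόσμε": UTF-8 = 27, UTF-16 = 28, UTF-32 = 56
    }
}
```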

UTF-16, which is the string format that C# uses, appears to be the worst of both worlds. It is not fixed-width: the “surrogate pair” characters require two 16-bit words for one character, while most characters require a single 16-bit word. But neither is it compact: a typical UTF-16 string is twice the size of a typical UTF-8 string. Why does C# use this format?
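Before we get to the answer, here is a quick sketch of what “not fixed-width” means in practice. U+1D11E MUSICAL SYMBOL G CLEF is just a convenient example of a character outside the Basic Multilingual Plane, which UTF-16 must encode as a surrogate pair:

```csharp
using System;

class SurrogatePairs
{
    static void Main()
    {
        // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
        // encodes it as a surrogate pair: two 16-bit code units.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                    // 2 (two 16-bit words)
        Console.WriteLine(char.IsHighSurrogate(clef[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));   // True

        // The two code units combine back into a single code point.
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X")); // 1D11E
    }
}
```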

Let’s go back to 1993, when I started at Microsoft as an intern on the Visual Basic team. Windows 95 was still called Chicago. This was well before the Windows operating system had a lot of Unicode support built in, and there were still different versions of Windows for every locale. My job, amongst other things, was to keep the Korean and Japanese Windows machines in the build lab running so that we could test Visual Basic on them.

Speaking of which: the first product at Microsoft that was fully Unicode internally, so that the same code could run on any localized operating system, was Visual Basic; this effort was well underway when I arrived. The program manager for this effort had a sign on his door that said ENGLISH IS JUST ANOTHER LANGUAGE. That is of course a commonplace attitude now but for Microsoft in the early 1990s this was cutting edge. No one at Microsoft had ever attempted to write a single massive executable that worked everywhere in the world. (UPDATE: Long time Microsoftie Larry Osterman has pointed out to me that NT supported UCS-2 in 1991, so I might be misremembering whether or not VB was the first Microsoft product to ship the same executable worldwide. It was certainly among the first.)

The Visual Basic team created a string format called BSTR, for “Basic String”. A BSTR is a length-prefixed UCS-2 string allocated by the BSTR allocator. The decision was that it was better to waste the space and have the fixed width than to use UTF-8, which is more compact but is hard to index into. Compatibility with the aforementioned version of NT was likely also a factor. As the intern who, among other things, was given the vexing task of fixing the bugs in the non-Unicode DBCS Far East string libraries that Visual Basic used on Windows 3.1, I heartily approved of this choice.
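As it happens, you can still observe that length prefix from managed code today. This is purely an illustrative sketch using the standard BSTR marshalling helpers, and it assumes the classic BSTR layout: a four-byte byte count stored immediately before the UTF-16 character data:

```csharp
using System;
using System.Runtime.InteropServices;

class BstrLayout
{
    static void Main()
    {
        string s = "BSTR";

        // Marshal.StringToBSTR allocates via the COM BSTR allocator.
        IntPtr bstr = Marshal.StringToBSTR(s);
        try
        {
            // The pointer points at the character data; the byte count
            // lives in the four bytes immediately before it.
            int byteLength = Marshal.ReadInt32(bstr, -4);
            Console.WriteLine(byteLength);              // 8 (four 16-bit words)
            Console.WriteLine(Marshal.ReadInt16(bstr)); // 66, i.e. 'B'
        }
        finally
        {
            Marshal.FreeBSTR(bstr);
        }
    }
}
```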

Wait a minute, what on earth is UCS-2? It is a Unicode encoding consisting of 16-bit words, but with no surrogate pairs. UCS-2 is fixed-width; there are no characters that consist of two 16-bit words, as there are in UTF-16.

But… how on earth did that work? There are more than two to the sixteen Unicode characters! Well, it was 1993! UTF-16 was not invented until 1996.

So Visual Basic used UCS-2. OLE Automation, the COM technology that lets VB talk to components, of course also used the BSTR format.

Then UTF-16 was invented. Because UTF-16 is backward compatible with UCS-2, VB and OLE Automation got upgraded to UTF-16 “for free” a few years later.

When the .NET runtime was invented a few years after that, it of course used length-prefixed UTF-16 strings, to be compatible with all the existing COM / Automation / VB code out there.

C# is of course compatible with the .NET runtime.

So there you go: C# uses length-prefixed UTF-16 strings in 2014 because Visual Basic used length-prefixed UCS-2 BSTRs in 1993. Obviously!

So how then does C# deal with the fact that there are strings where some characters take a single 16 bit word and some take two?

It doesn’t. It ignores the problem, just as it ignores the fact that it is legal in UTF-16 to have a character and its combining accent marks in two adjacent 16-bit words. And in fact that’s true in UTF-32 too: a logical character can take up two 32-bit words because the accent is in one word and the base character is in the other, so the idea that UTF-32 is fixed-width is, in general, rather suspect.
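For example (my illustration, not from the original article): the letter “e” followed by a combining acute accent is one visible character but two 16-bit words, and it is still two 32-bit words when re-encoded as UTF-32; only normalization to a precomposed form collapses it:

```csharp
using System;
using System.Text;

class CombiningMarks
{
    static void Main()
    {
        // LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
        // one visible character, two code points, two 16-bit words.
        string decomposed = "e\u0301";
        Console.WriteLine(decomposed);                               // é
        Console.WriteLine(decomposed.Length);                        // 2
        Console.WriteLine(Encoding.UTF32.GetByteCount(decomposed));  // 8 (two 32-bit words)

        // Normalization Form C combines them into the precomposed U+00E9.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(composed.Length);                          // 1
    }
}
```

And normalization is not a general fix; plenty of combining sequences have no precomposed form.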

Strings with surrogate pairs are rare in the line-of-business programs that C# developers typically write, as are combining marks. If you have a string that is full of surrogate pairs or combining marks or any other such thing, C# doesn’t care one bit. If you ask for the length of the string, you get the number of 16-bit words in the string, not the number of logical characters. If you need to deal with strings measured in terms of logical characters rather than 16-bit words, you’ll have to call methods specifically designed to take those cases into account.
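One such facility in the framework is System.Globalization.StringInfo, which counts “text elements” (logical characters) rather than 16-bit words; here is a short sketch of mine:

```csharp
using System;
using System.Globalization;

class LogicalLength
{
    static void Main()
    {
        // A G clef (surrogate pair) plus an 'e' with a combining acute accent:
        // two logical characters, but four 16-bit words.
        string s = "\U0001D11E" + "e\u0301";

        Console.WriteLine(s.Length);                               // 4
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2

        // Walk the string one logical character at a time.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            Console.WriteLine(e.GetTextElement().Length);
        // Prints 2 then 2: each text element spans two 16-bit words.
    }
}
```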

By ignoring the problem, C# gets something close to the best of both worlds: the string is still reasonably compact at 16 bits per character, and for practical purposes the width is fixed. The price you pay, of course, is that if you do care about the problems being ignored, the CLR and C# work against you to some extent.


As always, if you have questions about a bug you’ve found in a C, C++, C# or Java program that you think would make a good episode of ATBG, please send your question along with a small reproducer of the problem to TheBugGuys@Coverity.com. We cannot promise to answer every question or solve every problem, but we’ll take a selection of the best questions that we can answer and address them on the dev testing blog every couple of weeks.

  1. Is it also the reason why Windows uses UTF-16?

    The last part of the article is a bit scary, but it is indeed a very practical solution, until the first time you run into trouble with “strange” strings…

  2. @Qb: If I remember my MS history correctly[0], Windows uses UTF-16 because MS decided to build Unicode into Windows NT from the ground up when they started NT development in 1989, and UCS-2 was about the only game in town for Unicode encodings at that time.[1] UTF-8 didn’t get invented on a napkin in a diner by Thompson and Pike until late 1992. Therefore, they used Unicode as UCS-2 throughout[2] with the creation of the *W() functions in the core, and the *A() wrappers around them for legacy-encoding back-compat purposes.

    [0] And I might not, so confirmation of all this would be appreciated. I wasn’t there; I’ve just picked up and aggregated tales from books, articles, and blog posts over the years.
    [1] This is also likely to be the origin of “Unicode” being an MS synonym for UCS-2/UTF-16, because when they started using it, it pretty much was.
    [2] Which later turned into UTF-16, as described above.

      1. Heh. I think the blog posts I’ve picked most of my MS history up from were probably predominantly Larry Osterman’s, with a smattering of Raymond Chen thrown in. Those guys are awesome.

  3. That’s not it. Converting to a COM string is still a heavy conversion: different heap and different format. The CLR uses UTF-16 because it runs on a UTF-16 operating system. P/Invoke is the big dog, pervasive in the framework, much more so than COM. System.String is even zero-terminated to make it fast.

  4. There’s more to it than just that. There was something of a move towards use of UTF-8 in the late 90s in some cases as HTTP was growing more popular (and normalized URLs are, I believe, supposed to be %-encoded UTF-8 Unicode text). But analysis was done, and while UCS-2 was fatter for people who could otherwise live happily within ISO Latin-1 (ISO-8859-1), for the rest of the world it was either a wash, or, for the Far East, UTF-8 was actually worse.

  5. Actually, I’m confused about which team created the BSTR format. You say the VB team, but it seems to me more likely to have been the COM team.

    If the VB team created it, that would seem to imply that the COM team was a subset of the VB team; or that VB predated COM and COM borrowed the BSTR format later – but then when they added IDispatch on top of COM later still, that made having used BSTR (rather than any other string format) incredibly prescient/fortunate.

    Or were COM/VB developed in tandem maybe?

    1. Good question. COM is a binary standard for calling methods on objects. OLE is a standard mechanism built on top of COM describing how to “link” and “embed” objects, for example, a Word document with an Excel table embedded inside it. OLE Automation is a standard mechanism for late-bound invocation of COM objects used in OLE, for the purpose of automating such objects via dynamic programming languages, VB being the canonical example. As you note, IDispatch is the main COM interface used in OLE Automation. OLE Automation was created by a two-person subteam of the Visual Basic for Applications team, which was itself a subteam of the Visual Basic team, and they drove the adoption of the length-prefixed BSTR format as the standard string format for Visual Basic.

  6. Is it time for a true “text” type? One whose API deals with codepoints? One that isn’t an encoding of a chunk of text, but is a chunk of text? One where the “Substring” method cannot possibly, ever, cut an encoding unit in two?

    I sure think it is.

    1. As far as I know, Python (3) does that. The underlying actual UTF is chosen based on the string contents, but for Unicode strings you don’t have access to that anyway. Everything operates on code points at that level.
