Chris Bensen: Unicode

Tuesday, September 23, 2008

Delphi 2009 - Unicode in Type Libraries

If you've opened Tools | Options in Delphi 2009 and taken a look at C++ Type Library Options and Delphi Type Library Options, you'll notice the options are much the same as in previous versions, but there are a few additions that might not be all that easy to understand. Here's a screen capture of the two dialogs:

Notice the two UTF8 options:

- Store Unicode data as UTF8 in type library
- Check for UTF8 data in type library

I'll explain what these do, but first we need a bit of back story.

Back when the COM team was working on the new COM features for Tiburon (Delphi 2009 and C++Builder 2009), we found that ICreateTypeLib, ICreateTypeLib2, ICreateTypeInfo and ICreateTypeInfo2 don't actually support Unicode even though all their string parameters are BSTRs. Somewhere in the writing of the .tlb file the data is narrowed and the Unicode data is lost. After some testing with MIDL 5.01.0164, we found that Microsoft had apparently known about this and worked around it by UTF8 encoding the data and then stuffing it into the BSTR. Then we found that the latest version, MIDL 7.00.0500, produced errors when compiling files with Unicode data.

We wanted to support Unicode throughout the product, so we added support for UTF8 data in type libraries. One option writes UTF8 data (Store Unicode data as UTF8 in type library) and the other reads it (Check for UTF8 data in type library). This means you can create Unicode identifiers (functions and classes) for use between Delphi and C++Builder 2009 (and apparently MIDL 5.01.0164), but that's about it. I would suggest staying away from Unicode identifiers in your type libraries.
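To make the workaround a little more concrete, here is a rough sketch of UTF8 encoding an identifier, stuffing the bytes into a wide string, and reading them back. The one-byte-per-WideChar layout and the names StuffUtf8 and UnstuffUtf8 are assumptions made purely for illustration; this is not the actual type library code in Delphi or MIDL. It assumes Delphi 2009 or later.

```pascal
program Utf8InWideStringSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: UTF8 encode S, then widen each byte to one WideChar.
// The one-byte-per-WideChar layout is an assumption, not a documented format.
function StuffUtf8(const S: UnicodeString): WideString;
var
  Utf8: UTF8String;
  I: Integer;
begin
  Utf8 := UTF8Encode(S);
  SetLength(Result, Length(Utf8));
  for I := 1 to Length(Utf8) do
    Result[I] := WideChar(Ord(Utf8[I]));
end;

// Hypothetical helper: narrow each WideChar back to a byte, then decode.
function UnstuffUtf8(const W: WideString): UnicodeString;
var
  Utf8: UTF8String;
  I: Integer;
begin
  SetLength(Utf8, Length(W));
  for I := 1 to Length(W) do
    Utf8[I] := AnsiChar(Byte(Ord(W[I])));
  Result := UTF8ToString(Utf8);
end;

var
  Name, RoundTrip: UnicodeString;
begin
  // U+00F3 is the accented o in the Tiburon code name.
  Name := 'Tibur' + WideChar($00F3) + 'n';
  RoundTrip := UnstuffUtf8(StuffUtf8(Name));
  Writeln('Characters in the identifier: ', Length(Name));
  Writeln('WideChars after stuffing:     ', Length(StuffUtf8(Name)));
  Writeln('Round trip preserved data:    ', RoundTrip = Name);
end.
```

The stuffed string is longer than the identifier whenever a character needs more than one UTF8 byte, which is one reason a reader has to know to check for UTF8 data before using the name.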

Tuesday, January 29, 2008

There Will Not be a Compiler Switch for Unicode

There have been a lot of questions about a compiler switch to change what type string maps to. Just to avoid any confusion, the title of this post says it all, but here are a few more specifics. String will map to UnicodeString. If you need AnsiString you can still use it, but string will not map to AnsiString.
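Here is a small sketch of what that mapping looks like in code, assuming Delphi 2009 or later. The program and variable names are made up for the example; StringElementSize is the System routine that reports the bytes per element of a string.

```pascal
program StringMappingSketch;

{$APPTYPE CONSOLE}

var
  U: string;      // maps to UnicodeString in Delphi 2009 and later
  A: AnsiString;  // still available when ANSI data is required
begin
  U := 'Delphi';
  // Explicit conversion; characters outside the ANSI code page may be lost.
  A := AnsiString(U);
  Writeln('Bytes per element in string:     ', StringElementSize(U));
  Writeln('Bytes per element in AnsiString: ', StringElementSize(A));
end.
```

Code that really needs one-byte characters declares AnsiString explicitly, and everything else picks up Unicode through plain string.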

If you think about it, a switch isn't a viable solution. If there were a switch, there would be three primary ways of doing it:

1. All generated files would contain both ANSI and Unicode code, doubling the size of each file.

2. All source files would be compiled twice to generate ANSI and Unicode versions of all generated files.

3. The string type would have a run-time switch for ANSI or Unicode.

Each of these options has advantages and disadvantages, but the fact is that each of them adds complexity. The simplest solution is ANSI or Unicode, not both. I personally choose Unicode.

Wednesday, January 23, 2008

Unicode: SizeOf(Char) and Sizeof(Byte)

New in Tiburon, SizeOf(Char) will not equal SizeOf(Byte). This means that any pointer arithmetic currently being done by casting something to a PChar should be changed to use a PByte. This requires a change in the language, because current versions don't allow pointer arithmetic on PByte, and it is needed because Char will be mapped to WideChar instead of AnsiChar.

So if you are going through your code now to get it ready for Unicode, I suggest adding {$IFDEF UNICODE} or something equivalent around the code that is ANSI only so you can test it and mark it to be looked at later.

I'm thinking of creating a Unicode FAQ where I gather up all the Unicode information into one location. Would that be useful for everyone?
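As a rough illustration of both suggestions, the sketch below steps through a buffer with a PByte and uses {$IFDEF UNICODE} to mark which character size is in effect. The helper name ByteAt and the sample buffer are assumptions for the example, it uses Inc rather than the + operator to keep the pointer step explicit, and it assumes Delphi 2009 or later on a little-endian (x86) target.

```pascal
program ByteOffsetSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: read the byte located Offset bytes into Buffer.
function ByteAt(const Buffer; Offset: Integer): Byte;
var
  P: PByte;
begin
  P := PByte(@Buffer);
  Inc(P, Offset); // a PByte always advances one byte per step
  Result := P^;
end;

var
  Wide: array[0..1] of WideChar;
begin
  Wide[0] := 'A';
  Wide[1] := 'B';
  // Offset 2 is the first byte of the second WideChar.
  Writeln('Byte at offset 2: ', ByteAt(Wide, 2));
{$IFDEF UNICODE}
  // Char is WideChar here, so a PChar would step SizeOf(Char) bytes at a time.
  Writeln('Unicode build, SizeOf(Char) = ', SizeOf(Char));
{$ELSE}
  // ANSI only: PChar and PByte arithmetic still agree here.
  Writeln('ANSI build, SizeOf(Char) = ', SizeOf(Char));
{$ENDIF}
end.
```

Had the helper cast to PChar instead, the same offset would land on a different byte once Char becomes two bytes wide, which is exactly the kind of code worth marking for review.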

Thursday, November 15, 2007

Unicode: SizeOf is Different than Length Part II

One thing I forgot to mention in my last Unicode post about SizeOf and Length: you can correctly specify the size of a buffer in bytes in two ways:


var
  ByteCount: Integer;
  Buffer: array[0..255] of Char;
begin
  ByteCount := SizeOf(Buffer);                // Version 1
  ByteCount := Length(Buffer) * SizeOf(Char); // Version 2
  ByteCount := Length(Buffer) * SizeOf(Buffer[0]); // Version 2, tied to the buffer's own element type
end;

I suggest just going with the first version.

Tuesday, November 13, 2007

Unicode: SizeOf is Different than Length

The next product on the Roadmap is Tiburón, which focuses on Unicode. The Delphi Product Roadmap states "Delphi Win32 Unicode...means that the IDE, the VCL, and all types of development should be made fully Unicode-compatible. The standard string in the Delphi language will become a Unicode string, meaning that the IDE, the VCL – that is, the entire product – will be Unicode-based. Developers around the world will be able to develop applications for use in any language using the Unicode standard." So I figured some examples of how to watch out for common pitfalls would be in order.

When checking over your code to make sure it is Unicode enabled, take a good close look at your calls to Length and SizeOf on Char arrays. With one-byte characters, SizeOf and Length return the same value, so they've been used interchangeably, but they have different meanings: SizeOf returns the size in bytes and Length returns the number of elements. Given an array of AnsiChar and an array of WideChar:


var
  AnsiBuffer: array[0..MAX_PATH - 1] of AnsiChar;
  WideBuffer: array[0..MAX_PATH - 1] of WideChar;

Length(AnsiBuffer) and SizeOf(AnsiBuffer) are the same, but in the Unicode world Length(WideBuffer) and SizeOf(WideBuffer) are not: SizeOf(WideBuffer) is twice Length(WideBuffer). So be sure you call the correct function in the correct locations. Many functions require the size of the array in bytes, while others expect the count of characters.
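As a quick sketch of the difference, FillChar is an example of a routine that wants a byte count, while the Windows API GetModuleFileNameW wants a character count. The choice of these two routines is mine for illustration, and the program assumes the Windows unit on a Win32 target.

```pascal
program SizeVersusLengthSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

var
  WideBuffer: array[0..MAX_PATH - 1] of WideChar;
  CharsCopied: DWORD;
begin
  // FillChar takes a count of bytes, so SizeOf is the right call.
  FillChar(WideBuffer, SizeOf(WideBuffer), 0);

  // GetModuleFileNameW takes the buffer size in characters, so Length is right.
  CharsCopied := GetModuleFileNameW(0, WideBuffer, Length(WideBuffer));

  Writeln('Length(WideBuffer) in characters: ', Length(WideBuffer));
  Writeln('SizeOf(WideBuffer) in bytes:      ', SizeOf(WideBuffer));
  Writeln('Characters copied by the API:     ', CharsCopied);
end.
```

Swapping the two calls is where the trouble starts: passing SizeOf(WideBuffer) to the API claims room for twice as many characters as the buffer holds, and passing Length(WideBuffer) to FillChar clears only half of it.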
