Tuesday, November 13, 2007

Unicode: SizeOf is Different than Length

The next product on the Roadmap is Tiburón, which focuses on Unicode. The Delphi Product Roadmap states "Delphi Win32 Unicode...means that the IDE, the VCL, and all types of development should be made fully Unicode-compatible. The standard string in the Delphi language will become a Unicode string, meaning that the IDE, the VCL – that is, the entire product – will be Unicode-based. Developers around the world will be able to develop applications for use in any language using the Unicode standard." So I figured some examples of how to watch out for common pitfalls would be in order.

When checking over your code to make sure it is Unicode-enabled, take a good close look at your calls to Length and SizeOf on Char arrays. For arrays of single-byte characters, SizeOf and Length return the same value, so they've been used interchangeably, but they have different meanings: Length is the number of elements and SizeOf is the number of bytes. Given an array of AnsiChar and an array of WideChar:


var
  AnsiBuffer: array[0..MAX_PATH - 1] of AnsiChar;
  WideBuffer: array[0..MAX_PATH - 1] of WideChar;

Length(AnsiBuffer) and SizeOf(AnsiBuffer) are the same, but in the Unicode world Length(WideBuffer) and SizeOf(WideBuffer) are not: a WideChar occupies two bytes, so SizeOf(WideBuffer) is twice Length(WideBuffer). So be sure you call the correct function in the correct locations. Many functions require the size of the array in bytes, while others expect the count of characters.
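To make the difference concrete, here is a minimal sketch, assuming a Win32 console application that uses the Windows unit. The two calls at the end (GetModuleFileNameW and FillChar) are only illustrative picks of one API that takes a character count and one routine that takes a byte count; they are not taken from any particular code base.

```pascal
program SizeOfVsLength;

{$APPTYPE CONSOLE}

uses
  Windows;

var
  AnsiBuffer: array[0..MAX_PATH - 1] of AnsiChar;
  WideBuffer: array[0..MAX_PATH - 1] of WideChar;
  CharsCopied: DWORD;
begin
  // Length counts elements (characters); SizeOf counts bytes.
  Writeln('Length(AnsiBuffer) = ', Length(AnsiBuffer)); // 260
  Writeln('SizeOf(AnsiBuffer) = ', SizeOf(AnsiBuffer)); // 260
  Writeln('Length(WideBuffer) = ', Length(WideBuffer)); // 260
  Writeln('SizeOf(WideBuffer) = ', SizeOf(WideBuffer)); // 520

  // GetModuleFileNameW takes the buffer size in characters, so pass Length.
  // Passing SizeOf here would claim the buffer is twice as large as it is.
  CharsCopied := GetModuleFileNameW(0, WideBuffer, Length(WideBuffer));
  Writeln('Characters copied: ', CharsCopied);

  // FillChar takes a count in bytes, so pass SizeOf.
  // Passing Length here would clear only half of the buffer.
  FillChar(WideBuffer, SizeOf(WideBuffer), 0);
end.
```

With AnsiChar both calls would have behaved the same whichever function was used, which is exactly why the mix-up tends to go unnoticed until the element type changes.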

18 comments:

Anonymous said...

I think it would be a really good idea for CodeGear to release some technical "white papers" about the transition to Unicode well in advance (now would be fine) of the release of the next Delphi version.
I am guessing that you already have a good idea about how it will end up so some very precise info could be provided.
Also it might be an idea to release a "compiler" that could issue warnings about potential problems as an update to RAD Studio 2007. If you do not I think the transition to the next version will be very slow (and thus sale of upgrades as well).

Anonymous said...

I don't understand it - why the bloody hell can't string be left alone and why can't WideString be used for Unicode stuff? Changing core things is excitement, fun and games, but it breaks an awful lot of things.
I am somewhat annoyed. Use WideString, don't change string! (or at least add compiler directive that allows string work like string should work)

Anonymous said...

I second what Aivars said. New features are great and Unicode is the way, but breaking the old code just does not make sense. Making popular functions behave differently all of a sudden is just so irresponsible.

Chris Bensen said...

Lars,

I believe all of that is coming. I'm unsure of the time frame. I agree with you, this could be a scary transition, which is why I plan to get as much information in your hands as possible. But we are making every effort to make sure upgrading is a smooth process.

Chris Bensen said...

Aivars and Kalis,

AnsiString and WideString will still exist and they will still function exactly as they have since they were introduced. To enable VCL for Unicode something will have to change, because VCL is currently only ANSI. I assure you it is more complicated than "just use WideString". Compiler support is being put in place to smooth things over as much as we can. There are some caveats with popular functions such as SizeOf and Length where they have been misused. They don't work differently; they work exactly as they have since they were introduced. And I plan to point those out as I did with this post.

What sort of existing string functionality are you currently relying on that you have concerns over? Can you post examples?

Anonymous said...

One problem with old code will appear in those cases, where a string is used as a simple way to store bytes 1:1. To read a file from a filestream for example or to store a picture in memory...
If you read .NET or Java boards you will often find that problems are caused because byte[] and string are treated as the same thing, even though these languages were born with Unicode, so these kinds of problems will appear in many old sources.

Anonymous said...

I quickly looked over some stuff I've been working on most recently and these are just couple of examples that will break:

1.
k:= R.GetDataSize('ProtectedPort'+IntToStrEx(f));
SetLength(SName, k);
R.ReadBinaryData('ProtectedPort'+IntToStrEx(f), SName[1], k);
- this will appear to work, but later when SName is processed character by character, errors will occur. And it might be very hard for me to notice the problem unless I go through the whole code.

2. Exactly what you warned about in this post:
S:= FileName;
C:= Length(S);
GetMem(P, 1 + C + 4);
Why am I using Length instead of sizeof? Because I know exactly what Length will return, but I'm not sure if SizeOf counts the bytes that contain length of the string as well or not.

3. Writing to file - the resulting data won't be what's expected if string is secretly holding 2 or more bytes instead of one for each character. It might be less important if my application is the one reading the file.

4. What happens when you read b:= s[f], where b is byte and s[f] contains something larger than 255? Range check error? What happens if b is Integer but is then processed by function that expects values 0-255?

Actually most of these and other examples are similar and they all have something to do with binary data appearing in string at some point. If you have to kill string, the least you could do is add a new type that acts like the old string. Ensure some way that allows using 3rd party units that might use strings as bytearrays (such as compression and encryption units) without fixing or changing them.

Also I wanted to note that I'm all for the Unicode support. I'm from Latvia and I have to deal with at least 3 different languages - Latvian, Russian and English. Full Unicode in Delphi would be a real blessing. I still insist that it should be added on top of current Delphi mechanics, instead of changing things we have been relying on for decades, if you count Pascal.
Add new type (or use WideString), change all standard VCL controls to use the new type, the 3rd party control developers will gladly update their controls, make basic string functions (copy, delete, etc) support the new type as well and that's it. What am I missing here, why can't this be done?

If you haven't completely decided about the full Unicode implementation, feel free to e-mail me to -my name-@gmail.com to discuss this further. I want to get involved somehow.

Anonymous said...

If you need to avoid using Unicode strings, just replace String types with AnsiString now, and your program will work in Delphi 2008 exactly as it did in previous versions of Delphi.

Anonymous said...

Who is going to pay for development hours wasted on fixing the compatibility errors? Who is going to pay for testing? This kind of a change might seem cool for geeky persons and young programmers, but for me as a software business owner and project manager it will be a nightmare. We have like zillion lines of Pascal code in most of our projects, including tons of 3rd party VCL used.

The bottom line is - this kind of change will hurt businesses who have extensive Pascal code base (loyal customers).

Yes, we can choose not to move to Delphi 2008, but then again, how long will the present versions be supported and how long will Delphi 2007 be able to create code that runs on current versions of Windows? I do love Delphi, but can't understand why CodeGear wants us to suffer so much.

Unknown said...

There are such problems in CodeGear RAD Studio 2007 already:

DBCommon.pas

procedure TExprParser.NextToken;
var
  ...
  StrBuf: array[0..255] of WideChar;
  ...
begin
  ...
  if L < SizeOf(StrBuf) then
  begin
    StrBuf[L] := P^;
    Inc(L);
  end;


Link to QA: http://qc.codegear.com/wc/qcmain.aspx?d=52511

Chris Bensen said...

Aivars,

String is not dead. This is very similar to any platform shift. In the move from Delphi 1 to Delphi 2 the old string type became ShortString and string was aliased to AnsiString. String is now going to become UnicodeString and you will be able to modify your code to function as it has in the past. No worries there.

I've been going through a lot of code lately so I'm making lists of problem areas and thought I'd post them here to make the platform shift easier.

Chris Bensen said...

Karlis,

This is a platform shift, so there will be changes. But those changes will be minimal if you don't use string to store data.

Chris Bensen said...

Ray,

Excellent find. We are going through all of our code as we speak.

Anonymous said...

Here's another: WideStrUtils.pas

function EnumWideStringModules(Instance: Longint; Data: Pointer): Boolean;
var
Buffer: array [0..1023] of widechar;
...
SetString(Str, Buffer,
LoadStringW(Instance, Ident, Buffer, sizeof(Buffer)));
...

Anonymous said...

I do not understand the position of Aivars and Karlis. First of all, I'm pretty sure that CodeGear will provide some compatibility switches in the form of compiler options or something like that.

But there is no way into the future that steers clear of Unicode. I guess any developer writing multi-language programs will support me. And I seriously considered moving away from Delphi for this one and only reason - lack of Unicode support.

Actually, there is a question, what the unicode support in D2008 will look like. Is there any official statement that it will be based on WideString? Or perhaps some kind of utf-8 support? That would solve part of compatibility problems - if your strings contain only the 1..127, you will feel no difference.

Chris Bensen said...

Roddy,

Another good find. Thanks!

Chris Bensen said...

Mike,

I can't say exactly what Tiburón will deliver, but I guarantee our customers are one of the first things we are thinking about. If this platform shift isn't fairly straightforward then nobody will upgrade, and who wants that?

Windows is UTF-16 so I think it is safe to assume something with UTF-16 will be supported.

Anonymous said...

Why all this hullabaloo about Unicode?

Why not implement Unicode in a way that does not harass any developer, like how it is implemented in VB 6 by M$?

Every string is stored internally as Unicode and when processed it is automatically converted to ANSI and vice versa.

There are already many VCLs available which support Unicode completely. Why go through all this trouble?
