Tuesday, January 29, 2008

There Will Not be a Compiler Switch for Unicode

There have been a lot of questions about a compiler switch to change what type string maps to. Just to avoid any confusion, the title of this post says it all, but here are a few more specifics. String will map to UnicodeString. If you need AnsiString you can still use it, but string will not map to AnsiString.

If you think about it, a switch isn't a viable solution. If there were a switch, there would be three primary ways of doing it:

1. All generated files would contain both ANSI and Unicode code, doubling the size of each file.

2. All source files would be compiled twice to generate ANSI and Unicode versions of all generated files.

3. The string type would have a run-time switch for ANSI or Unicode.

Each of these options has advantages and disadvantages, but the fact is each of these options adds complexity. The simplest solution is ANSI or Unicode. I personally choose Unicode.
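
As a rough illustration of the mapping described above, the sketch below assumes a Unicode-enabled Delphi compiler in which string is UnicodeString. The program and variable names are made up for illustration. It only shows that AnsiString remains available by name, and that narrowing to it is the direction in which characters can be lost.

```delphi
program StringMappingSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

var
  U: string;      // assumed to be UnicodeString on a Unicode-enabled compiler
  A: AnsiString;  // still available, but only when asked for by name

begin
  U := 'Delphi';

  // Narrowing: characters outside the active ANSI code page cannot survive,
  // so the cast makes the potentially lossy step explicit.
  A := AnsiString(U);

  // Widening: every ANSI character has a Unicode equivalent.
  U := string(A);

  Writeln(U);
  Writeln(A);
end.
```

Code that genuinely needs single-byte strings would declare AnsiString explicitly, as A does here, rather than rely on what string happens to mean.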

22 comments:

lgallion said...

I totally understand your reasoning from a programming elegance standpoint. The problem for me is that no Win 9x compatibility is a deal breaker while native Unicode support isn't a deal maker (currently I don't need it at all, and my projected needs could probably be handled by the numerous Unicode components available, though I know they are not a full solution). I just worry that numerous developers are in my situation and CodeGear may suffer an upgrade drought because of it. Here is my vote for dual versions for a few years (until 9x is truly dead), if a switch is not an option.

Chris Bensen said...

grudge,

I also understand your position. Trust me, the decision to go 100% Unicode didn't come easy.

Anonymous said...

I find it somewhat disturbing that Windows 9x is not considered dead. I personally would hate to have to go back to using Windows 95, especially because you needed to reinstall it every year or so to unbreak it, which would probably mean over a dozen reinstalls by now. I think I would have upgraded to Linux by now; Wine is pretty good for most applications. Fortunately I don't think there are any of our clients left still running Windows 9x (or worse, Me).

x-ray said...

AFAIK, 9x/ME support has been gone since D2005, so I think incompatibility with 9x/ME is a moot point.

Anonymous said...

I don't get it. You missed a 4th option: Extend String RTTI to include encoding information.

Then String could alias to UnicodeString or AnsiString on the basis of a compiler switch without losing information and ensuring that the necessary information is contained in the string to facilitate the required RTL support/compiler warnings.

This was discussed at some length in the NGs - when first suggested NH posted a response that could (*could*) have been interpreted (and I did) as suggesting this was the way things were headed.

It certainly made - and makes - a lot of sense.

What happened?

Did NH post a vague, ambiguous response that was open to misinterpretation and then decline to clarify or correct those who had misinterpreted it, or was this the way things WERE going to go, but something has changed in the meantime?

Xepol said...

Actually, if you left the mapping of string to unicodestring or ansistring at compile time, then you would not need anything extra.

The units would internally be marked for AnsiString or UnicodeString, and then you would match against that. Either it matches or it doesn't. Then treat mismatches just like byte/word/cardinal sizing issues. Loss of detail requires extra steps. Passing an AnsiString to UnicodeString functions would transparently upgrade, and you couldn't interchange them on VAR, OUT.

I had assumed that this is how the switch that moves between shortstring and ansistring worked.

Xepol said...

Of course, without the switch, people will actually have to put some effort into making the transition.

Probably for the best in the long run to not have the switch even if it is technically feasible.

Anonymous said...

CodeGear customers that use D7/D2007 don't target W32 for fun; it's to reach a user base that's as wide as possible.

Whilst switching from ME to XP in order to use D2007 was something I accepted, losing W9x users in developing countries is more serious.

How do the C++ compilers solve this problem?

Hallvards New Blog said...

I'm actually with Xepol on this one.

Just like the current $H switch does not permute the RTL and VCL, a new string=AnsiString compiler switch should have no effect on the new RTL and VCL that will assume (and declare) that string=UnicodeString.

The new compiler switch would not be generally useful, only on a per-unit basis for existing code that assumes string=AnsiString. If that code depends on RTL/VCL code that now uses Unicode strings all over, the compiler would perform normal AnsiString -> UnicodeString promotions, (probably) warn on lossy UnicodeString -> AnsiString conversions, and generate compile-time errors on var/out/override mismatches.

Since 'string' is a reserved word, a local typedef like:

type
  string = AnsiString;

wouldn't work either, so a compiler switch could come in handy - in the short-term.

stanleyxu said...

It is acceptable if a project can be built with both D2007 and D2008. Is this possible?

Anonymous said...

Hi Chris,

Yes, imho the path is good. But what's really missing (again imho :-) ) to make the transition smoother is a refactoring which will use the chill parser to search and replace different items in the code. For example, 'String' declarations with 'AnsiString', 'PChar' with 'PByte', and so on. In fact you already have this in the form of the Ctrl+Shift+J engine applied to the current selection. What is missing is a dialog to choose what to apply it to: Project, Unit, Selected files, and so on. For more details see QC #56885 and QC #56886.

Chris Bensen said...

alister, In some regions Windows 9x is alive and well.

mike, Windows 9x support is something we don't target, but we try to not break the support we had if we can help it.

Jolyon, I didn't miss that option. UnicodeString and AnsiString share a very similar memory layout with encoding information. Exact implementation details have been vague because it is a work in progress. Any information I have been providing has been vague on details because I don't want to provide you with any false information. I don't know what was said by anyone in newsgroups, but we are not going to implement a string type that can change based on the encoding. This could result in data loss which is unacceptable. One additional thing to keep in mind is VCL is used by Delphi and C++Builder. We need this to be as compatible with both products as possible.

xepol, hallvard, if each unit could define what string is mapped to, then what would function signatures export? A unit-based compiler option would make reading code very difficult since string changes types. The best option is to change your strings to AnsiString if you need them that way and move forward. The change doesn't affect as much as you'd think.

stanleyxu, compiling with an older version of Delphi shouldn't be any more difficult than it has been before. If there are new components, properties or language features then those won't be backward compatible, but I'm sure you can program to the lowest common denominator if that is important.

m. Th, those are some interesting productivity features that should be taken into consideration.

Xepol said...

Actually, my assumption since D2 introduced the ansistring/shortstring compiler flag was that string was no longer an ACTUAL type, but rather just a macro replacement for either ansistring or shortstring based on what the compiler flag was, and that ultimately, ansistring or shortstring was what was really in the metadata.

If you are telling me that string is still an actual type that can be 2 different things in 2 different units and still be the same 'type', then I gotta say someone did something shortsighted and very, very wrong in D2.

Assuming it was implemented as a macro replacement based on a compiler switch to the ACTUAL type, a new flag for unicodestrings would actually be feasible (indeed SIMPLE) now.

Probably wouldn't be too late to bolt it in like that either. I had assumed that INTEGER's flexibility meant it was mapped in a similar way, but I guess I was wrong. You can probably see where it would have been more useful to implement it that way tho, no?

Chris Bensen said...

Xepol, I don't work on the compiler, but I believe it is fairly flexible as you suggest. My point is there is more going on here than a simple type mapping. You have to take into account testing, delivery, VCL consumers, packages, etc. I guess you have to trust us that we want to do the best thing for Delphi, and it is believed that this is the best option. Sure it is flawed, it isn't perfect. I'm trying to educate as many Delphi programmers as I can. Blog posts like this are intended to get straight to the point so there is no misinterpretation.

Anonymous said...

"This could result in data loss which is unacceptable"

Only in applications that knowingly mix Unicode and ANSI strings without ensuring correct conversion takes place.

Most existing applications are ANSI, by definition. By compelling those applications to go 100% Unicode (which they won't be, they will be ANSI applications struggling to operate in a Unicode world) you are surely increasing the risk of inadvertent data loss, not preventing it.

My guess - developers working on those applications (which is most if not all of them) will have no interest in a Unicode Delphi in this form.

Those that have a pressing need for Unicode... well, I doubt they have been hanging around waiting for Delphi - they'll be .net'ing it already or using TNT etc.

The only viable approach for a Unicode implementation in Delphi was one that allowed richness and variation: 100% ANSI or 100% Unicode, with safety and reassurance when mixing the two (i.e. if/when making the transition).

I suspect that the underlying reason for going exclusively 100% Unicode is that this is the fastest way of getting a product to market, i.e./e.g. without requiring parallel ANSI and Unicode VCL or a VCL that can accommodate either/or.

The devils are in the detail (and keeping those details secret is a HUGE mistake imho) but I suspect this will be An Earliest to Market approach that finds that there isn't much of a market at all.

I hope I'm wrong.

:(

Hallvards New Blog said...

> if each unit could define what string is mapped to then what would function signatures export?

It would export "AnsiString" or "UnicodeString", depending on the compiler switch setting.

> A unit based compiler option would make reading code very difficult since string changes types.

No more difficult than changing $H+/- today. Almost no code actually changes $H today, and I suspect that the same would be the case with a new Ansi/Unicode switch, but simply having it there (as a pure *compiler* switch only - no RTL/VCL implications at all) will give users a warm fuzzy feeling ;).

> The best option is to change your strings to AnsiString if you need them that way and move forward.

Maybe, maybe not. Giving the programmer the compiler switch option means more flexibility. A lot of code needs to be compiled in multiple versions - and some code even in BP 7.

> The change doesn't affect as much as you'd think.

No, but the compiler switch shouldn't be as expensive to implement as you think (I think) ;).

Anonymous said...

I also cannot understand why it is feasible to maintain the {$H} switch to select between UnicodeString/ShortString and not allow it to switch between UnicodeString/AnsiString/ShortString.

Can you explain why this is such a big problem? I don't mean that the VCL should come in both Unicode and Ansi versions - just that the STRING type should map to a selected string type.

This would allow one to use the STRING type like the INTEGER type - a type that says "this compiler's native string type" (just like the INTEGER type is an alias for "this compiler's native integer type") and would allow source code snippets to work in all three environments without change.

It's acceptable that you'd then get compiler errors in places where a VAR string type is of a different type - just like you do now with ShortString/AnsiString issues...

Can you explain why this road isn't feasible???

Patrick said...

Like Xepol said, I too thought string was just a macro for AnsiString.
Actually, that's how I interpreted this text in the help on Long strings (Delphi) :

"
The $H directive controls the meaning of the reserved word string when used alone in a type declaration.
The generic type string can represent either a long, dynamically-allocated string (the fundamental type AnsiString)
or a short, statically allocated string (the fundamental type ShortString).
"

So, the "generic type" string _represents_ AnsiString under $H+, while it _represents_ ShortString under $H-.
Back when $H was introduced, the use of 'string' in the RTL and VCL wasn't changed however.
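
A minimal sketch of what that help text describes, assuming a compiler that honors $H as a local switch. The program and variable names are hypothetical. SizeOf is used only to make the difference visible: a ShortString variable occupies 256 bytes, while a long string variable is just a pointer to heap data.

```delphi
program HSwitchSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

{$H-}
var
  S1: string;  // under $H-, string means ShortString

{$H+}
var
  S2: string;  // under $H+, string means the compiler's long string type

begin
  Writeln(SizeOf(S1));  // size of a ShortString: 256 bytes
  Writeln(SizeOf(S2));  // size of a pointer on the target platform
end.
```

Both variables are declared as string, yet they have different underlying types depending on the switch state at the point of declaration.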


Please consider that, in general, there are two sets of code:
- CodeGear's (mainly RTL, VCL and IDE)
- Ours (a much larger volume than CodeGear's I'm sure).

Changing the meaning of string to UnicodeString (with no way back) implies that all code which uses strings needs to be updated, verified and retested.
CodeGear is already in the process of doing this, great!
But this change _forces_ us to do the same, and will cost us a _huge_ amount of time (or we just stay with D2007 of course).


I'd like to argue that the change to Unicode should be done by updating the RTL,VCL and IDE to use UnicodeString explicitly.
This way, the meaning of 'string' can still be mapped to any one of ShortString, AnsiString or UnicodeString, just as the user desires.

Furthermore, new projects should have a setting which makes string represent UnicodeString by default,
while existing projects would keep their current setting (either AnsiString under $H+, or even ShortString under $H-).
This way, porting will be much cheaper, because we don't have to check up on every single use of string and PChar.

Note, I totally agree that both types have long been mis-used.
I promise to use PByte and AnsiString from now on - as will many others I guess ;-)
(The $POINTERMATH Allen blogged about will be a huge improvement too!)


One last question I haven't seen answered anywhere yet:

Will we be able to set the default encoding of UnicodeString?

Every project will have its own environment in which it will run,
and as such benefits the most from either using UTF-8, UTF-16 or whatever encoding applies best to the situation.

Anonymous said...

Patrick,

You suggest using a UnicodeString VCL/RTL combined with your own AnsiString units. I'm not sure whether that really solves your problem. I mean, your code likely calls RTL functions and uses TStringList etc. which are now Unicode. Doesn't this mean you'd have to retest all your code anyway?

Xepol said...

Chris -> while I feel that leaving the switch out will drive migration to the new unicode system and avoid loss of data when interacting with the VCL AND ultimately benefit everyone, I am merely not believing what I am told about the technical problems of introducing the switch.

Ultimately, I think leaving the switch out with the intent of driving long term adoption is probably wise. I just object to being told something isn't being done because it is technically infeasible when it has already been accomplished, and doing it again should just be a relatively simple extension of that work.

My objections are to the reasons given for the decision, not the decision itself. We're all big boys and girls, we can take it, honest.

(besides, we are all problem solvers. When you tell us something is a problem, we're all going to try to solve it, at least in principle, before we agree...)

Anonymous said...

@ Xepol

I think the underlying reason is abundantly clear and comprises two technical aspects and one non-technical:

T1: To support a migratory path from ANSI applications to Unicode would have required either two parallel VCLs - one ANSI, one Unicode - or a unified ANSI/Unicode VCL able to handle both, and accompanying IDE/design-time package interop issues.

In fact, this could be moot: if anything was acceptable to go unilaterally "100% Unicode", the VCL was probably it.


T2: Enabling applications to mix string types with compiler warnings where appropriate (assigning a Unicode string to an ANSI or vice versa - codepage required warning) and RTL support where possible.

In the case of Unicode-to-ANSI assignment, for example, extending RTTI for strings to include full encoding metrics would have enabled this.

i.e. for an AnsiString the RTTI would identify it as ANSI and additionally identify its codepage (perhaps as simple as "system" or "not-system").

For a Unicode String the RTTI would identify the encoding - UTF8, UTF16 etc.

Then the compiler could inject lossless conversions where necessary and emit warnings where appropriate, but applications would be free to mix ANSI and Unicode and to use such Unicode encodings as were most appropriate.


NT: The non-technical issue....

I don't believe that either of the above issues were insurmountable. I suspect that the overriding problem was one of TIME.

CodeGear have to get a release out in Q1 2008 to justify their promotion of SA and to satisfy a need for a revenue spike in this Q.

They may hit their delivery target, but I somewhat doubt they will see the size of spike they are hoping for if they expect all those developers with ANSI applications to willingly stump up for an upgrade that is going to incur a wide range of penalties, both in migrating their code and in the runtimes of their code.

Xepol said...

I doubt we'll see a Q1 release.

I'm on SA, and I doubt that a Q1 release would justify it to me anyways. Good feedback, like what is going on with the Unicode stuff right now, does tho.

I kinda like having a clue whether I'm gonna get shipped another goose egg like D2005.

Actually, guaranteeing a beta seat to all SA members would probably be enough for most people to go SA.
