Fundamentals - Unicode units
These Unicode units provide common functions needed to use Unicode strings in your Delphi application.
The following units make up the collection:
- Unicode codecs
Unicode codecs are encoders and decoders for converting various character
sets and encodings to and from Unicode WideStrings.
Character sets include: ISO8859, KOI8, MacLatin2 and MacCyrillic.
Encodings: UTF-8 and UTF-16.
Functions include: UTF8ToWideString, ISO8859_1ToUTF8 and ASCIIToWideString.
Source code (HTML)
- Unicode
More than 40 Unicode character property functions.
For example: IsWhiteSpace, IsUpperCase, IsLetter, IsPunctuation and DecimalDigitValue.
More than 30 Unicode string functions for using WideStrings and null-terminated WideStrings.
For example: WideMatch, WidePos, WideTrim and WideReplace.
Source code (HTML)
- Unicode stream readers
Classes for manipulating Unicode strings from streams.
File and Memory stream reader implementations are provided.
Source code (HTML)
These functions are fast. On a 450 MHz machine, the codecs
reach speeds of up to 40 Mb/s and the reader up to 10 Mb/s.
Using Unicode in your Delphi application
Unicode Encodings
Unicode consists of tens of thousands of characters, each of which has a
unique code. To be able to store any Unicode character you'll need 4 bytes
(a 32-bit value) for every character. This encoding of Unicode is called
UCS4.
To simplify matters, Unicode defines almost all commonly
used characters in the first 65536 characters. This means that most Unicode
strings can be encoded using 2 bytes (a 16-bit value) for every character.
This encoding is called UCS2, which in Delphi is represented using
WideChar and WideString.
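The storage cost of the two fixed-width encodings can be checked with a short, language-neutral sketch. It is written in Python rather than Delphi and is not part of the Fundamentals library. It assumes that Python's built-in "utf-32-le" and "utf-16-le" codecs give the same bytes as UCS4 and UCS2 for characters among the first 65536; the helper name encoded_size is made up for this illustration.

```python
def encoded_size(text, codec):
    """Return the number of bytes `text` occupies in the given codec."""
    return len(text.encode(codec))

# 7 ASCII characters plus CYRILLIC CAPITAL LETTER ZHE: 8 characters,
# all of them among the first 65536 Unicode characters.
sample = "Delphi \u0416"

ucs4_bytes = encoded_size(sample, "utf-32-le")  # 4 bytes per character
ucs2_bytes = encoded_size(sample, "utf-16-le")  # 2 bytes per character

# A character beyond the first 65536 (MUSICAL SYMBOL G CLEF) does not fit
# in a single 16-bit value; UTF-16 spends two 16-bit units on it.
beyond_16_bit = encoded_size("\U0001D11E", "utf-16-le")

print(ucs4_bytes, ucs2_bytes, beyond_16_bit)
```

The last line of the sketch shows why the 2-byte form covers "most" rather than all Unicode strings.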
To simplify matters further, Unicode defines the first 128 characters to be
identical to the characters from ASCII. An encoding that makes use of this
fact is UTF-8. UTF-8 is a variable-length encoding with several properties that
make it ideal for storing Unicode when the majority of characters are
ASCII characters. Among the properties of UTF-8 are:
- It stores ASCII characters as their ASCII value in one byte. In other words,
an ASCII string will not be changed by UTF-8.
- Non-ASCII characters are stored as sequences of more than one byte, and no
byte in such a sequence has an ASCII value. In other words, functions that
operate on ASCII strings byte by byte (for example, searching for an ASCII
delimiter) can transparently work on UTF-8 strings.
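Both properties can be checked with a small sketch. It is written in Python as a language-neutral illustration, not with the Fundamentals functions; the helper name non_ascii_bytes_are_high is hypothetical.

```python
ascii_text = "Hello, Delphi"
mixed_text = "na\u00efve \u0416"  # i with diaeresis, and Cyrillic ZHE

# Property 1: ASCII text is byte-for-byte unchanged by UTF-8.
ascii_unchanged = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

def non_ascii_bytes_are_high(text):
    """Check that every byte of a multi-byte UTF-8 sequence is >= 0x80,
    i.e. that no ASCII value appears inside such a sequence."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

# Consequence: a byte-oriented routine that only looks for an ASCII
# delimiter (here a space) can split UTF-8 data without corrupting it.
utf8_bytes = mixed_text.encode("utf-8")
parts = [piece.decode("utf-8") for piece in utf8_bytes.split(b" ")]

print(ascii_unchanged, non_ascii_bytes_are_high(mixed_text), parts)
```

Routines that count characters or index by position still need to be UTF-8 aware, since one character may occupy several bytes.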
Which encoding to use?
When you expect to work with lots of international text, use WideChar and
WideString. You should note that WideStrings are not reference counted
on Windows. This makes them less efficient to use than LongStrings.
When you expect to work with text which is mostly ASCII, but which may contain
occasional international text, use UTF8Strings. They use less memory for such
text and are reference counted in Delphi.
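The memory side of this trade-off can be estimated with a rough sketch, again in Python as a language-neutral illustration. It compares only the encoded byte counts and says nothing about Delphi's string headers or reference counting; the helper name sizes and the sample strings are made up for this example.

```python
def sizes(text):
    """Return the encoded size of `text` in bytes for UTF-8 and 16-bit units."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

mostly_ascii = "Total: 25 \u20ac"                  # 10 ASCII characters + euro sign
cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"  # 6 Cyrillic letters
cjk = "\u65e5\u672c\u8a9e"                         # 3 CJK ideographs

for label, text in [("mostly ASCII", mostly_ascii),
                    ("Cyrillic", cyrillic),
                    ("CJK", cjk)]:
    print(label, sizes(text))
```

For the mostly-ASCII sample UTF-8 is clearly smaller, for the Cyrillic sample the two are equal, and for the CJK sample the 16-bit form is smaller.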
How the Fundamentals Unicode library helps you
The functions in cUnicodeCodecs allow you to
convert various legacy character sets (such as ISO-8859-1, which is the default
for HTML pages, ASCII and Cyrillic encodings) to Unicode, which can be
stored as either WideStrings or UTF-8 strings.
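The conversion path described here (legacy bytes, to Unicode, to UTF-8) can be sketched as follows. This is a Python illustration using Python's own codecs. It does not show the signatures of the cUnicodeCodecs functions, and the helper name legacy_to_utf8 is hypothetical.

```python
def legacy_to_utf8(data, charset):
    """Decode bytes in a legacy character set to Unicode text,
    then re-encode that text as UTF-8."""
    text = data.decode(charset)
    return text.encode("utf-8")

# "cafe" with an acute accent on the e, as stored in ISO-8859-1:
# the accented letter is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = legacy_to_utf8(latin1_bytes, "iso-8859-1")

# The same helper handles a Cyrillic legacy encoding such as KOI8-R.
koi8_bytes = "\u0416".encode("koi8-r")
koi8_as_utf8 = legacy_to_utf8(koi8_bytes, "koi8-r")

print(utf8_bytes, koi8_as_utf8)
```

Going through Unicode as the intermediate form means each legacy character set needs only one decoder, whatever the target encoding is.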
Once you have Unicode strings, you can use the string functions in
cUnicode to work with them.
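To give a feel for what character property and string functions of this kind do, here is a sketch using rough Python analogues. These are Python standard library calls, not the cUnicode functions themselves, and the mapping to names such as IsWhiteSpace, IsUpperCase, DecimalDigitValue and WideTrim is an assumption made for illustration only.

```python
import unicodedata

em_space = "\u2003"      # EM SPACE, a non-ASCII white space character
zhe = "\u0416"           # CYRILLIC CAPITAL LETTER ZHE
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

is_white_space = em_space.isspace()               # cf. IsWhiteSpace
is_upper_case = zhe.isupper()                     # cf. IsUpperCase
digit_value = unicodedata.decimal(arabic_three)   # cf. DecimalDigitValue

# Trimming must recognise Unicode white space, not just ASCII blanks.
trimmed = (em_space + "text" + em_space).strip()  # cf. WideTrim

print(is_white_space, is_upper_case, digit_value, trimmed)
```

The point is that the answers depend on Unicode character data, so ASCII-only routines give wrong results for such input.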
Unicode links
The Unicode Consortium
Official site of the Unicode Consortium. Here you will find the latest technical
documents and data on Unicode.
Unicode Transformation Formats (UTF)
Detailed information on the UTF encodings.
Unicode in XML and other Markup Languages
Technical guidelines on the use of Unicode in conjunction with markup languages such as XML.
Zvon Character Search
Allows you to search the Unicode character database online.