Genji-mon

Frank Ellermann nobody at xyzzy.claranet.de
Thu Dec 18 07:12:24 CET 2003


y.kohmoto wrote:

>>> the number of all characters in Unicode = 2^16 is too few
[...]
>> Unicode allows also 32-bit codes, and indeed many lesser
>> used historic scripts are now placed there (over code point
>> 65535).
 
> I have thought that Unicode is 16-bit codes, if it allows 
> 32-bit then it is enough.

Essentially Unicode is only an enumeration, and there's no limit.
Any limits you find depend on the encoding, e.g. UTF-8, UTF-7
(unusual), UTF-16 (plus byte order), and maybe more.  UTF-8 is
good enough to encode UCS-4 (that's Unicode up to 4 * 8 bits).
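
As a small illustration of the point that the limits sit in the
encoding and not in the enumeration, here is a sketch that encodes one
code point above 65535 with Python's built-in codecs.  The code point
U+10330 (in the Gothic block) is only an assumed example and does not
come from the original message.

```python
# One code point above 65535, written out in several encodings.
# U+10330 is an arbitrary example chosen for this sketch.
cp = 0x10330
ch = chr(cp)

# "utf-16-be" / "utf-16-le" make the byte order ("direction") explicit.
names = ("utf-8", "utf-16-be", "utf-16-le", "utf-7")
encodings = {name: ch.encode(name) for name in names}

for name, data in encodings.items():
    print(f"{name:10} {len(data)} bytes: {data.hex(' ')}")
```

The same code point needs four bytes in UTF-8 and a surrogate pair
(two 16-bit units) in UTF-16, whose byte sequence depends on the
chosen byte order.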

We could create a finite sequence for UTF-8, "max. UCS-4 code
point encoded by 1..6 UTF-8 bytes":

a(1) = 0xxxxxxx              = 2^7  -1 = 127
a(2) = 110xxxxx 10xxxxxx     = 2^11 -1 = 2047
a(3) = 1110xxxx + 2 * 6 bits = 2^16 -1 = 65535
a(4) = 11110xxx + 3 * 6 bits = 2^21 -1 = 2097151
a(5) = 111110xx + 4 * 6 bits = 2^26 -1 = 67108863
a(6) = 1111110x + 5 * 6 bits = 2^31 -1 = 2147483647
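
The table can be checked with a short sketch.  It follows the 1..6
byte scheme listed above (the original UTF-8 definition as in RFC
2279); note that RFC 3629 later restricted UTF-8 to at most four
bytes and code points up to U+10FFFF.  The function names
max_code_point and encode are made up for this illustration.

```python
def max_code_point(n: int) -> int:
    """Largest value that fits into an n-byte sequence of the scheme above."""
    if n == 1:
        return 2**7 - 1                    # 0xxxxxxx
    # The lead byte carries 7 - n payload bits, each of the n - 1
    # continuation bytes (10xxxxxx) carries 6 bits.
    payload_bits = (7 - n) + 6 * (n - 1)
    return 2**payload_bits - 1


def encode(cp: int) -> bytes:
    """Encode cp with the shortest of the 1..6 byte forms (illustrative only)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    for n in range(1, 7):
        if cp <= max_code_point(n):
            break
    else:
        raise ValueError("code point does not fit into 6 bytes")
    if n == 1:
        return bytes([cp])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))    # 10xxxxxx, lowest 6 bits first
        cp >>= 6
    lead = ((0xFF << (8 - n)) & 0xFF) | cp  # n one-bits, a zero, then payload
    return bytes([lead] + tail[::-1])


for n in range(1, 7):
    print(f"a({n}) = {max_code_point(n)}")
```

For code points up to U+10FFFF the result of encode agrees with
Python's own codec, e.g. encode(0x20AC) == "\u20ac".encode("utf-8").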

> A popular Kanji dictionary "Dai Ji Rin" contains 200000
> Kanjis.

No problem with UTF-8.  In theory you could add a(7) with lead
byte 11111110 + 6 * 6 bits, for code points up to 2^36 -1.
                              Bye, Frank





