[seqfan] Re: Windows text editor

Donald Alan Morrison donmorrison at gmail.com
Thu Jan 20 01:54:11 CET 2011


On 1/19/11 4:10 PM, Russ Cox wrote:
>> kb> Russ Cox told me that the new uploader expects text files to be encoded
>> kb> in UTF-8.
>>
>> I hope that b-files stay in plain ASCII. The occurrence of (non-ASCII)
>> byte-order-marks (BOMs) in UTF files (which are not shown by all editors by
>> default and make parsing more difficult in basically all standard languages)
>> is not a nice feature.
> 
> The b files have a strict format; byte order marks are not allowed.
> 
> Russ

UTF-8 does not require a BOM anyway.

http://en.wikipedia.org/wiki/UTF-8#Advantages
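
As a quick illustration (a sketch only; the sample text is made up, and
I am assuming Python's standard codecs here, not whatever the uploader
actually uses): plain "utf-8" encoding writes no BOM, and for pure-ASCII
content the bytes are identical to ASCII. Only the separate "utf-8-sig"
codec prepends one.

```python
import codecs

# Hypothetical b-file content, pure ASCII.
text = "1 1\n2 1\n3 2\n"

plain = text.encode("utf-8")       # no BOM is written
signed = text.encode("utf-8-sig")  # same bytes with a 3-byte BOM prepended

# For ASCII-only text, UTF-8 and ASCII produce identical bytes.
same_as_ascii = plain == text.encode("ascii")

# The "utf-8-sig" output is the BOM followed by the plain encoding.
has_bom = signed.startswith(codecs.BOM_UTF8)
bom_length = len(signed) - len(plain)
```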

If the high bit is set in any byte, then that byte is part of a
non-ASCII sequence, and the convention is to assume UTF-8 when no BOM
is present.  So either way, your decision flow chart is clear:

Has BOM? Yes -> Reject.
Is any high bit set? Yes -> Reject, unless you want to try UTF-8
validation, which would involve a lookup table of which "code points"
should be supported by a "standard" Unicode font such as Linux
Libertine, Charis SIL, or Gentium.
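
The flow chart above could be sketched roughly like this. The function
name, the return labels, and the allow_utf8 switch are my own
placeholders, not anything from the uploader. Note that the UTF-8 branch
only checks that the bytes are well-formed UTF-8; it does not attempt
the per-font code point lookup, which would need a table I am not
supplying here.

```python
import codecs

# BOMs to refuse outright (UTF-8 plus both UTF-16 byte orders).
# A UTF-32 LE BOM begins with the UTF-16 LE one, so it is caught too.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def classify(data: bytes, allow_utf8: bool = True) -> str:
    """Classify raw file bytes following the two-question flow chart."""
    # Step 1: has BOM? Yes -> reject.
    if data.startswith(_BOMS):
        return "reject-bom"

    # Step 2: is any high bit set? No -> plain ASCII, accept.
    if all(byte < 0x80 for byte in data):
        return "ascii"

    # High bit set and the caller does not want to try UTF-8 validation.
    if not allow_utf8:
        return "reject-non-ascii"

    # Otherwise assume UTF-8 (no BOM) and check that it decodes cleanly.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "reject-invalid-utf8"
    return "utf-8"
```

For a strict ASCII-only b-file check, you would call it with
allow_utf8=False and accept only the "ascii" result.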

Donald


