<HTML>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!-- Created on September, 27  2004 by texi2html 1.64 -->
<!-- 
Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
            Karl Berry  <karl@freefriends.org>
            Olaf Bachmann <obachman@mathematik.uni-kl.de>
            and many others.
Maintained by: Olaf Bachmann <obachman@mathematik.uni-kl.de>
Send bugs and suggestions to <texi2html@mathematik.uni-kl.de>
 
-->
<HEAD>
<TITLE>ne's manual: The Encoding Mess</TITLE>

<META NAME="description" CONTENT="ne's manual: The Encoding Mess">
<META NAME="keywords" CONTENT="ne's manual: The Encoding Mess">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META NAME="Generator" CONTENT="texi2html 1.64">

</HEAD>

<BODY LANG="" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#800080" ALINK="#FF0000">

<A NAME="SEC178"></A>
<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_7.html#SEC177"> &lt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_9.html#SEC179"> &gt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_9.html#SEC179"> &lt;&lt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne.html#SEC_Top"> Up </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_9.html#SEC179"> &gt;&gt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne.html#SEC_Top">Top</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_toc.html#SEC_Contents">Contents</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_13.html#SEC183">Index</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_abt.html#SEC_About"> ? </A>]</TD>
</TR></TABLE>
<H1> 8. The Encoding Mess </H1>
<!--docid::SEC178::-->
<P>

The original <CODE>ne</CODE> handled 8-bit text files, and assumed that every
byte coming from the keyboard could be output to the terminal. No other
assumption was made: for instance, the up/down casing functions did not
assume a particular encoding for non-US-ASCII characters. This choice
had a significant advantage: <CODE>ne</CODE> could easily handle several
different encodings, with only minor nuisances for the end user.
</P><P>

Since version 1.30, <CODE>ne</CODE> supports UTF-8. It can use UTF-8 for its
input/output, and it can also interpret one of its buffers as containing
UTF-8 encoded text, acting accordingly. Note that the buffer content is
actual UTF-8 text: <CODE>ne</CODE> does not use wide characters. As a
positive side effect, <CODE>ne</CODE> can fully support the ISO-10646
standard, while non-UTF-8 texts still occupy exactly one byte per
character.
</P><P>

More precisely, <EM>any</EM> piece of text in <CODE>ne</CODE> is classified as
US-ASCII, 8-bit or UTF-8. A US-ASCII text contains only US-ASCII
characters. An 8-bit text has a one-to-one correspondence between
characters and bytes, whereas a UTF-8 text is interpreted as UTF-8.  Of
course, this raises a difficult question: <EM>when</EM> should a buffer be
classified as UTF-8?
</P><P>

Character encodings are a mess. There is nothing we can do to change
this fact, as character encodings are <EM>metadata that modify data
semantics</EM>. The same file may represent different texts of different
lengths when interpreted with different encodings. Thus, there is no safe
way of guessing the encoding of a file.
</P><P>
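
As a minimal illustration of the point above (it is not taken from
<CODE>ne</CODE>'s sources), the following Python sketch reads the same two
bytes under two encodings. The choice of ISO-8859-1 as the 8-bit encoding is
an assumption made only for this example.
</P>
<PRE>
```python
# The same two bytes, read under two different encodings.
data = bytes([0xC3, 0xA8])

as_utf8 = data.decode("utf-8")         # one character: e with grave accent
as_latin1 = data.decode("iso-8859-1")  # two characters: A with tilde, diaeresis

print(len(data), len(as_utf8), len(as_latin1))  # prints: 2 1 2
```
</PRE>
<P>

The bytes never change; only the interpretation does, which is why the
length of the text in characters cannot be known without knowing the encoding.
</P><P>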

<CODE>ne</CODE> stays on the safe side: it will never try to convert a file
from one encoding to another. It can, however, interpret the data
contained in a buffer according to an encoding: in other words,
encodings are truly treated as metadata. You can switch off UTF-8
at any time, and see the same buffer as a standard 8-bit file.
</P><P>

Moreover, <CODE>ne</CODE> uses a <EM>lazy</EM> approach to the problem. First of
all, unless the UTF-8 automatic detection flag is set
(see section <A HREF="ne_4.html#SEC123">4.9.26 UTF8Auto</A>), no attempt is ever made to consider a file as UTF-8
encoded.  Every file, clip, command line, etc., is first scanned for
non-US-ASCII characters: if it is entirely made of US-ASCII characters,
it is classified as US-ASCII. A US-ASCII piece of text is compatible
with anything else: it may be pasted into any buffer, or, if it is a
buffer, it may accept any form of text. Buffers classified as US-ASCII
are distinguished by an <SAMP>`A'</SAMP> on the status bar.
</P><P>
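
The lazy classification described above can be sketched roughly as follows.
This is only an illustration in Python, not <CODE>ne</CODE>'s actual code: the
function name <CODE>classify</CODE> and the parameter <CODE>utf8_auto</CODE>
(standing in for the UTF8Auto flag) are hypothetical, and treating "valid
UTF-8" as the detection criterion is an assumption.
</P>
<PRE>
```python
def classify(data, utf8_auto):
    """Classify a piece of text (bytes) as 'ascii', 'utf8' or '8bit'.

    Hypothetical helper: utf8_auto stands in for ne's UTF8Auto flag.
    """
    if data.isascii():
        return "ascii"      # compatible with any other encoding
    if utf8_auto:
        try:
            data.decode("utf-8")
            return "utf8"   # the non-ASCII bytes form valid UTF-8
        except UnicodeDecodeError:
            pass            # assumption: invalid UTF-8 falls back to 8-bit
    return "8bit"           # one byte per character


print(classify(b"plain text", True))          # ascii
print(classify(bytes([0xC3, 0xA8]), True))    # utf8
print(classify(bytes([0xC3, 0xA8]), False))   # 8bit
print(classify(bytes([0xE8]), True))          # 8bit
```
</PRE>
<P>

Note that with the flag off, the sketch never reports UTF-8, mirroring the
statement that no attempt is made to consider a file as UTF-8 encoded.
</P><P>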

As soon as a user action forces a choice of encoding (e.g., an accented
character is typed, or a UTF-8-encoded clip is pasted), <CODE>ne</CODE> fixes
the mode to 8-bit or UTF-8 (when there is a choice, this depends on
the value of the <CODE>UTF8Auto</CODE> flag; see section <A HREF="ne_4.html#SEC123">4.9.26 UTF8Auto</A>). Of course, in some cases this may
be impossible, and in that case an error will be reported.
</P><P>
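
One way to picture the compatibility rule is the small Python sketch below.
It is an assumption-laden illustration, not <CODE>ne</CODE>'s implementation:
the names <CODE>merge_encodings</CODE> and <CODE>EncodingMismatch</CODE> are
hypothetical, and the sketch assumes that the "impossible" case is an attempt
to mix 8-bit and UTF-8 text.
</P>
<PRE>
```python
class EncodingMismatch(Exception):
    """Raised when two pieces of text cannot share one buffer."""


def merge_encodings(buffer_enc, clip_enc):
    """Return the buffer's encoding after text of clip_enc is added.

    Encodings are the strings 'ascii', '8bit' and 'utf8'.
    """
    if clip_enc == "ascii":
        return buffer_enc      # US-ASCII text fits into any buffer
    if buffer_enc == "ascii":
        return clip_enc        # a US-ASCII buffer accepts any text
    if buffer_enc == clip_enc:
        return buffer_enc      # same encoding on both sides
    # Assumption: mixing 8-bit and UTF-8 text is the error case.
    raise EncodingMismatch(f"cannot mix {buffer_enc} and {clip_enc}")


print(merge_encodings("ascii", "utf8"))   # utf8
print(merge_encodings("8bit", "ascii"))   # 8bit
```
</PRE>
<P>

In this sketch a US-ASCII buffer is promoted only when non-ASCII text
arrives, which matches the lazy behaviour described earlier.
</P><P>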

All this happens behind the scenes, and it is designed so that in 99% of
the cases there is no need to think about encodings. In any case, should
<CODE>ne</CODE>'s behaviour not match your needs, you can always change
the level of UTF-8 support at run time.
</P><P>

<A NAME="Some Notes for the Amiga User"></A>
<HR SIZE="6">
<TABLE CELLPADDING=1 CELLSPACING=1 BORDER=0>
<TR><TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_9.html#SEC179"> &lt;&lt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_9.html#SEC179"> &gt;&gt; </A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT"> &nbsp; <TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne.html#SEC_Top">Top</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_toc.html#SEC_Contents">Contents</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_13.html#SEC183">Index</A>]</TD>
<TD VALIGN="MIDDLE" ALIGN="LEFT">[<A HREF="ne_abt.html#SEC_About"> ? </A>]</TD>
</TR></TABLE>
<BR>  
<FONT SIZE="-1">
This document was generated
by <I>Sebastiano Vigna</I> on <I>September, 27  2004</I>
using <A HREF="http://www.mathematik.uni-kl.de/~obachman/Texi2html"><I>texi2html</I></A>
</FONT>

</BODY>
</HTML>