=========================
Fiona and String Encoding
=========================

Reading
-------

Fiona returns every record attribute of type 'str' as a Python unicode string.
The source data, however, is stored in some encoding: either a standard one
(such as ISO-8859-1 or UTF-8) or a format-specific one. How do we get from
encoded strings to Python unicode? ::

  encoded File | (decode?) OGR (encode?) | (decode) Fiona
  
                E_f           R           E_i

The internal encoding `E_i` is used by the ``FeatureBuilder`` class to create
Fiona's record dicts. `E_f` is the encoding of the data file. `R` is ``True``
if OGR is recoding record attribute values to UTF-8 (a recent feature that
isn't implemented for all format drivers, hence the question marks in the
sketch above), else ``False``.

The value of `E_i` is determined like this::

  E_i = (R and 'utf-8') or E_f
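
The rule above can be sketched as a small Python function. The name
``internal_encoding`` and its parameter names are hypothetical, chosen only for
illustration; this is not Fiona's actual code.

```python
def internal_encoding(ogr_recodes, file_encoding):
    """Return E_i given R (ogr_recodes) and E_f (file_encoding).

    Hypothetical helper mirroring ``E_i = (R and 'utf-8') or E_f``.
    """
    # If OGR recodes attribute values to UTF-8, decode as UTF-8;
    # otherwise decode with the data file's own encoding.
    return "utf-8" if ogr_recodes else file_encoding


print(internal_encoding(True, "iso-8859-1"))
print(internal_encoding(False, "iso-8859-1"))
```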

In the real world of sloppy data, we may not know the exact encoding of the
data file. Fiona's best guess at it is this::

  E_f = E_u or (R and E_o) or (S and 'iso-8859-1') or E_p

`E_u`, here, is any encoding provided by the programmer (through the
``Collection`` constructor). `E_o` is an encoding detected by OGR (which
doesn't provide an API to get the detected encoding). `S` is ``True`` if the
file is a Shapefile (because ISO-8859-1 is that format's default). `E_p` is
the value of ``locale.getpreferredencoding()``. The terms are tried from left
to right and the first one that yields an encoding wins.
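
As a rough sketch, the fallback chain for reading could be written as below.
The function ``guess_file_encoding`` and all of its parameter names are
hypothetical. In particular, ``ogr_detected`` stands in for `E_o`, which a
caller could not actually obtain because OGR has no API for it.

```python
import locale


def guess_file_encoding(user_encoding=None, ogr_recodes=False,
                        ogr_detected=None, is_shapefile=False):
    """Hypothetical helper mirroring
    ``E_f = E_u or (R and E_o) or (S and 'iso-8859-1') or E_p``.
    """
    return (
        user_encoding                          # E_u: given by the programmer
        or (ogr_recodes and ogr_detected)      # R and E_o
        or (is_shapefile and "iso-8859-1")     # S: Shapefile default
        or locale.getpreferredencoding()       # E_p: locale fallback
    )


# An explicit encoding always takes precedence.
print(guess_file_encoding(user_encoding="cp1252", is_shapefile=True))
# With nothing else known, a Shapefile falls back to ISO-8859-1.
print(guess_file_encoding(is_shapefile=True))
```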

Bottom line: if you know that your data file has an encoding other than
ISO-8859-1, specify it. If you don't know what the encoding is, you can let the
format driver try to figure it out (requires GDAL 1.9.1+).
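
To see why a wrong guess matters, here is a small illustration that uses only
the Python standard library and does not call Fiona. The sample string is
made up. UTF-8 bytes decoded as ISO-8859-1 produce no error, only garbled
text.

```python
text = "Málaga"

# Bytes as they might sit in a data file written with UTF-8.
raw = text.encode("utf-8")

# Decoding with the wrong encoding succeeds silently but garbles the text.
wrong = raw.decode("iso-8859-1")

# Decoding with the correct encoding recovers the original string.
right = raw.decode("utf-8")

print(wrong)  # MÃ¡laga
print(right)  # Málaga
```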

Writing
-------

On the writing side::

  Fiona (encode) | (decode?) OGR (encode?) | encoded File
  
                E_i           R           E_f

As above, `E_i` is derived from `R` and `E_f`. For writing, `E_f` is::

  E_f = E_u or (S and 'iso-8859-1') or E_p
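
A minimal sketch of the writing rule follows. The function
``file_encoding_for_write`` and its parameters are hypothetical names used
only for illustration, not part of Fiona's API.

```python
import locale


def file_encoding_for_write(user_encoding=None, is_shapefile=False):
    """Hypothetical helper mirroring
    ``E_f = E_u or (S and 'iso-8859-1') or E_p``.
    """
    return (
        user_encoding                          # E_u: given by the programmer
        or (is_shapefile and "iso-8859-1")     # S: Shapefile default
        or locale.getpreferredencoding()       # E_p: locale fallback
    )


# Encode an attribute value the way it would be stored in a Shapefile
# when no encoding has been specified.
encoded = "Málaga".encode(file_encoding_for_write(is_shapefile=True))
print(encoded)  # b'M\xe1laga'
```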

Appending
---------

The diagram is the same as in the Writing section, but `E_f` is determined as
in the Reading section.