File: encoding.md

package info (click to toggle)
php-league-csv 9.24.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 39,744 kB
  • sloc: php: 13,447; javascript: 80; makefile: 33; xml: 29
file content (90 lines) | stat: -rw-r--r-- 3,589 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
---
layout: default
title: CSV document charset encoding
---

# BOM and CSV encoding

Depending on the software you are using to read your CSV you may need to adjust the CSV document outputting script.

<p class="message-warning">To work as intended you CSV object needs to support the <a href="/9.0/connections/filters/">stream filter API</a></p>

## MS Excel on Windows

On Windows, MS Excel expects an UTF-8 encoded CSV with its corresponding `BOM` character. To fulfill this requirement, you simply need to add the `UTF-8` `BOM` character if needed as explained below:

```php
use League\Csv\Bom;
use League\Csv\Reader;

$reader = Reader::createFromPath('/path/to/my/file.csv', 'r');
//let's set the output BOM
$reader->setOutputBOM(Bom::Utf8);
//let's convert the incoming data from iso-88959-15 to utf-8
$reader->appendStreamFilter('convert.iconv.ISO-8859-15/UTF-8');
//BOM detected and adjusted for the output
echo $reader->getContent();
```

<p class="message-info">The conversion is done with the <code>iconv</code> extension using its bundled stream filters.</p>

## MS Excel on MacOS

On a MacOS system, MS Excel requires a CSV encoded in `UTF-16 LE` using the `tab` character as delimiter. Here's an example on how to meet those requirements using the `League\Csv` package.

```php
use League\Csv\Bom;
use League\Csv\CharsetConverter;
use League\Csv\Reader;
use League\Csv\Writer;

//the current CSV is ISO-8859-15 encoded with a ";" delimiter
$origin = Reader::createFromPath('/path/to/french.csv', 'r');
$origin->setDelimiter(';');

//let's use stream resource
$writer = Writer::createFromStream(fopen('php://temp', 'r+'));
//let's set the output BOM
$writer->setOutputBOM(Bom::Utf16Le);
//we set the tab as the delimiter character
$writer->setDelimiter("\t");
//let's convert the incoming data from iso-88959-15 to utf-16
CharsetConverter::addTo($writer, 'ISO-8859-15', 'UTF-16');
//we insert csv data
$writer->insertAll($origin);
//all is good let's output the results
$writer->output('mycsvfile.csv');
```

<p class="message-info">The conversion is done with the <code>mbstring</code> extension using the <a href="/9.0/converter/charset/">League\Csv\CharsetConverter</a>.</p>

## Skipping The BOM sequence with the Reader class

<p class="message-info">Since version <code>9.9.0</code>.</p>

In order to ensure the correct removal of the sequence and avoid bugs while parsing the CSV, the filter can skip the
BOM sequence completely when using the `Reader` class and convert the CSV content from the BOM sequence encoding charset
to UTF-8. To work as intended call the `Reader::includeInputBOM` method to ensure the default BOM removal behaviour is disabled
and add the stream filter to you reader instance using the static method `CharsetConverter::skipBOM` method;

```php
<?php

use League\Csv\Bom;
use League\Csv\Reader;
use League\Csv\CharsetConverter;

$input = Bom::Utf16Be->value."john,doe,john.doe@example.com\njane,doe,jane.doe@example.com\n";
$document = Reader::createFromString($input);
$document->includeInputBOM(); // de-activate the default skipping mechanism
CharsetConverter::addBOMSkippingTo($document);
var_dump([...$document]);
// returns the document content without the skipped BOM sequence 
// [
//     ['john', 'doe', 'john.doe@example.com'],
//     ['jane', 'doe', 'jane.doe@example.com'],
// ]
```

<p class="message-warning">Once the filter is applied, the <code>Reader</code> instance looses all information regarding its
own BOM sequence. <strong>The sequence is still be present but the instance is no longer able to detect it</strong>.</p>