Monday, 25 August 2014

Excel doesn't render correctly UTF-8 characters when it opens CSVs

Scenario #1: I create a CSV in UTF-8 encoding with Java, save the character 'è'
but Excel doesn't render it well.

Scenario #2: I create a CSV in UTF-8 encoding with Notepad, save the character 'è'
and Excel renders it well.

Explanation (from Wikipedia)
"BOM use is optional, and, if used, should appear at the start of the text stream.
... the BOM character may also indicate which of the several Unicode representations
the text is encoded in."
"In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file ... "       (Java is UTF-16)
 "The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF." (Windows is UTF-8)
 "many pieces of software on Microsoft Windows such as Notepad will not correctly
 read UTF-8 text unless it has only ASCII characters or it starts with the BOM,
 and will add a BOM to the start when saving text as UTF-8"

 => Scenario #1: notepad saves a CSV UTF-8 with the BOM U+EFBBBF therefore Excel opens it correctly
 => Scenario #2: but when the CSV UTF-8 is created with Java the Excel doesn't open it correctly

Demo on how to solve the problem on Scenario #1:
  Adding the prefix BOM U+EFBBBF on the CSV created programmatically

and now opening the CSV with excel  and it is correct!

Code solution for scenario #1 (creating CSV programmatically): add to generated CSV the
BOM suffix U+EFBBBF (for UTF-8 Operating Systems like Windows)

How to do it in Java
Java is UTF-16; therefore you have to write the UTF-16 BOM U+FEFF:

// BOM prefix U+FEFF for exported CSVs so that if the csv is openned with
// excel in UTF-8 OS (e.g. Windows) encoding is ok (e.g. char è)
char CSV_BOM = '\uFEFF';

writer.write(CSV_BOM); // writer is e.g. a BufferedWriter

Note: if you open the exported file with a hex editor like I did on the last image you will see the BOM U+EFBBBF.

No comments:

Post a Comment