Question or problem about Python programming:
I’m reading a series of source code files using Python and running into a Unicode BOM error. Here’s my code:

    bytes = min(32, os.path.getsize(filename))
    raw = open(filename, 'rb').read(bytes)
    result = chardet.detect(raw)
    encoding = result['encoding']

    infile = open(filename, mode, encoding=encoding)
    data = infile.read()
    infile.close()

    print(data)
As you can see, I’m detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
I’m guessing it’s trying to decode the BOM using the default character set and it’s failing. How do I remove the BOM from the string to prevent this?
How to solve the problem:
There is no reason to check whether a BOM exists or not: utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM does not exist:

    # Standard UTF-8 without BOM
    >>> b'hello'.decode('utf-8')
    'hello'
    >>> b'hello'.decode('utf-8-sig')
    'hello'

    # BOM encoded UTF-8
    >>> b'\xef\xbb\xbfhello'.decode('utf-8')
    '\ufeffhello'
    >>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
    'hello'
In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and don’t worry about it.
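The same codec can be passed straight to open() when reading files. The following is a minimal sketch, assuming two throwaway files created only for the demonstration; read_text_skip_bom is a hypothetical helper name, not something from the answer above.

```python
import codecs
import os
import tempfile


def read_text_skip_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    with open(path, encoding='utf-8-sig') as f:
        return f.read()


# Demonstration with two throwaway files: one with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom.txt')
    without_bom = os.path.join(tmp, 'without_bom.txt')
    with open(with_bom, 'wb') as f:
        f.write(codecs.BOM_UTF8 + b'hello')
    with open(without_bom, 'wb') as f:
        f.write(b'hello')

    text_with_bom = read_text_skip_bom(with_bom)
    text_without_bom = read_text_skip_bom(without_bom)

# Both reads should yield the same text, with no '\ufeff' at the start.
print(repr(text_with_bom), repr(text_without_bom))
```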
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:

    import codecs
    import io
    import os

    import chardet

    # filename and mode are defined as in the question
    bytes = min(32, os.path.getsize(filename))
    raw = open(filename, 'rb').read(bytes)

    if raw.startswith(codecs.BOM_UTF8):
        encoding = 'utf-8-sig'
    else:
        result = chardet.detect(raw)
        encoding = result['encoding']

    infile = io.open(filename, mode, encoding=encoding)
    data = infile.read()
    infile.close()

    print(data)
I’ve composed a nifty BOM-based detector based on Chewie’s answer. It’s sufficient in the common use case where data can be either in a known local encoding or Unicode with BOM (that’s what text editors typically produce). More importantly, unlike chardet, it doesn’t do any random guessing, so it gives predictable results:

    import codecs

    def detect_by_bom(path, default):
        with open(path, 'rb') as f:
            raw = f.read(4)    # will read less if the file is smaller
        # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
        for enc, boms in (
                ('utf-8-sig', (codecs.BOM_UTF8,)),
                ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
                ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
            if any(raw.startswith(bom) for bom in boms):
                return enc
        return default
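The ordering note in the detector can be checked with the constants from the codecs module. This is only a small illustrative sketch; the sample text 'hi' and the variable names are made up for the demonstration.

```python
import codecs

# The UTF-32 LE BOM is the UTF-16 LE BOM followed by two zero bytes,
# so a UTF-16 check that runs first would also match UTF-32 LE data.
overlaps = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)

# Build UTF-32 LE data with an explicit BOM (independent of the host byte order).
raw = codecs.BOM_UTF32_LE + 'hi'.encode('utf-32-le')

matches_utf16_bom = raw.startswith(codecs.BOM_UTF16_LE)
decoded_as_utf32 = raw.decode('utf-32')   # correct codec
decoded_as_utf16 = raw.decode('utf-16')   # wrong codec: NUL characters appear

print(overlaps, matches_utf16_bom)
print(repr(decoded_as_utf32), repr(decoded_as_utf16))
```

The big-endian BOMs do not share a prefix, so only the little-endian pair needs this care.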
chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:

    #!/usr/bin/env python
    import chardet  # $ pip install chardet

    # detect file encoding
    with open(filename, 'rb') as file:
        raw = file.read(32)  # at most 32 bytes are returned
        encoding = chardet.detect(raw)['encoding']

    with open(filename, encoding=encoding) as file:
        text = file.read()
    print(text)
chardet may return 'UTF-XXBE' encodings that leave the BOM in the text. The 'BE' suffix should be stripped to avoid it, though it is easier to detect the BOM yourself at this point, e.g. as in @ivan_pozdeev’s answer.
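To illustrate why stripping the suffix helps: the endian-specific codecs treat a leading BOM as ordinary text, while the generic utf-16 and utf-32 codecs consume it. The sketch below assumes names of the form 'UTF-16LE' or 'UTF-16BE' (the exact strings chardet returns may differ between versions); drop_endianness is a hypothetical helper written for this example.

```python
import codecs


def drop_endianness(encoding):
    """Map names such as 'UTF-16LE' or 'utf-32-be' to 'utf-16' / 'utf-32'.

    Hypothetical helper; any other encoding name is returned unchanged.
    """
    name = encoding.lower().replace('_', '-')
    for family in ('utf-16', 'utf-32'):
        if name.startswith(family) and name.endswith(('le', 'be')):
            return family
    return encoding


raw = codecs.BOM_UTF16_LE + 'hello'.encode('utf-16-le')

# Endian-specific codec: the BOM survives as a '\ufeff' character.
kept = raw.decode('utf-16-le')

# Generic codec: the BOM is used to pick the byte order and then dropped.
stripped = raw.decode(drop_endianness('UTF-16LE'))

print(repr(kept), repr(stripped))
```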
If you get a UnicodeEncodeError while printing Unicode text to the Windows console, see "Python, Unicode, and the Windows console".
I find the other answers overly complex. There is a simpler way that doesn’t need dropping down into the lower-level idiom of binary file I/O, doesn’t rely on a character set heuristic (chardet) that’s not part of the Python standard library, and doesn’t need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn’t seem to have an analog in the UTF-16 family.
The simplest approach I’ve found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it’s there and/or adding/removing it is easy. To read a file with a possible BOM:
    BOM = '\ufeff'
    with open(filepath, mode='r', encoding='utf-8') as f:
        text = f.read()
        if text.startswith(BOM):
            text = text[1:]
This works with all the interesting UTF codecs (e.g.
utf-16be, …), doesn’t require extra modules, and doesn’t require dropping down into binary file processing or specific
To write a BOM:
    text_with_BOM = text if text.startswith(BOM) else BOM + text
    with open(filepath, mode='w', encoding='utf-16be') as f:
        f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
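A round trip shows both halves of this approach working together. This is a sketch under assumptions: the temporary file, the sample text 'hello', and the variable names are made up for the demonstration.

```python
import codecs
import os
import tempfile

BOM = '\ufeff'
text = 'hello'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')  # throwaway file for the demo

    # Write: prepend the BOM character unless the text already starts with it.
    text_with_bom = text if text.startswith(BOM) else BOM + text
    with open(path, mode='w', encoding='utf-16-be') as f:
        f.write(text_with_bom)

    # The codec turned the single BOM character into the UTF-16 BE byte sequence.
    with open(path, 'rb') as f:
        raw = f.read()

    # Read back with the same codec and strip the BOM character again.
    with open(path, mode='r', encoding='utf-16-be') as f:
        round_trip = f.read()
    if round_trip.startswith(BOM):
        round_trip = round_trip[1:]

print(raw[:2] == codecs.BOM_UTF16_BE, repr(round_trip))
```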
This is not, btw, to dismiss chardet. It can help when you have no information about what encoding a file uses. It’s just not needed for adding / removing BOMs.
A variant of @ivan_pozdeev’s answer for strings/exceptions (rather than files). I’m dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517):

    import codecs

    def detect_encoding(bytes_str):
        # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE.
        for enc, boms in (
                ('utf-8-sig', (codecs.BOM_UTF8,)),
                ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
                ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
            if any(bytes_str.startswith(bom) for bom in boms):
                return enc
        return 'utf-8'  # default

    def safe_exc_to_str(exc):
        # Python 2 only: unicode() does not exist in Python 3.
        try:
            return str(exc)
        except UnicodeEncodeError:
            return unicode(exc).encode(detect_encoding(exc.content))
Alternatively, this much simpler code (Python 2, since it relies on unicode()) deletes non-ASCII characters without much fuss:

    def just_ascii(str):
        return unicode(str).encode('ascii', 'ignore')
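For Python 3, where unicode() no longer exists, a rough equivalent might look like the sketch below. It is an adaptation rather than the original author’s code, and it returns a str instead of bytes. Note that it drops every non-ASCII character, not just the BOM.

```python
def just_ascii(text):
    """Return text with every non-ASCII character removed."""
    # Encode with errors='ignore' to drop unencodable characters,
    # then decode back so the result is a str rather than bytes.
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = just_ascii('\ufeffna\u00efve caf\u00e9')
print(cleaned)  # the BOM and the accented letters are gone
```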