Iterate over individual bytes in Python 3

Question or problem about Python programming:

When iterating over a bytes object in Python 3, one gets the individual bytes as ints:

>>> [b for b in b'123']
[49, 50, 51]

How to get 1-length bytes objects instead?

The following is possible, but it is not very obvious to the reader and most likely performs badly:

>>> [bytes([b]) for b in b'123']
[b'1', b'2', b'3']

How to solve the problem:

Solution 1:

If you are concerned about the performance of this code and an int is not a suitable interface for a byte in your case, then you should probably reconsider the data structures you use, e.g., use str objects instead.

You could slice the bytes object to get 1-length bytes objects:

L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
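
For example:

>>> bytes_obj = b'123'
>>> [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
[b'1', b'2', b'3']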

There is PEP 467 — Minor API improvements for binary sequences, which proposes a bytes.iterbytes() method (the PEP has not been accepted as of this writing, so the example below is hypothetical):

>>> list(b'123'.iterbytes())
[b'1', b'2', b'3']

Solution 2:

int.to_bytes

int objects have a to_bytes method which can be used to convert an int to its corresponding byte:

>>> import sys
>>> [i.to_bytes(1, sys.byteorder) for i in b'123']
[b'1', b'2', b'3']

As with some other answers, it’s not clear that this is more readable than the OP’s original solution: the length and byteorder arguments make it noisier, I think.
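
Note that on Python 3.11 and later, length defaults to 1 and byteorder defaults to 'big' (for a single byte the byte order is irrelevant anyway), which removes most of that noise:

>>> [i.to_bytes() for i in b'123']  # Python 3.11+
[b'1', b'2', b'3']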

struct.unpack

Another approach would be to use struct.unpack, though this might also be considered difficult to read, unless you are familiar with the struct module:

>>> import struct
>>> struct.unpack('3c', b'123')
(b'1', b'2', b'3')

(As jfs observes in the comments, the format string for struct.unpack can be constructed dynamically; in this case we know the number of individual bytes in the result must equal the number of bytes in the original bytestring, so struct.unpack(str(len(bytestring)) + 'c', bytestring) is possible.)
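
That looks like this:

>>> bs = b'123'
>>> struct.unpack(str(len(bs)) + 'c', bs)
(b'1', b'2', b'3')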

Performance

>>> import random, timeit
>>> bs = bytes(random.randint(0, 255) for i in range(100))

>>> # OP's solution
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bytes([b]) for b in bs]")
46.49886950897053

>>> # Accepted answer from jfs
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bs[i:i+1] for i in range(len(bs))]")
20.91463226894848

>>> # Leon's answer
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="list(map(bytes, zip(bs)))")
27.476876026019454

>>> # guettli's answer
>>> timeit.timeit(setup="from __main__ import iter_bytes, bs",        
                  stmt="list(iter_bytes(bs))")
24.107485140906647

>>> # user38's answer (with Leon's suggested fix)
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="[chr(i).encode('latin-1') for i in bs]")
45.937552741961554

>>> # Using int.to_bytes
>>> timeit.timeit(setup="from __main__ import bs;from sys import byteorder", 
                  stmt="[x.to_bytes(1, byteorder) for x in bs]")
32.197659170022234

>>> # Using struct.unpack, converting the resulting tuple to list
>>> # to be fair to other methods
>>> timeit.timeit(setup="from __main__ import bs;from struct import unpack", 
                  stmt="list(unpack('100c', bs))")
1.902243083808571

struct.unpack seems to be at least an order of magnitude faster than the other methods, presumably because the whole loop runs in C inside a single call instead of byte-by-byte in Python. int.to_bytes, on the other hand, performs worse than most of the “obvious” approaches.

Solution 3:

Since Python 3.5 you can use %-formatting with bytes and bytearray:

[b'%c' % i for i in b'123']

output:

[b'1', b'2', b'3']
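
Unlike chr(i).encode() (see Solution 7 below), b'%c' handles the full 0-255 range directly, since %c inserts the byte itself rather than going through a text encoding:

>>> b'%c' % 233
b'\xe9'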

The above solution is 2-3 times faster than your initial approach; if you want an even faster solution, I suggest using numpy.frombuffer:

import numpy as np
np.frombuffer(b'123', dtype='S1')

output:

array([b'1', b'2', b'3'], 
      dtype='|S1')

The NumPy solution is ~10% faster than struct.unpack (I used the same performance test as @snakecharmerb, against 100 random bytes).
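
If you need a plain list of bytes objects rather than a NumPy array, tolist() converts the S1 elements back to bytes:

>>> np.frombuffer(b'123', dtype='S1').tolist()
[b'1', b'2', b'3']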

Solution 4:

I thought it might be useful to compare the runtimes of the different approaches so I made a benchmark (using my library simple_benchmark):

[Benchmark plot: runtime of each approach vs. bytes object length]

Probably unsurprisingly, the NumPy solution is by far the fastest for large bytes objects.

But if a resulting list is desired, then both the NumPy solution (with tolist()) and the struct solution are much faster than the other alternatives.

I didn’t include guettli’s answer because it’s almost identical to jfs’s solution; it just uses a generator function instead of a comprehension.

import numpy as np
import struct
import sys

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

@b.add_function()
def jfs(bytes_obj):
    return [bytes_obj[i:i+1] for i in range(len(bytes_obj))]

@b.add_function()
def snakecharmerb_tobytes(bytes_obj):
    return [i.to_bytes(1, sys.byteorder) for i in bytes_obj]

@b.add_function()
def snakecharmerb_struct(bytes_obj):
    return struct.unpack(str(len(bytes_obj)) + 'c', bytes_obj)

@b.add_function()
def Leon(bytes_obj):
    return list(map(bytes, zip(bytes_obj)))

@b.add_function()
def rusu_ro1_format(bytes_obj):
    return [b'%c' % i for i in bytes_obj]

@b.add_function()
def rusu_ro1_numpy(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1')

@b.add_function()
def rusu_ro1_numpy_tolist(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1').tolist()

@b.add_function()
def User38(bytes_obj):
    return [chr(i).encode() for i in bytes_obj]

@b.add_arguments('byte object length')
def argument_provider():
    # bytes objects from 4 bytes (2**2) up to 128 KiB (2**17)
    for exp in range(2, 18):
        size = 2**exp
        yield size, b'a' * size

r = b.run()
r.plot()

Solution 5:

A trio of map(), bytes() and zip() does the trick:

>>> list(map(bytes, zip(b'123')))
[b'1', b'2', b'3']
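
It works because zip() called with a single iterable yields 1-tuples, so each int becomes a one-element tuple that bytes() then packs back into a 1-length bytes object:

>>> list(zip(b'123'))
[(49,), (50,), (51,)]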

However, I don’t think it is any more readable than [bytes([b]) for b in b'123'], or that it performs any better.

Solution 6:

I use this helper method:

def iter_bytes(my_bytes):
    # yield successive 1-length slices instead of ints
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]

Works in both Python 2 and Python 3.
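
Usage:

>>> list(iter_bytes(b'123'))
[b'1', b'2', b'3']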

Solution 7:

A short way to do this:

[chr(i).encode() for i in b'123']
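
Be aware that chr(i).encode() defaults to UTF-8, so any byte value above 0x7F comes back as two bytes; encoding with 'latin-1' (Leon's suggested fix, used in the benchmarks above) keeps the mapping one-to-one:

>>> chr(0xe9).encode()
b'\xc3\xa9'
>>> chr(0xe9).encode('latin-1')
b'\xe9'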