Question or problem about Python programming:
I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?
At the moment I do:
def file_len(fname): with open(fname) as f: for i, l in enumerate(f): pass return i + 1
is it possible to do any better?
How to solve the problem:
You can’t get any better than that.
After all, any solution will have to read the entire file, figure out how many
\n you have, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure… The best solution will always be I/O-bound, best you can do is make sure you don’t use unnecessary memory, but it looks like you have that covered.
One line, probably pretty fast:
num_lines = sum(1 for line in open('myfile.txt'))
I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (
opcount); a simple iteration over the lines in the file (
simplecount); readline with a memory-mapped filed (mmap) (
mapcount); and the buffer read solution offered by Mykola Kharechko (
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
mapcount : 0.465599966049 simplecount : 0.756399965286 bufcount : 0.546800041199 opcount : 0.718600034714
Edit: numbers for Python 2.6:
mapcount : 0.471799945831 simplecount : 0.634400033951 bufcount : 0.468800067902 opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6
Here is the code:
from __future__ import with_statement import time import mmap import random from collections import defaultdict def mapcount(filename): f = open(filename, "r+") buf = mmap.mmap(f.fileno(), 0) lines = 0 readline = buf.readline while readline(): lines += 1 return lines def simplecount(filename): lines = 0 for line in open(filename): lines += 1 return lines def bufcount(filename): f = open(filename) lines = 0 buf_size = 1024 * 1024 read_f = f.read # loop optimization buf = read_f(buf_size) while buf: lines += buf.count('\n') buf = read_f(buf_size) return lines def opcount(fname): with open(fname) as f: for i, l in enumerate(f): pass return i + 1 counts = defaultdict(list) for i in range(5): for func in [mapcount, simplecount, bufcount, opcount]: start_time = time.time() assert func("big_file.txt") == 1209138 counts[func].append(time.time() - start_time) for key, vals in counts.items(): print key.__name__, ":", sum(vals) / float(len(vals))
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you’ll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
def rawcount(filename): f = open(filename, 'rb') lines = 0 buf_size = 1024 * 1024 read_f = f.raw.read buf = read_f(buf_size) while buf: lines += buf.count(b'\n') buf = read_f(buf_size) return lines
Using a separate generator function, this runs a smidge faster:
def _make_gen(reader): b = reader(1024 * 1024) while b: yield b b = reader(1024*1024) def rawgencount(filename): f = open(filename, 'rb') f_gen = _make_gen(f.raw.read) return sum( buf.count(b'\n') for buf in f_gen )
This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:
from itertools import (takewhile,repeat) def rawincount(filename): f = open(filename, 'rb') bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None))) return sum( buf.count(b'\n') for buf in bufgen )
Here are my timings:
function average, s min, s ratio rawincount 0.0043 0.0041 1.00 rawgencount 0.0044 0.0042 1.01 rawcount 0.0048 0.0045 1.09 bufcount 0.008 0.0068 1.64 wccount 0.01 0.0097 2.35 itercount 0.014 0.014 3.41 opcount 0.02 0.02 4.83 kylecount 0.021 0.021 5.05 simplecount 0.022 0.022 5.25 mapcount 0.037 0.031 7.46
You could execute a subprocess and run
wc -l filename
import subprocess def file_len(fname): p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, stderr=subprocess.PIPE) result, err = p.communicate() if p.returncode != 0: raise IOError(err) return int(result.strip().split())