Exercises #7: File Statistics
Task: Write a short program that can count the number of lines, words and characters in a given text file – lemons.txt.
On Unix-like systems (Linux, BSD, MacOS X…) there is a standard wc (word count) utility that does this:
$ wc lemons.txt
13 87 490 lemons.txt
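Here wc reports 13 lines, 87 words and 490 bytes. A correct solution in Python should report 477 characters, even though the file is 490 bytes large.
(Such differences are caused by the encoding of newlines and/or accented characters. Opening a text file in Python 3 decodes its bytes into Unicode characters, using the platform's default encoding unless one is given, and by default translates line endings to '\n'.)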
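The gap between bytes and characters can be reproduced on a small scale. The sketch below is only an illustration: the sample text, the CRLF line endings and the file name sample.txt are assumptions, not the actual contents of lemons.txt.

```python
import os
import tempfile

# Hypothetical sample: two lines with CRLF endings and one accented letter.
raw = "citrón\r\nlemon\r\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: no decoding and no newline translation, so this counts bytes.
    with open(path, "rb") as f:
        n_bytes = len(f.read())

    # Text mode: bytes are decoded to str and '\r\n' is translated to '\n'.
    with open(path, encoding="utf-8") as f:
        n_chars = len(f.read())

print(n_bytes, n_chars)  # 16 13
```

The accented letter takes two bytes in UTF-8 and each CRLF pair collapses into a single '\n', which accounts for the three-byte difference in this sample.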
Solution 1: file.read()
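file.read() will return the whole file as one long string. Then:
- count the number of characters in the string (= in the file) using len();
- count the number of words using str.split(); by default it splits on whitespace, including spaces, tabs and newlines;
- count the number of lines using str.split(delim), with the delimiter set to '\n', representing a newline. Note that if the text ends with a newline, the last item of the result is an empty string, which inflates the count by one; str.splitlines() does not have this problem.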
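The individual steps can be tried on a short string before touching a file. This is a minimal sketch; the sample text is made up and is not the contents of lemons.txt.

```python
# Made-up sample text standing in for the result of file.read().
text = "one two\nthree\n"

n_chars = len(text)             # every character counts, including '\n'
words = text.split()            # splits on any run of whitespace
parts = text.split('\n')        # the trailing '\n' leaves an empty last item
lines = text.splitlines()       # no empty item at the end

print(n_chars)  # 14
print(words)    # ['one', 'two', 'three']
print(parts)    # ['one two', 'three', '']
print(lines)    # ['one two', 'three']
```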
Whole program:
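with open('lemons.txt') as f:
    text = f.read()
n_chars = len(text)
n_words = len(text.split())
# splitlines() avoids the extra empty item that split('\n') yields
# when the text ends with a newline
n_lines = len(text.splitlines())
print(n_chars, n_words, n_lines)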
Solution 2: file.readlines()
file.readlines() will return a list containing all the lines in the file; each line keeps its trailing '\n'. Then:
- add up the number of characters in every line, either with a loop or using the sum(iterable) builtin with a list comprehension;
- add up the number of words in every line in the same way;
- the number of lines is trivial: it is the length of the list.
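A minimal sketch of these steps follows. It uses io.StringIO as a stand-in for an opened text file, and the sample text is made up, so it runs without lemons.txt.

```python
import io

# io.StringIO behaves like a text file opened for reading.
f = io.StringIO("one two\nthree\n")
lines = f.readlines()           # each item keeps its trailing '\n'

# Explicit loop ...
n_chars = 0
for line in lines:
    n_chars += len(line)

# ... or the same idea with sum() and a list comprehension.
n_words = sum([len(line.split()) for line in lines])
n_lines = len(lines)

print(lines)                      # ['one two\n', 'three\n']
print(n_chars, n_words, n_lines)  # 14 3 2
```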
Whole program:
with open('lemons.txt') as f:
    lines = f.readlines()
n_chars = sum([len(x) for x in lines])
n_words = sum([len(x.split()) for x in lines])
n_lines = len(lines)
print(n_chars, n_words, n_lines)
Solution 3: reading file by lines
Using file.readline(), we can read one line at a time. We can use this in a loop, but we need to know when to stop: after the end of the file is reached, read() and readline() return an empty string. (A blank line in the file is not empty; it still contains the '\n' newline character.)
Alternatively, we can just iterate over the file object, which yields the lines one by one. The counting is done the same way as with file.readlines().
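The explicit readline() loop described above could look roughly like this. It is a sketch under assumptions: io.StringIO stands in for the opened file, and the sample text (including a blank line) is made up.

```python
import io

# Stand-in for an opened text file; the second line is blank.
f = io.StringIO("one two\n\nthree\n")

n_chars, n_words, n_lines = 0, 0, 0
while True:
    line = f.readline()
    if line == '':              # empty string means end of file
        break
    # A blank line arrives as '\n', so it does not stop the loop.
    n_chars += len(line)
    n_words += len(line.split())
    n_lines += 1

print(n_chars, n_words, n_lines)  # 15 3 3
```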
Whole program:
n_chars, n_words, n_lines = 0, 0, 0
with open('lemons.txt') as f:
    for line in f:
        n_chars += len(line)
        n_words += len(line.split())
        n_lines += 1
print(n_chars, n_words, n_lines)
Remarks
While using file.read() or file.readlines() does work, these methods load the whole file into memory at once, which:
- may crash, if there is not enough memory
- will be slower (probably)
As we don't actually need all the data at once, using multiple file.readline() calls or iterating over the file object are both technically superior, with iteration being more elegant.