Python > Exercises #7: File Statistics

Task: Write a short program that can count the number of lines, words and characters in a given text file – lemons.txt.

On Unix-like systems (Linux, BSD, MacOS X…) there is a standard wc (word count) utility that does this:

$ wc lemons.txt
 13  87 490 lemons.txt

A correct solution in Python should report 477 characters, even though the file is 490 bytes in size.
(Such differences are caused by the encoding of newlines and/or accented characters. Opening a text file in Python 3 decodes its bytes into Unicode characters and, by default, translates all line endings to '\n'.)
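
To make the byte/character difference concrete, here is a minimal sketch. The sample text and the temporary file are made up for illustration and are not the contents of lemons.txt; writing the file as UTF-8 with '\r\n' line endings is an assumption about how such a size gap can arise.

```python
import os
import tempfile

# Hypothetical sample text: one accented character and two line breaks.
sample = "citrón\nlemon\n"

# Write the sample as UTF-8 with Windows-style line endings;
# newline="" stops Python from translating the "\r\n" we write.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write(sample.replace("\n", "\r\n"))
    path = tmp.name

# Binary mode returns the raw bytes stored on disk.
with open(path, "rb") as f:
    n_bytes = len(f.read())

# Text mode decodes the bytes and translates "\r\n" to "\n".
with open(path, encoding="utf-8") as f:
    n_chars = len(f.read())

os.remove(path)
print(n_bytes, n_chars)   # 16 13
```

Each '\r\n' pair takes two bytes but is read as one character, and 'ó' takes two bytes in UTF-8, which accounts for the three extra bytes in this sample.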

Solution 1: file.read()

  1. file.read() will return the whole file as one long string:

    with open('lemons.txt') as f:
        text = f.read()
  2. count the number of characters in the string (and therefore in the file):

    n_chars = len(text)
  3. count the number of words using str.split(); by default it splits by whitespace, including spaces, tabs and newlines:

    n_words = len(text.split())
  4. count the number of lines by counting the newline characters with str.count(); splitting on '\n' would report one line too many for a file that ends with a newline, because the split leaves a trailing empty string:

    n_lines = text.count('\n')

Whole program:

with open('lemons.txt') as f:
    text = f.read()

n_chars = len(text)
n_words = len(text.split())
n_lines = text.count('\n')

print(n_chars, n_words, n_lines)
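
The line count deserves a quick check, because text that ends with a newline behaves differently under the various string methods. The following sketch compares three ways of counting on a made-up two-line string (the sample is hypothetical, not taken from lemons.txt):

```python
text = "first line\nsecond line\n"   # hypothetical two-line sample

by_split = len(text.split('\n'))      # the final '\n' leaves a trailing ''
by_count = text.count('\n')           # counts newline characters, like wc
by_splitlines = len(text.splitlines())  # no trailing '' is produced

print(by_split, by_count, by_splitlines)   # 3 2 2
```

Only the plain split on '\n' overcounts here; the other two agree with what wc would report for such a file.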

Solution 2: file.readlines()

  1. file.readlines() will return a list containing all the lines in the file (each line keeps its trailing '\n'):

    with open('lemons.txt') as f:
        lines = f.readlines()
  2. add up the number of characters in every line:

    n_chars = 0
    for line in lines:
        n_chars += len(line)

    Alternatively, we can use the sum(iterable) builtin with a list comprehension:

    n_chars = sum([len(x) for x in lines])
  3. add up the number of words in every line:

    n_words = sum([len(x.split()) for x in lines])
  4. the number of lines is trivial:

    n_lines = len(lines)

Whole program:

with open('lemons.txt') as f:
    lines = f.readlines()

n_chars = sum([len(x) for x in lines])
n_words = sum([len(x.split()) for x in lines])
n_lines = len(lines)

print(n_chars, n_words, n_lines)
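
To see why summing the line lengths gives the full character count, the sketch below runs the same steps on an in-memory file. io.StringIO stands in for a real file here and its content is invented for the example; the generator expressions are an optional variation that avoids building temporary lists.

```python
import io

# io.StringIO stands in for an open text file (hypothetical content).
f = io.StringIO("one two\n\nthree\n")
lines = f.readlines()

print(lines)   # ['one two\n', '\n', 'three\n']

# Generator expressions: same result as the list comprehensions,
# but no intermediate list is created.
n_chars = sum(len(x) for x in lines)
n_words = sum(len(x.split()) for x in lines)
n_lines = len(lines)

print(n_chars, n_words, n_lines)   # 15 3 3
```

Because every element still ends in '\n', the newline characters are included in the total, and the blank line contributes one character and zero words.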

Solution 3: reading file by lines

  1. Using file.readline(), we can read one line at a time. We can use this in a loop, but we need to know when to stop: once the end of the file is reached, read() and readline() return an empty string. (A blank line is not empty: it still contains the '\n' newline character.)

    with open('lemons.txt') as f:
        while True:
            line = f.readline()

            if not line:
                break

            # do something

    Alternatively, we can just iterate over the file object, which yields one line per iteration:

    with open('lemons.txt') as f:
        for line in f:
            pass  # do something with the line
  2. do the counting as with file.readlines():

    n_chars += len(line)
    n_words += len(line.split())
    n_lines += 1
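
The end-of-file test in the readline() loop can be tried out without a real file. This is a small sketch using io.StringIO with invented content (a blank middle line and no final newline), just to show which values readline() returns:

```python
import io

# Hypothetical content: blank middle line, no newline after the last line.
f = io.StringIO("a\n\nb")

results = []
while True:
    line = f.readline()
    if not line:          # '' is falsy, which signals the end of the file
        break
    results.append(line)

after_eof = f.readline()  # further calls keep returning ''

print(results)            # ['a\n', '\n', 'b']
print(repr(after_eof))    # ''
```

The blank line comes back as '\n', which is truthy, so the loop does not stop early.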

Whole program:

n_chars, n_words, n_lines = 0, 0, 0

with open('lemons.txt') as f:
    for line in f:
        n_chars += len(line)
        n_words += len(line.split())
        n_lines += 1

print(n_chars, n_words, n_lines)

Remarks

While using file.read() or file.readlines() does work, these methods load the whole file into memory at once, which becomes wasteful, or even impossible, for very large files.

As we don’t actually need all the data at once, using multiple file.readline() calls or iterating over the file object are both technically superior, with iteration being the more elegant option.
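
One possible way to package the iteration approach is a small function that accepts any iterable of lines, so only one line is held in memory at a time. The name file_stats and the sample text are hypothetical and only serve this sketch:

```python
import io

def file_stats(lines):
    """Return (n_chars, n_words, n_lines) for any iterable of text lines."""
    n_chars = n_words = n_lines = 0
    for line in lines:
        n_chars += len(line)
        n_words += len(line.split())
        n_lines += 1
    return n_chars, n_words, n_lines

# An in-memory file with made-up content stands in for lemons.txt.
sample = io.StringIO("sour yellow fruit\nlemon tree\n")
stats = file_stats(sample)
print(stats)   # (29, 5, 2)
```

With a real file, the same function could be called as file_stats(f) inside a with open('lemons.txt') as f: block.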