Tomáš Kuzma

Python > Exercises #7: File Statistics

Task: Write a short program that can count the number of lines, words and characters in a given text file – lemons.txt.

On Unix-like systems (Linux, BSD, MacOS X…) there is a standard wc (word count) utility that does this:

```
$ wc lemons.txt
 13  87 490 lemons.txt
```

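A correct solution in Python should report 477 characters, although the file is 490 bytes long.
(Such differences come from the encoding of newlines and/or accented characters. Opening a text file in Python 3 yields decoded Unicode strings with '\n' line breaks; the default encoding depends on the platform and Python version unless one is passed explicitly, e.g. open('lemons.txt', encoding='utf-8').)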
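The gap between bytes and characters can be reproduced on a small string. The sample text below is made up for illustration and is not the content of lemons.txt; the sketch assumes the file is stored as UTF-8 with Windows-style '\r\n' line endings.

```python
# Hypothetical sample, not the contents of lemons.txt.
sample = "citrón\nžlutý\n"

# What would be stored on disk (assumed: UTF-8 with '\r\n' line endings).
raw = sample.replace("\n", "\r\n").encode("utf-8")

n_chars = len(sample)  # characters, as Python counts them in a str
n_bytes = len(raw)     # bytes, as wc counts them

print(n_chars, n_bytes)  # 13 18
```

Each accented letter takes two bytes in UTF-8 and each '\r\n' pair takes two bytes, while Python counts each of them as a single character.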

Solution 1: `file.read()`

file.read() will return the whole file as one long string:

```
with open('lemons.txt') as f:
    text = f.read()
```

count the number of characters in the string (that is, in the file):
```
n_chars = len(text)
```
count the number of words using str.split(); by default it splits by whitespace, including spaces, tabs and newlines:
```
n_words = len(text.split())
```
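To see why the default separator matters, the sketch below compares str.split() with str.split(' ') on a made-up line containing repeated spaces and a tab.

```python
line = "  sour   yellow\tlemons\n"  # hypothetical line with messy whitespace

default_words = line.split()   # runs of any whitespace act as one separator
by_space = line.split(" ")     # every single space is a separator

print(default_words)  # ['sour', 'yellow', 'lemons']
print(by_space)       # ['', '', 'sour', '', '', 'yellow\tlemons\n']
```

With an explicit separator, empty strings appear between consecutive spaces and the tab is not treated as a word boundary, so the length of the result no longer equals the number of words.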
count the number of lines using str.splitlines(). (Splitting with str.split('\n') would overcount by one whenever the file ends with a newline, because the result then ends with an empty string.)
```
n_lines = len(text.splitlines())
```
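One detail worth checking when counting lines is what happens at the final newline. The sketch below uses a made-up two-line string and compares three ways of counting.

```python
text = "first\nsecond\n"  # hypothetical text: two lines, ending with a newline

parts = text.split("\n")   # keeps an empty string after the last '\n'
lines = text.splitlines()  # no trailing empty string
newlines = text.count("\n")

print(parts)     # ['first', 'second', '']
print(lines)     # ['first', 'second']
print(newlines)  # 2
```

wc reports the number of newline characters, so text.count('\n') matches it exactly; str.splitlines() gives the same number whenever the file ends with a newline.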

Whole program:

```
with open('lemons.txt') as f:
    text = f.read()

n_chars = len(text)
n_words = len(text.split())
# splitlines() avoids counting an extra empty "line" after the final newline
n_lines = len(text.splitlines())

print(n_chars, n_words, n_lines)
```

Solution 2: `file.readlines()`

file.readlines() will return a list containing all the lines in the file, each still ending with its '\n':
```
with open('lemons.txt') as f:
    lines = f.readlines()
```
add up the number of characters in every line:
```
n_chars = 0
for line in lines:
    n_chars += len(line)
```
Alternatively, we can use the sum(iterable) builtin with a list comprehension:
```
n_chars = sum([len(x) for x in lines])
```
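sum() accepts any iterable, so the square brackets can also be dropped to pass a generator expression, which avoids building a temporary list. A small sketch with a made-up list standing in for the result of f.readlines():

```python
lines = ["sour lemons\n", "sweet oranges\n"]  # hypothetical stand-in for f.readlines()

with_list = sum([len(x) for x in lines])  # builds the list of lengths first
with_gen = sum(len(x) for x in lines)     # adds lengths as they are produced

print(with_list, with_gen)  # 26 26
```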

add up the number of words in every line:

```
n_words = sum([len(x.split()) for x in lines])
```

the number of lines is trivial:
```
n_lines = len(lines)
```

Whole program:

```
with open('lemons.txt') as f:
    lines = f.readlines()

n_chars = sum([len(x) for x in lines])
n_words = sum([len(x.split()) for x in lines])
n_lines = len(lines)

print(n_chars, n_words, n_lines)
```

Solution 3: reading file by lines

Using file.readline(), we can read one line at a time. We can use this in a loop, but we need to know when to stop: once the end of the file is reached, read() and readline() return an empty string. (A blank line in the file is not empty: it still contains the '\n' newline character.)
```
with open('lemons.txt') as f:
    while True:
        line = f.readline()

        if not line:
            break

        # do something
```
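The end-of-file behaviour can be observed without a real file. In this sketch io.StringIO stands in for an open text file, and its content is made up.

```python
import io

# io.StringIO behaves like a text file opened for reading (hypothetical content).
f = io.StringIO("lemon\n\nlime\n")

# Call readline() more times than there are lines.
results = [f.readline() for _ in range(5)]

print(results)  # ['lemon\n', '\n', 'lime\n', '', '']
```

The blank second line comes back as '\n', which is truthy, while every call past the end returns '', which is falsy; that is why `if not line` is a safe stopping test.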
Alternatively, we can just iterate the file object, yielding lines:
```
with open('lemons.txt') as f:
    for line in f:
        pass  # do something
```

do the counting as with file.readlines():

```
n_chars += len(line)
n_words += len(line.split())
n_lines += 1
```

Whole program:

```
n_chars, n_words, n_lines = 0, 0, 0

with open('lemons.txt') as f:
    for line in f:
        n_chars += len(line)
        n_words += len(line.split())
        n_lines += 1

print(n_chars, n_words, n_lines)
```

Remarks

While using file.read() or file.readlines() does work, these methods load the whole file into memory at once, which:

may fail (for example with a MemoryError) if the file does not fit into the available memory
will probably be slower for large files

As we don't actually need all the data at once, making repeated file.readline() calls or iterating over the file object are both technically superior, with iteration being the more elegant of the two.
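The iteration approach can also be wrapped in a reusable function. This is only a sketch: the name file_stats, the explicit encoding parameter, and the temporary test file are assumptions added for illustration, and the result is returned in the same order wc prints it (lines, words, characters).

```python
import os
import tempfile


def file_stats(path, encoding="utf-8"):
    """Return (n_lines, n_words, n_chars), reading the file one line at a time."""
    n_lines = n_words = n_chars = 0
    with open(path, encoding=encoding) as f:
        for line in f:  # only one line is held in memory at a time
            n_lines += 1
            n_words += len(line.split())
            n_chars += len(line)
    return n_lines, n_words, n_chars


# Usage on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="\n",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write("sour lemons\nsweet oranges\n")

stats = file_stats(tmp.name)
os.remove(tmp.name)

print(stats)  # (2, 4, 26)
```

Passing the encoding explicitly keeps the character count independent of the platform default.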
