Tomáš Kuzma

Python > Exercises #4: Counting words

Task 1: Count all the words

Read a text from the keyboard, then display the number of words in it.

Viable solution:

line  = input()      # read from keyboard
words = line.split() # split into words
count = len(words)   # count the words
print(count)

Or in one line:

print(len(input().split()))
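Both versions rely on how split() behaves when called without arguments. A minimal sketch, using a made-up sample string in place of input():

```python
# Hypothetical sample text standing in for input()
line = "  the quick   brown\tfox  "

# With no arguments, split() cuts on any run of whitespace
# and ignores leading and trailing whitespace.
words = line.split()
print(words)        # ['the', 'quick', 'brown', 'fox']
print(len(words))   # 4

# A line containing only whitespace therefore yields zero words.
empty_count = len("   ".split())
print(empty_count)  # 0
```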

Task 2: Count distinct words

Display only the number of distinct words. Count words that differ only in letter case, such as 'example', 'Example' and 'exAMPLE', as one word.

One possible approach is to build a list of unique (lower-cased) words: for each word of the input, check whether the list already contains it, and append it only if it does not.

words = input().split()

seen = []

for x in words:
    if x.lower() not in seen:
        seen.append(x.lower())

print(len(seen))

(The convenient in operator could also be emulated by list.count(value) or by a nested for loop with an if condition; both variants are decidedly less straightforward.)
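For comparison, here is a rough sketch of the two alternatives mentioned above. The sample text and the names seen_a, seen_b and found are made up for illustration:

```python
# Hypothetical sample text standing in for input()
words = "example Example exAMPLE other".split()

# Variant 1: list.count(value) returns how many times the value occurs
seen_a = []
for x in words:
    if seen_a.count(x.lower()) == 0:
        seen_a.append(x.lower())

# Variant 2: a nested for loop with an if condition
seen_b = []
for x in words:
    found = False
    for s in seen_b:
        if s == x.lower():
            found = True
    if not found:
        seen_b.append(x.lower())

print(len(seen_a), len(seen_b))  # 2 2
```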

The in operator is slow for lists, because it has to compare the value against the elements one by one, so it would be better to use a set. As a bonus, a set’s .add() method never creates duplicates, so we can skip the membership check altogether:

words = input().split()

seen = set()

for x in words:
    seen.add(x.lower())

print(len(seen))

Or in one line:

print(len({x.lower() for x in input().split()}))

Task 3: Count word occurrences

For each word in the input, output the number of its occurrences. Sort the output alphabetically by word.

We’ll maintain a dict(), mapping each word to the number of its occurrences. Processing the input word by word, in each iteration we’ll increment the corresponding entry in the dictionary. There’s only one problem: at the beginning, the number of occurrences for each word is technically not zero, but undefined. Accessing d[word] for a word that is not in the dictionary yet raises a KeyError.
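The problem can be reproduced in a few lines. This is only a sketch: the try/except is there just to show the error without stopping the program, and the name error_key is made up.

```python
d = dict()

try:
    # += first reads d['hello'], which does not exist yet
    d['hello'] += 1
except KeyError as e:
    error_key = e.args[0]
    print('KeyError for', error_key)

# Nothing was stored in the dictionary
print(len(d))  # 0
```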

The most straightforward solution is to test explicitly (e.g. with the in operator) whether the word is already in the dictionary. If it is not, this is its first occurrence and we store 1; otherwise the word was encountered before and we increment its count:

words = input().split()
words = [x.lower() for x in words]

d = dict()

for x in words:
    if x not in d:
        d[x]  = 1
    else:
        d[x] += 1

for k, v in sorted(d.items()):
    print(k + ': ' + str(v))

Alternatively, we could replace dict[key] access with the dict.get(key, default) method, which, when the key is not in the dictionary, returns the specified default value:

words = input().split()
words = [x.lower() for x in words]

d = dict()

for x in words:
    d[x] = d.get(x, 0) + 1

for k, v in sorted(d.items()):
    print(k + ': ' + str(v))

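Furthermore, the collections module contains defaultdict(type), a drop-in replacement for dict() that supplies a default value on d[key] access as well: when the key is missing, it calls type() with no arguments, stores the result under that key, and returns it.
(The defaults are 0 for int, 0.0 for float, False for bool, and an empty value for container types such as list, set, dict or str.)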

import collections

words = input().split()
words = [x.lower() for x in words]

d = collections.defaultdict(int)

for x in words:
    d[x] += 1

for k, v in sorted(d.items()):
    print(k + ': ' + str(v))
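The default values can be checked directly by calling the types with no arguments. The sketch below also shows one side effect worth knowing about: merely reading a missing key from a defaultdict creates the entry, while .get() does not. The names before, after and missing are made up.

```python
import collections

# Calling a type with no arguments gives its default value
print(int(), float(), bool(), list())  # 0 0.0 False []

d = collections.defaultdict(list)

before = 'a' in d      # False: nothing stored yet
d['a']                 # a mere read creates the entry
after = 'a' in d       # True
print(before, after, d['a'])  # False True []

# .get() does not trigger the default and stores nothing
missing = d.get('b')
print(missing, 'b' in d)      # None False
```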

Task 4: Order by most used

Print the words ordered by frequency, from the most common to the least common words.

In previous examples, sorted(d.items()) sorts the items alphabetically. (Actually, it sorts the key-value pairs from d.items() lexicographically, that is, first by the key and, in case of a tie (which won’t happen in a dictionary, since its keys are unique), by the value.)
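A small sketch of how pairs compare, using a made-up list of (word, count) tuples:

```python
# Hypothetical (word, count) pairs
pairs = [('pear', 2), ('apple', 5), ('fig', 1)]

# Tuples are compared element by element, so the word decides first
result = sorted(pairs)
print(result)  # [('apple', 5), ('fig', 1), ('pear', 2)]

# The second element matters only when the first elements are equal
print(('apple', 5) < ('fig', 1))  # True
print(('fig', 1) < ('fig', 3))    # True
```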

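To sort by the value first (and then by the key), we can build a list of swapped (value, key) pairs, sort it, and then reverse it so that the highest counts come first. Note that the reversal affects ties as well: words with the same count come out in reverse alphabetical order.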

import collections

words = input().split()
words = [x.lower() for x in words]

d = collections.defaultdict(int)

for x in words:
    d[x] += 1

a = [(v, k) for k, v in d.items()]  # swap to (count, word) pairs
a.sort()                            # ascending by count, then by word
a.reverse()                         # highest count first

for v, k in a:
    print(k + ': ' + str(v))

Sneak peek: using key functions

The list.sort(…) method and the sorted(…) function also accept two interesting keyword parameters:

reverse: change the sort order (default: low to high, reverse=True: high to low)
key: sort the items using a value derived from an item by a key function
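A short illustration of both parameters, using a made-up word list and the built-in len as the key function:

```python
words = ['pear', 'fig', 'banana', 'kiwi']

by_alpha = sorted(words)
by_alpha_desc = sorted(words, reverse=True)
by_length = sorted(words, key=len)
by_length_desc = sorted(words, key=len, reverse=True)

print(by_alpha)        # ['banana', 'fig', 'kiwi', 'pear']
print(by_alpha_desc)   # ['pear', 'kiwi', 'fig', 'banana']
print(by_length)       # ['fig', 'pear', 'kiwi', 'banana']
print(by_length_desc)  # ['banana', 'pear', 'kiwi', 'fig']
```

Items with equal keys ('pear' and 'kiwi' both have length 4) keep their original relative order, even with reverse=True.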

operator.itemgetter(index) creates a ready-made key function that grabs the subitem at the given index (below, index 1 is the count):

import collections
import operator

words = input().split()
words = [x.lower() for x in words]

d = collections.defaultdict(int)

for x in words:
    d[x] += 1

for k, v in sorted(d.items(), key=operator.itemgetter(1), reverse=True):
    print(k + ': ' + str(v))

The same can be done, somewhat more arcanely, without importing operator by using a lambda function:

import collections

words = input().split()
words = [x.lower() for x in words]

d = collections.defaultdict(int)

for x in words:
    d[x] += 1

for k, v in sorted(d.items(), key=lambda x: x[1], reverse=True):
    print(k + ': ' + str(v))
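With key=... and reverse=True, words with the same count stay in the order in which they were first inserted into the dictionary. If ties should instead be listed alphabetically, one possible approach (a sketch, not part of the original exercise) is a key function that returns a pair: the negated count, then the word. The sample text below stands in for input().

```python
import collections

# Hypothetical sample text standing in for input()
words = "b a c a b d".split()

d = collections.defaultdict(int)

for x in words:
    d[x] += 1

# Negating the count sorts high counts first;
# the word then breaks ties in ascending alphabetical order.
ordered = sorted(d.items(), key=lambda kv: (-kv[1], kv[0]))

for k, v in ordered:
    print(k + ': ' + str(v))
# a: 2
# b: 2
# c: 1
# d: 1
```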