Lab 10: Word Frequency Counter

Goals for this lab

By the time you have completed this lab, you should:

  • know what a hash table is
  • understand that the C++ standard template library (STL) provides implementations of the data structures we discussed in this class (use them!)
  • have experience working with a few STL classes

Lab pairing

For this lab you may work with a partner. To facilitate your submissions as a pair you MUST pair up using the submission system. When you visit the project page from where you can view all your submission you should see a link to a page where you can join a group. Of the two people in the group, one of you must invite the other, and the other must confirm the invitation. Once you group up, any submissions that either you, or your partner have already made will be visible to each other in addition to all future submissions.

Lab preparation

After logging in, create the directory for this lab:

mkdir -p cs24/lab10

Change into the lab directory and copy the files you will need for this lab:

cd cs24/lab10
cp ~bboe/public_html/cs24_f13/code/lab10/* .

Hash table

Definition

A hash table is an incredibly useful container data structure. A hash table is an unordered mapping between a unique key and a value. The operations insert, contains, and remove each run in constant time on average. The speed is acheived by utilizing a clever storage mechanism in combination with an intentional abundance of storage space.

Internal structure

In a nutshell, values (and their associated keys) stored within a hash table are stored at some position in an array. That position is calculated by passing the key through a hash function which returns values ranging from zero to one less than the size of the array.

For example, assume we have a small hash table to identify if a given number is odd or even. It will have two records in its array:

0 "Even"
1 "Odd"

Say we have the hash function return key % 2 == 0 that returns 0 if the key is divisible by 2, and 1 otherwise. Thus given any number we can look up its value in the hash table, which in this contrived example tells us whether the number is odd or even.

In the previous example we use our simple example to fetch or access a value. Hash tables can also be utilized to store or update a value. For instance we can also use hash tables to count the number of odds and evens in some given sequence.

Now our table array has the following records:

0 # evens
1 # odds

If we are encountering an even or an odd at some point in the code, we can simply increase corresponding number in the table via array[hash_result]++;

In this case, our key maps to a value that corresponds to the number of similar (odd or even) numbers.

In class we will discuss in more depth the implementation details of hash tables. However, for the lab you will just need to use an existing hash table. Hash tables, in general, can store many more than 2 items, in fact they can store an unbounded number of items with the right implementation.

Standard Template Library (STL)

Throughout this class we have implemented a number of data structures. When we started working with C++ we made those data structures generic by templating them. As it turns out, in the real world you seldom need to write the implementation for a data structure as they are often provided either directly by the language, or through a library for the language. In C++, the standard template library, or STL, provides some form of all the data structures we discuss in this class (among others).

The following are some you will want to use in this lab.

vector

One such ADT is the vector class that allows you to create a dynamic array of an arbitrary type and provides methods for insertion and removal (similar to our array-based List implementation). The following is an example utilizing a vector of integers:

#include <iostream>
#include <vector>
using namespace std;

int main(){
    vector<int> nums(3, 5);  // create array of 3 ints, each initialized to 5
                             // nums = [5, 5, 5]
    nums.push_back(6);       // add 6 in the end
                             // nums = [5, 5, 5, 6]
    nums.pop_back();         // remove the last element
                             // nums = [5, 5, 5]
    for(vector<int>::iterator i = nums.begin(); i != nums.end(); ++i)
        cout << *i << endl;  // outputs the content of each item in nums
    return 0;
}

Aside from the vector there is one new concept in the above example code: the iterator. In general, an iterator is a pointer-like object that is used to traverse through elements in ADTs that support them. They are pointer-like because the value stored at the element is obtainable by dereferencing the iterator, and incrementing, or decrementing (for iterators that are bidirectional) the iterator by one will advance to the next or previous item in the ADT.

In the example above, to iterate through a vector of integers we declare (in the for loop) vector<int>::iterator i. We say that we start from the first element i = vector.begin() and we print numbers until i is pointing to the “end” element, which is one after the last in the vector i != nums.end().

string

We’ve already seen strings in this class. However, for completeness reference the following example if needed.

#include <iostream>
#include <string>
using namespace std;

int main() {
    // Assume the standard input stream contains "abc qwe\n345\n"
    string s1, s2, s3;
    cin >> s1;  // s1 = "abc"
    cin >> s2;  // s2 = "qwe"
    cin >> s3;  // s3 = "345"
    s1 = s1 + s2 + s3;
    cout << s1 << endl;  // abcqwe345\n
    return 0;
}

unordered_map (hash table)

One more example is unordered_map, which is an implementation of a hash table that maps a key to value.

#include <iostream>
#include <unordered_map>
using namespace std;

int main() {
    int value;
    string key("abc");
    unordered_map<string, int> hash_table;  // initialize the hash table

    // test if a key is contained in the hash table
    cout << (hash_table.find(key) != hash_table.end()) << endl;  // 0

    // Accessing an item that was not previously contained will add it
    // using the default constructor (initialize to 0 for type int)
    cout << hash_table[key] << endl;  // 0

    // test if a key is contained in the hash table
    cout << (hash_table.find(key) != hash_table.end()) << endl;  // 1

    hash_table[key]++;  // Increment the key
    cout << hash_table[key] << endl;  // 1

    return 0;
}

The first thing to note in this example is that the unordered_map class requires two template arguments. The first represents the type of the key, in this case a string, and the second represents the type of the value, in this case an int.

If we want to iterate through the table (you will have to do some form of this for the lab), then we have to create appropriate iterator unordered_map<string, int>::iterator it = hash_table.begin();

Each element returned by the hash table iterator is of type pair that contains two attributes: first, the key; and second, the value. To refer separately to the key try it->first, for the value try it->second.

NOTE: The unordered_map class does not exist in older versions of C++. Thus in order to compile this program (and your lab10 program) you will need to add -std=c++11 to the clang compile line.

sort

Finally, the STL provides a numper of often used/needed algorithms. For example, in this lab we will be using the sort function. The sort function will sort elements in ascending order, unless provided a sorting function that specifies how individual elements should be compared.

For example, if we wanted to output the elements of our hash table in reverse (descending) order based on the key we would do the following:

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

bool sort_reverse(const pair<string, int> &a, const pair<string, int> &b) {
  return a.first > b.first;
}

int main(){
  // Initialize the hash table with some fun values
  unordered_map<string, int> ht = {{"Bananas", 10}, {"Apples", 16},
                                   {"Pears", 8}, {"Zebras", 1}};

  // unorderd_map iterators cannot be sorted so we must copy its contents
  // into a something sortable -- here we use a vector
  vector<pair<string, int> > values(ht.begin(), ht.end());
  // sort all of the values in the vector using the function `reverse` for
  // comparisons
  sort(values.begin(), values.end(), sort_reverse);

  // Output the items
  // The `auto` keyword in C++11 will automatically determine its type.
  // In this case its type is: vector<pair<string, int> >::iterator
  for(auto it = values.begin(); it != values.end(); ++it)
    cout << it->first << " " << it->second << endl;

  return 0;
}

Writing frequency.cpp

In this lab you will write a program that calculates number of occurrences of unique words given via standard input. The program should print out statistics of the encountered words. The order of the output should be such that the most frequently used words are listed first, and words of the same frequency are sorted according to however strings are sorted by default (lexicographically).

Words can include punctuation and are separated by any form of whitespace. Using cin >> term where term is a string will handle the whitespace issue for you. All comparisons should be case insensitive, meaning that IS, Is, iS, and is should all be counted as one word, is. You will need to write a function that converts a string to its lowercase form. You probably want to use the tolower function to help you with that.

For example echo "ASD aSd asd sdf dfg." | ./a.out should produce the output:

3 asd
1 dfg.
1 sdf

It is also worth noting that the result of cin >> term can be used as a boolean expression indicating whether or not the the input of type term succeeded. The expression will evaluate to false when EOF is reached.

Submitting the assignment

Only one person in a group need submit the assignment but make sure that both you and your partner can view the submission on the submission site. If you cannot, you need to complete the making a group process. Please review the submission instructions as needed. Note that you may resubmit this assignment as many times as necessary up until the deadline.

 

Template design by Andreas Viklund