Yesterday I talked about how you can ensure the words you are analyzing are spelled correctly. Today, I’ll talk about one way to analyze those words in order to find meaning in a corpus of text, such as open-ended survey questions.
Counting the number of times each word appears in a survey response is probably the first thing that comes to mind when thinking of text analysis. It is a great statistic to calculate because the frequency of a word’s use is likely to be a good indicator of its importance. However, the downside of just counting terms is that a lot of common words, like “the” and “a”, appear extremely often. If all you do is count term frequencies, you have to address this either by filtering your list manually or by using a list of stop words to filter out common words (though knowing what should go in that list is part of the challenge). Is there an automated way to solve this?
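To make the problem concrete, here is a minimal sketch of plain term counting. It is written in Python purely for illustration; the two responses and the `tokenize` helper are made up for this example and are not taken from any real survey.

```python
import re
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = [
    "The data team needs better tools",
    "The tools are fine but the data is messy",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Count every word across all responses.
term_counts = Counter(word for r in responses for word in tokenize(r))

# The most frequent term is a common word that says little about the topic.
print(term_counts.most_common(1))
```

Even in this tiny example the top term is “the”, which is exactly the kind of word you would otherwise have to filter out by hand.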
Inverse Document Frequency
One way to balance the frequency of terms in each survey response is to compute another metric, called the inverse document frequency. The crux of this metric is whether a term shows up in most survey responses or only in a few. The more responses a term appears in, the lower its inverse document frequency will be; conversely, the rarer a term is across the survey responses, the higher its inverse document frequency will be.
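As a rough sketch, one common way to define the inverse document frequency of a term is the logarithm of the total number of responses divided by the number of responses that contain the term. Other variants (with smoothing, for example) exist, so treat the formula below as one reasonable choice rather than the only one. The responses and helper names are again hypothetical, and the code is Python for illustration.

```python
import math
import re

# Hypothetical survey responses, invented for illustration only.
responses = [
    "The data team needs better tools",
    "The tools are fine but the data is messy",
    "The survey was too long",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def inverse_document_frequency(docs):
    """Return {term: log(N / df)}, where df is the number of docs containing the term."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Use a set so a term counts at most once per response.
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(responses)
for term in ("the", "data", "survey"):
    print(term, round(idf[term], 3))
```

Under this definition, a term that appears in every response (“the”) gets a value of zero, a term in two of the three responses (“data”) gets a small value, and a term in only one response (“survey”) gets the largest value.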
Putting It All Together
So now we have two metrics for each term: its (1) term frequency and (2) inverse document frequency. Each of these metrics serves as a counterbalance to the other. By multiplying them together, you get a composite metric, called the term frequency-inverse document frequency (creative, I know), or tf-idf for short. Terms that have a high tf-idf are the most important ones to consider, while the lowest-scoring terms could be used to create a list of stop words.
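The multiplication can be sketched in a few lines. This is an illustrative Python example, not the R workflow used with the real survey data: it assumes raw counts for term frequency and log(N / df) for inverse document frequency, and the responses and the `tf_idf` helper are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = [
    "The data team needs better tools",
    "The tools are fine but the data is messy",
    "The survey was too long",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return one {term: tf * idf} dict per response."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of responses each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this response
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return scores

scores = tf_idf(responses)

# Show the second response's terms, highest tf-idf first.
for term, score in sorted(scores[1].items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>6} {score:.3f}")
```

In the second response, “the” appears twice but scores zero because it occurs in every response, which is how tf-idf pushes common words to the bottom without a hand-built stop list. With only three short responses, words like “are” and “but” still score high simply because they are rare in this toy corpus; a larger set of responses is needed for the scores to be meaningful.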
Tomorrow I’ll put this all into practice in R, using responses from the Data Driven Survey. I’ll use term frequency and tf-idf to determine the most important words and show you how to visualize them using a word cloud.