Text Analysis: Correcting Misspelled Words

This is part 2 of our series on Text Analysis; previous segments are available in our archives.

Have you ever wondered how exactly a spell checker works? I mean, I find it pretty amazing that I can just type a rough approximation of a word on my computer or phone and then have it automatically change to the correct word (most of the time) without having to stop and adjust things. Today I’ll talk to you about one of the key algorithms behind a spell checker so you can understand a bit more about what is going on behind the scenes. You don’t need to build your own spell checker for everyday words, but algorithms like the one described today can be useful for correcting technical terms or acronyms specific to your area of focus.

How are words different?

Understanding how words can vary is an important first step to computing the distance between them. There are four commonly used edit operations, corresponding to more than 80% of human misspellings [1], that describe how to change one word into another (each is illustrated in the short sketch after this list):

  • Insertion of a single letter, for example “ata” becomes “data” with the insertion of one “d”.
  • Deletion of a single letter, for example “datta” becomes “data” with the deletion of one “t”.
  • Substitution of a single letter, for example “dara” becomes “data” with one substitution of “t” for “r”.
  • Transposition of two adjacent letters, for example “dtaa” becomes “data” by swapping the “t” and first “a”.
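To make these four operations concrete, here is a toy Python sketch (not part of any real spell checker) that performs each edit directly on the misspelled string:

    # Each edit is a one-character operation on the string.
    word = "ata"
    print(word[:0] + "d" + word)                     # insertion      -> "data"

    word = "datta"
    print(word[:3] + word[4:])                       # deletion       -> "data"

    word = "dara"
    print(word[:2] + "t" + word[3:])                 # substitution   -> "data"

    word = "dtaa"
    print(word[:1] + word[2] + word[1] + word[3:])   # transposition  -> "data"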

Measuring the distance

One way to measure the distance between two words using these four operations is the Damerau–Levenshtein distance. The crux of the algorithm is to compute the minimal number of these operations needed to transform one string into another; a spell checker applies it by looking for recognized words (from a dictionary) that are a small distance away from what you typed. Full-powered spell checkers build on this distance to probabilistically determine which of the candidate words within a few edits is the most likely correction.
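To show what that computation looks like, here is a minimal Python sketch of the widely used “optimal string alignment” variant of the Damerau–Levenshtein distance. The function name dl_distance is my own choice for illustration; an off-the-shelf library implementation will be faster and more thoroughly tested.

    def dl_distance(a: str, b: str) -> int:
        """Minimal number of insertions, deletions, substitutions, and
        adjacent transpositions needed to turn string a into string b."""
        m, n = len(a), len(b)
        # d[i][j] is the distance between the first i characters of a
        # and the first j characters of b.
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i          # delete all i characters of a
        for j in range(n + 1):
            d[0][j] = j          # insert all j characters of b
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(
                    d[i - 1][j] + 1,         # deletion
                    d[i][j - 1] + 1,         # insertion
                    d[i - 1][j - 1] + cost,  # substitution (or match)
                )
                # transposition of two adjacent letters
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[m][n]

    print(dl_distance("dtaa", "data"))   # 1: one transposition
    print(dl_distance("dara", "data"))   # 1: one substitution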

The good news is that this algorithm, and spell checking in general, have been implemented in a number of different tools and languages, so you don’t need to do it yourself. My most recent use of the Damerau–Levenshtein distance was an implementation in R to correct a continuous stream of raw, domain-specific terms that a traditional spell checker could not handle.
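A bare-bones version of that kind of corrector just picks the dictionary term with the smallest distance. The sketch below stays in Python for consistency with the earlier examples and reuses dl_distance from above; the correct function, the vocabulary, and the edit threshold are all made up for illustration:

    def correct(word, vocabulary, max_edits=2):
        """Return the closest vocabulary term within max_edits edits,
        or the original word if nothing is close enough."""
        # Assumes dl_distance from the sketch above.
        best = min(vocabulary, key=lambda term: dl_distance(word, term))
        return best if dl_distance(word, best) <= max_edits else word

    # A made-up domain vocabulary; a real one would come from your own data.
    vocab = ["EBITDA", "ARPU", "churn", "cohort"]
    print(correct("EBIDTA", vocab))   # -> "EBITDA" (one transposition away)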

Now that you have some insight into making sure all of your words are spelled correctly, a good first step in text analysis, tomorrow we will dig into identifying the most important terms in your data.

Questions? Send any questions on data analytics or pricing strategy to doug@outlier.ai and I’ll answer them in future issues!


[1] Damerau, Fred J. (March 1964), “A technique for computer detection and correction of spelling errors”, Communications of the ACM, ACM, 7 (3): 171–176.