Thursday, December 19, 2019

Sylvan Stroll: A tribute to Yodayas - II

Sylvan Stroll: A tribute to Yodayas - II: As noted by Hla Hla Htay (see my last post for the link), converting Myanmar text image to Unicode text through Google Docs would be rather...

Saturday, December 14, 2019

Sylvan Stroll: A tribute to Yodayas - I

Sylvan Stroll: A tribute to Yodayas - I: As a musically-illiterate song lover, I loved all beautiful songs, irrespective of the genre and language. This remark, I should add, is no...

Tuesday, October 22, 2019

How do spell checkers work?


Looking around the local resources for Myanmar language spelling checking, I noticed a number of research papers by our own people and also collaborative efforts with foreigners. But I didn’t notice they have incorporated their spell checking capability with some wordprocessing software, or share their code. But my search was not exhaustive and may be I was wrong.
With all the things I’ve been writing about, I am no better than a dabbler and IMHO that may suit you better because I am excited in finding new things, new ideas, and new ways to do things and I felt I should try to share with you fast, relatively speaking, though. Well, here at this point I am really excited about getting some ideas on how spell checkers work so I’ll be trying to share with you, my fellow dummies.

TIL from stackexchange

First, the stackoverflow answers to the question “How do spell checkers work?”. I come to know that in a nutshell the steps are
1. Read a dictionary file into memory.
2. Make that into a tree structure to make the search efficient.
3. Loop through all of the words in your document, and check each one against the search tree.
4. If a word is not found in the dictionary, inform the user and you can also suggest alternate spellings.
Regarding the details of implementation, I could consider prefixes and suffixes (the “Porter Stemming Algorithm”), and the Levenshteinn String Similarity which looks at how many letters must be added, removed, or swaped in a word in order to make another word. Related to these the python Natural Language Toolkit NLTK and Fuzzy Wuzzy are also suggested.
And I am highly stimulated by these words from an answer: “You could then loop through all possible correct spellings of words (only 171,000 english words and 3000 of those account for 95% of text). Determine those with the lowest levenshtein string similarity value, and then return the top X words that are most similar to the misspelled word.” If our guys are highly motivated there’s nothing to stop them from finding solutions for effective spell checking even if we need to handle 10 time that size in terms of syllable or words for our language.

From Quora

An aswer in Quora directed me to the article “How to Write a Spelling Corrector” by Peter Novic, Director of Research at Google. It gives a detailed description of the process and, I guess, a “must read”" for those interested (and a lot smarter than me) in spell checker development. For me and for now I would have to say TL;DR.

My champion

Novic’s spell check implementation has been converted into many different computer languages (35 in all) but the champion in terms of the number of lines of code was R.
The R code is given by Baath:
Well, just two lines of code is astounding, and Baath explains the reason:
  • The main reason for why the R version is so short is because base R includes the adist function. (A one line spell checker in R is indeed possible using the aspell function :)
  • A second reason for why the R version is so short is that the many vectorized functions in R make it possible to do a lot of work in one line.
  • Indeed, the horrible line creating the sorted_words vector would be a perfect target for some magrittr magic.
  • The R version does not solve the problem in exactly the same way as Norvig’s code. He maintains the count of each word in the NWORDS variable in order to be able to extract the most probable matching word. This is not necessary in the R code, as we already have a sorted vector we know that the first item always will be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).

Playing with Novic’s spell checker code

You could conveniently get the classic Myanmar Spelling Book, or wordlists such as Kanaung wordlist on GitHub. Now you have Novic’s spell checker code in different programming languages. Why won’t you choose the version of code of your choice and play? For me, I would earmark the R code, but can’t say when I would be ready to play.