Wiktionary

When creating the Writer’s Block game, I needed a list of valid words. This is a fairly common problem in programming, so I was surprised that I couldn’t find a simple word list available online. There are plenty of online dictionaries, downloadable dictionaries that you have to pay for, and word lists without definitions, but I couldn’t find a free downloadable list of words with definitions. Except Wiktionary.

Wiktionary is the Wikipedia of dictionaries, and when William looked at it three years ago, it was missing a lot of words. That problem seems to have been fixed, and they now have a very large dictionary that you can download and use for free. But there is one catch: the file that you can download is XML with wiki-formatted entries. It includes lots of foreign words, phrases, and proper nouns that aren’t legal words for word games. It also has a lot more text about each word than I wanted to keep. I just want the word and a simple definition. The XML file is also 2.2 GB, which makes it much too large to include in a game.

So I wrote a program to parse through the Wiktionary data and build the word list that I wanted. The program started simple and got more and more complex and ugly as it handled more situations and fixed up some of the weird wiki style.

It is a testament to the speed of the modern computer that the program can run through 2 GB of text parsing out the word and definition in less than a minute.
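
To give a feel for what such a parsing pass involves, here is a minimal Python sketch. It is not the program described above, just an illustration under some assumptions: that the dump is a MediaWiki XML export with `<page>`, `<title>`, and `<text>` elements, that a "good" word is a title made only of lowercase a-z letters, and that the definition to keep is the first `# ` line under the `==English==` heading. The names `extract_words` and `first_definition` are made up for this example, and real entries need far more cleanup than these few regular expressions provide.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a "good" word is lowercase a-z only (no phrases, no proper nouns).
WORD_RE = re.compile(r"^[a-z]+$")
# [[target|label]] -> label, [[target]] -> target
LINK_RE = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")
# Drops simple, non-nested {{templates}}; real entries need smarter handling.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def first_definition(wikitext):
    """Return the first cleaned-up '# ' line of the English section, or None."""
    in_english = False
    for line in wikitext.splitlines():
        if line.startswith("==") and not line.startswith("==="):
            # A level-2 heading starts a new language section.
            in_english = line.strip("= ").strip() == "English"
        elif in_english and line.startswith("# "):
            text = TEMPLATE_RE.sub("", line[2:])
            text = LINK_RE.sub(r"\1", text)
            text = text.replace("'''", "").replace("''", "")
            text = " ".join(text.split())
            if text:
                return text
    return None


def extract_words(source):
    """Stream (word, definition) pairs from a file name or binary file object."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            if title and WORD_RE.match(title):
                definition = first_definition(elem.text or "")
                if definition:
                    yield title, definition
        elif name == "page":
            title = None
            elem.clear()  # free the page's text so memory use stays modest
```

Because `iterparse` streams the file, the whole 2 GB never has to sit in memory at once. Writing each pair out as one line gives a file along the lines of the word list described here.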

In the end, of the 1.2 million entries in Wiktionary, there are 175 thousand “good” words, which is about the same number as contained in the 20-volume Oxford English Dictionary. The resulting file of words and one-line definitions is 10 MB and can be loaded into memory in about 2 seconds.
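
Loading a list like this into a game can be very simple. The sketch below assumes one entry per line with the word and its definition separated by a tab. That layout is a guess for illustration, not necessarily the format of the actual file, and `load_word_list` is a made-up name.

```python
def load_word_list(lines):
    """Build a {word: definition} dict from 'word<TAB>definition' lines."""
    words = {}
    for line in lines:
        word, sep, definition = line.rstrip("\n").partition("\t")
        if sep and word:  # skip lines that have no tab or no word
            words[word] = definition
    return words
```

With a file named, say, `wordlist.txt` (a hypothetical name), `load_word_list(open("wordlist.txt", encoding="utf-8"))` returns a dictionary in which checking whether a play is legal is just `word in words`.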

Here is a link to the WordList produced by my program. Keep in mind that Wiktionary is constantly changing. This copy was pulled on 3/20/2012. Also, remember that this data is derived from Wiktionary, so their license applies.
