Lojban spell checking under Unix

I have lately been fiddling with ispell(1), a spell checker that is popular on Unix clones, and I have assembled a limited but working hash file for the Lojban language. For those of you not familiar with Ispell, the hash file is the dictionary file that contains all the words for a particular language, plus some rules that make it easier for Ispell to guess which word should be substituted for the error.

If you have a Unix/Linux account and Ispell, and would like to catch your Lojban typos, read the section below on how to set it up on your system. If you're not running Unix, you can skip the next section, but you might still be interested in the rest, which is about how I made the word list, and some of the advantages and disadvantages of both the list and Ispell.

Installing

You must be the system administrator (or able to persuade him/her) to install the hash file in the correct directory. (At least I think so - there might be a way to specify a file in your home directory, but if there is, I don't know how.)
  1. Here is the dictionary file: lojban.hash.gz (9051 entries, 74KB). Download it.
  2. Run gunzip on it: gunzip lojban.hash.gz
  3. Put the resulting file, lojban.hash, in the directory /usr/lib/ispell/.
  4. Spell check your Lojban text files with: ispell -d lojban your-lojban.text

The making of lojban.hash

I started with the official word lists of gismu, lujvo and cmavo. I used some pipelines and text utilities to separate the raw words from the keywords, the explanations, and all the other excess verbiage in the word lists. To get a decent number of fu'ivla and cmene, I took them from the word frequency list.
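The extraction step can be sketched roughly as follows. This is not the pipeline I actually used, only an illustration in Python, and it assumes a simplified layout in which each line starts with the word itself, followed by whitespace and the explanation. The sample lines and the function names are made up for the example.

```python
def extract_words(lines):
    """Return the first whitespace-delimited field of each non-empty line."""
    words = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            words.append(fields[0])
    return words


def build_word_list(*sources):
    """Merge several extracted lists into one sorted, duplicate-free list."""
    merged = set()
    for source in sources:
        merged.update(extract_words(source))
    return sorted(merged)


# Made-up sample lines in the assumed "word, then explanation" layout.
sample_gismu = ["klama   come   x1 comes to x2", "", "prami   love   x1 loves x2"]
sample_cmavo = [".i   sentence link", "klama   duplicate entry"]

word_list = build_word_list(sample_gismu, sample_cmavo)
print(word_list)
```

The real word lists have their own column layouts, so the field-splitting line would have to be adapted to each file.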

Then I proceeded with making an affix file. Affix files are very useful for languages such as English, where many words are very similar, and differ only in the first or last syllable (or both). Encoding a "root" with all possible prefix/suffix combinations on one line, instead of listing every variation as a separate word, can save lots of space in the hash file. But they are not so useful with Lojban, since the Lojban affixes, rafsi, are so numerous that only a little file space can be saved with this method. Nevertheless, the program that builds the final hash file (buildhash) requires both an affix file, and a preprocessed word list (in which affixes have been replaced with letter codes).
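To illustrate the idea of letter codes, here is a toy sketch in Python. It is not real affix-file syntax and not how buildhash works internally. The flag table is hypothetical and ignores the conditions a real affix file attaches to each rule; it only shows how one "root/FLAGS" line can stand for several full words.

```python
# Hypothetical flag table: each letter code stands for one suffix.
SUFFIX_FLAGS = {"S": "s", "D": "ed", "G": "ing"}


def expand(entry):
    """Expand a 'root/FLAGS' entry into the root plus one word per flag."""
    root, _, flags = entry.partition("/")
    return [root] + [root + SUFFIX_FLAGS[flag] for flag in flags]


compressed = ["walk/SDG", "talk/SDG"]
expanded = [word for entry in compressed for word in expand(entry)]

# Compare the number of characters needed by each representation.
compressed_size = sum(len(entry) for entry in compressed)
expanded_size = sum(len(word) for word in expanded)
print(expanded, compressed_size, expanded_size)
```

An entry without a slash, such as a plain gismu, expands to just itself, which is the common case for Lojban.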

Advantages of using Ispell with Lojban text

The obvious advantage is that you can find genuine typos in your Lojban texts before posting or publishing them. Just finding out which words aren't in a word list is an easy task, but with Ispell's hash file, it goes much faster, and it can guess which words should be substituted for the error.
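The "easy task" of checking words against a list, plus a crude form of guessing, might look like this in Python. This is only a sketch: the guessing uses the similarity matching in Python's standard difflib module, which is a stand-in and not the method Ispell itself uses, and the tiny word list is made up for the example.

```python
import difflib


def unknown_words(text, known):
    """Return the words of text that are not in the known word set."""
    return [word for word in text.split() if word not in known]


def suggest(word, known, limit=3):
    """Suggest similar known words (difflib similarity, not Ispell's method)."""
    return difflib.get_close_matches(word, sorted(known), n=limit, cutoff=0.6)


# Made-up miniature word list for the example.
known = {"mi", "klama", "le", "zarci"}

misspelled = unknown_words("mi klama le zarsi", known)
suggestions = {word: suggest(word, known) for word in misspelled}
print(misspelled, suggestions)
```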

Another good thing about Ispell is that I (as a maker of the word list) can fool around with the character set. In this way, I can define the Lojban character set exactly the way it is. For example, q and w become illegal characters (they're not part of the Lojban alphabet), and "h" is the uppercase version of "'" (the apostrophe character). The latter, together with the comma, is defined as a "boundary character", i.e. a character that only makes sense inside words, not at the beginning or the end. Everything in accordance with the finest Lojban tradition.
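The character rules above can be mimicked in a few lines of Python. This is a sketch of the rules as described here, not of Ispell's affix-file syntax, and the names are made up for the example. It assumes that folding to lowercase turns "h" into an apostrophe, and it leaves out other characters such as the period.

```python
# Lowercase letters allowed in a word: a-z without q and w.
# "h" is also absent, because normalize() turns it into an apostrophe.
LETTERS = set("abcdefgijklmnoprstuvxyz")

# Characters that only make sense inside a word.
BOUNDARY = set("',")


def normalize(word):
    """Fold to lowercase, treating 'h' as a variant of the apostrophe."""
    return word.lower().replace("h", "'")


def is_well_formed(word):
    """Check the character set and the boundary-character rule."""
    folded = normalize(word)
    if not folded:
        return False
    if folded[0] in BOUNDARY or folded[-1] in BOUNDARY:
        return False  # apostrophe or comma at the beginning or the end
    return all(ch in LETTERS or ch in BOUNDARY for ch in folded)


for candidate in ["lesu'u", "lesuhu", "qwerty", "'a", "a,"]:
    print(candidate, is_well_formed(candidate))
```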

Disadvantages

Limitations with Ispell

If two words are run together, Ispell will suggest separating them, either with a space or a hyphen. We don't need the hyphen in Lojban, but Ispell is hard-coded to give the user the option to use it. It's possible to tell Ispell to let the two words be run together, and in the case of cmavo, it might be an advantage (with compounds such as ".iicai", "lesu'u", "cici" and so on). But since Ispell doesn't know the difference between the different Lojban parts of speech, we'd end up allowing brivla to be run together indiscriminately, which would be very, very bad.
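The trade-off can be seen in a small Python sketch of run-together checking. This is an illustration of the general idea, not Ispell's implementation, and the word list is made up. Since the checker only knows whether a string is in the list, it cannot tell a cmavo compound from two brivla glued together.

```python
def two_word_splits(word, known):
    """Return every way to split word into two known words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in known and word[i:] in known
    ]


# Made-up miniature word list: three cmavo and one gismu.
known = {"le", "su'u", "ci", "klama"}

print(two_word_splits("lesu'u", known))      # a reasonable cmavo compound
print(two_word_splits("cici", known))        # another one
print(two_word_splits("klamaklama", known))  # two brivla, accepted just the same
```

Allowing run-together words would accept all three, because nothing in the word list marks "klama" as a different part of speech from "le" or "ci".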

Problems with the dictionary file

As mentioned earlier, LLG publishes word lists for gismu, cmavo and lujvo, but not for fu'ivla (loan words) and cmene (proper names), both of which are very important word types. I have countered this problem by importing fu'ivla and cmene from a frequency list, but this introduces other problems: the lists are quite old, with relatively few words, and they're not "official" like the other lists; i.e. some of them are wrong.
Arnt Richard Johansen, arj@fix.no