Lojban spell checking under Unix
I have lately been fiddling with ispell(1), a spell checker
that is popular on Unix clones, and I have assembled a limited but
working hash file for the Lojban language. For those of you
not familiar with Ispell, the hash file is the dictionary file that
contains all the words for a particular language, along with some rules that
make it easier for Ispell to guess which word should be
substituted for the error.
If you have a Unix/Linux account and Ispell, and would like to catch
your Lojban typos, read the section below on how to set it up on your
system. If you're not running Unix, you can skip the next section, but
you might still be interested in the rest, which is about how I made the
word list, and some of the advantages and disadvantages of it and of
Ispell.
Installing
You must be the system administrator (or able to persuade him/her) to
install the hash file in the correct directory. (At least I
think so - there might be a way to specify a file in
your home directory, but if there is, I don't know how.)
- Here is the dictionary file: lojban.hash.gz (9051 entries, 74KB). Download it.
- Run gunzip on it: gunzip lojban.hash.gz
- Put the resulting file, lojban.hash, in the directory /usr/lib/ispell/.
- Spell check your Lojban text files with: ispell -d lojban your-lojban.text
The making of lojban.hash
I started with the official word lists of gismu, lujvo and cmavo. I used some
pipelines and text utilities to separate the raw words from the keywords,
explanations, and all the other excess verbiage in the word lists. To
get a decent number of fu'ivla and cmene, I took them from the word frequency
list.
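The extraction step can be sketched roughly as follows. This is a minimal illustration and not the pipeline I actually used: it assumes one entry per line with the Lojban word as the first whitespace-separated field, and the real word lists may be laid out differently. The function name and the sample lines are made up for the example.

```python
def extract_words(lines):
    """Collect the first field of every non-blank line, without duplicates."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields:                  # skip blank lines
            words.add(fields[0])    # assumption: the raw word comes first
    return sorted(words)


# Hypothetical sample in an assumed "word, rafsi, keyword, definition" layout.
sample = [
    "bajra  baj  run   x1 runs on surface x2",
    "klama  kla  come  x1 comes/goes to x2",
    "",
    "bajra  baj  run   (duplicate entry)",
]

print("\n".join(extract_words(sample)))
```

The result is one raw word per line, which is the kind of plain word list the later steps start from.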
Then I proceeded with making an affix file. Affix files are
very useful for languages such as English, where many words are very
similar, and differ only in the first or last syllable (or both).
Encoding a "root" with all possible prefix/suffix combinations on one
line, instead of listing every variation as a separate word, can save lots
of space in the hash file. But affix files are not so useful with Lojban, since
the Lojban affixes, rafsi, are so numerous that only a little file space can
be saved with this method. Nevertheless, the program that builds the
final hash file (buildhash) requires both an affix file,
and a preprocessed word list (in which affixes have been replaced with
letter codes).
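To illustrate why letter codes save space for a language like English, here is a toy expansion of a "root/flags" entry. It is only a sketch of the idea: the flag table is hypothetical, and real Ispell affix rules also carry conditions and can strip letters from the root, which this example ignores.

```python
# Hypothetical flag table: each flag letter simply appends one suffix.
SUFFIXES = {"S": "s", "D": "ed", "G": "ing"}


def expand(entry):
    """Expand a 'root/FLAGS' entry into the list of words it stands for."""
    root, _, flags = entry.partition("/")
    return [root] + [root + SUFFIXES[flag] for flag in flags]


# One line in the word list stands for four words.
print(expand("walk/SDG"))
# An entry without flags stands only for itself.
print(expand("klama"))
```

With only a handful of affixes, one line can replace many words. With Lojban, most entries would be like the second one and carry no flags at all.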
Advantages of using Ispell with Lojban text
The obvious advantage is that you can find genuine typos in your Lojban
texts before posting or publishing them. Just finding out which words
aren't in a word list is an easy task, but with Ispell's hash file the lookup
goes much faster, and Ispell can guess which word should be substituted for
the error.
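The "easy task" half of this can be sketched in a few lines. The guessing part below uses Python's difflib as a stand-in; it is not the method Ispell uses to produce its guesses, and the tiny word list is just sample data.

```python
import difflib


def check(text, known):
    """Map each unknown word in text to a list of similar known words."""
    report = {}
    for word in text.split():
        if word not in known:
            # Stand-in for Ispell's guessing: similarity-based matching.
            report[word] = difflib.get_close_matches(word, sorted(known), n=3)
    return report


known = {"mi", "klama", "le", "zarci"}   # sample word list
for word, guesses in check("mi klamq le zarci", known).items():
    print(word, "->", guesses)
```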
Another good thing about Ispell is that I (as a maker of
the word list) can fool around with the character set. In this way, I
can define the Lojban character set exactly the way it is. For example,
q and w become illegal characters (they're not part of the Lojban
alphabet), and "h" is the uppercase version of "'" (the apostrophe
character). The latter, together with the comma, is defined as a
"boundary character", i.e. a character that only makes sense inside
words, not at the beginning or the end. Everything in accordance with
the finest Lojban tradition.
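The effect of these character set rules can be approximated with a regular expression. This is a sketch of the rules as described above, not Ispell's own mechanism: it treats "h" as an alternative spelling of the apostrophe, and the function names are made up for the example.

```python
import re

# Lowercase Latin letters without h, q and w.
LETTER = "[a-gi-pr-vx-z]"
# Apostrophe and comma may only appear between letters (boundary characters).
WORD_RE = re.compile(rf"{LETTER}+(?:[',]{LETTER}+)*")


def normalise(word):
    """Fold 'h' into the apostrophe, as the character set definition does."""
    return word.replace("h", "'")


def is_wellformed(word):
    """True if the word uses only legal characters in legal positions."""
    return WORD_RE.fullmatch(normalise(word)) is not None


for candidate in ["su'u", "suhu", "'su", "su'", "qwerty"]:
    print(candidate, is_wellformed(candidate))
```

Note that this only checks the characters, not whether the word exists in the word list.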
Disadvantages
Limitations of Ispell
If two words are run together, Ispell will suggest separating them,
either with a space or a hyphen. We don't need the hyphen in Lojban,
but Ispell is hard-coded to give the user the option to use it. It's
possible to tell Ispell to let the two words be run together, and in the
case of cmavo, it might be an advantage (with compounds such as
".iicai", "lesu'u", "cici" and so on). But since Ispell doesn't know
the difference between the different Lojban parts of speech, we'd end up
allowing brivla to be run together indiscriminately, which would be
very, very bad.
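What one would want instead is a check that knows the part of speech of each half. The sketch below shows the idea. Ispell cannot do this; the two word sets are small hypothetical samples, and the function names are made up.

```python
# Hypothetical part-of-speech tables; a real checker would load full lists.
CMAVO = {"le", "su'u", "ci", ".ii", "cai"}
BRIVLA = {"klama", "zarci"}
KNOWN = CMAVO | BRIVLA


def two_word_splits(token, known=KNOWN):
    """All ways to cut token into two words that are both known."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in known and token[i:] in known
    ]


def may_stay_joined(token):
    """Accept a run-together token only if it is a compound of two cmavo."""
    return any(a in CMAVO and b in CMAVO for a, b in two_word_splits(token))


for token in ["lesu'u", "cici", ".iicai", "klamazarci"]:
    print(token, two_word_splits(token), may_stay_joined(token))
```

Here the cmavo compounds pass, while two brivla run together are still flagged.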
Problems with the dictionary file
As mentioned earlier, LLG publishes word lists for gismu, cmavo and
lujvo, but not for fu'ivla (loan words) and cmene (proper names), both of
which are very important word types. I have countered this problem by
importing fu'ivla and cmene from a frequency list, but this introduces
other problems: the list is quite old, with relatively few words, and
it's not "official" like the other lists; i.e. some of its entries are wrong.
Arnt Richard Johansen,
arj@fix.no