[Aspell-user] Configuring spell check in multi-language documents

Discussion:

Swarup

2011-07-08 12:35:04 UTC

I note the following question and reply, posted previously on this list:

-----------

How to configure aspell to check multiple languages in same document?

I'm afraid you can't. It is a long standing feature request that might
get implemented someday. Unfortunately, it is not a straightforward
task.
-----------

Still, even if checking multiple languages in the same document remains
too formidable a task at this time, how about getting the spellchecker
to ignore words in a different script? If it would stop underlining all
the English words when I am spellchecking Hindi, that would give
tremendous relief.

So would the algorithm for getting aspell to ignore latin script be
different from getting it to spellcheck multiple languages
simultaneously?

Carlo Traverso

2011-07-08 18:24:54 UTC

Swarup> I note the following question and reply, posted previously
Swarup> on this list:

Swarup> How to configure aspell to check multiple languages in same
Swarup> document?

Swarup> I'm afraid you can't. It is a long standing feature
Swarup> request that might get implemented someday. Unfortunately,
Swarup> it is not a straightforward task.

The best approximation is to use

aspell list -l lang1 | aspell list -l lang2

Only the words unacceptable in both languages will appear in the
standard output. The problem arises when the two languages define
"word" differently. For example, German does not accept an apostrophe
as a word component, but English accepts an apostrophe between two
letters; hence German and English tokenize the same text differently.
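
To illustrate the caveat, here is a minimal Python sketch of the
two-pass idea. It is a toy model, not Aspell itself: the word sets
ENGLISH and GERMAN, the tokenize() rules and the function names are
all assumptions made up for this example. The only thing it borrows
from the description above is that the first pass allows an
apostrophe between letters and the second pass does not.

```python
import re

# Hypothetical stand-ins for two dictionaries (not real Aspell word lists).
ENGLISH = {"the", "house"}
GERMAN = {"das", "haus", "grün"}


def tokenize(text, allow_apostrophe):
    """Split text into words; optionally allow an apostrophe between letters."""
    letters = r"[^\W\d_]+"  # one or more Unicode letters
    pattern = letters + r"(?:'" + letters + r")*" if allow_apostrophe else letters
    return re.findall(pattern, text)


def list_misspelled(text, dictionary, allow_apostrophe):
    """Toy equivalent of a 'list' pass: return words missing from dictionary."""
    return [w for w in tokenize(text, allow_apostrophe)
            if w.lower() not in dictionary]


def misspelled_in_both(text):
    """Chain the two passes the way the shell pipe does."""
    first_pass = list_misspelled(text, ENGLISH, allow_apostrophe=True)
    # The second pass only sees the output of the first, one word per line.
    return list_misspelled("\n".join(first_pass), GERMAN, allow_apostrophe=False)


print(misspelled_in_both("the Haus isn't grün"))
```

With these toy inputs the result is ['isn', 't']: the first pass
reports "isn't" as one word, and the second pass splits it at the
apostrophe and reports the fragments, which is the kind of mismatch
described above.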

Swarup> Still, even if checking multiple languages in the same
Swarup> document remains too formidable a task at this time, how
Swarup> about getting the spellchecker to ignore words in a
Swarup> different script? If it would stop underlining all the
Swarup> English words when I am spellchecking Hindi, that would
Swarup> give tremendous relief.

Swarup> So would the algorithm for getting aspell to ignore latin
Swarup> script be different from getting it to spellcheck multiple
Swarup> languages simultaneously?

I did not check the Hindi dictionaries, but Hindi probably accepts
both Latin and Hindi characters as word components (this is how
ancient Greek, grc, does it). The solution to your problem could be to
define a variant of Hindi that only accepts Hindi characters.

Carlo Traverso

Mahesh T. Pai

2011-07-08 18:49:20 UTC

Post by Carlo Traverso
aspell list -l lang1 | aspell list -l lang2

That would take the words out of their context, no?

Post by Carlo Traverso
I did not check the hindi dictionaries, but probably hindi accepts
both latin and hindi characters as word components (this is how
ancient greek, grc, does). The solution of your problem could be to
define a variant of hindi that only accepts hindi characters.

AFAICT, no. Especially if you are putting that in the linguistic
sense.

Hindi (and most Indic languages) use the 16 bit mapping in UTF-8
encoding schema.

I suspect that the difficulties mentioned by Kevin have more to do
with aspell being "internally 8 bit", as Kevin put it some months back.

Probably, the difficulty is in distinguishing between a few bytes of 8
bit characters followed by a few bytes of 16 bit characters. Of course,
I am no expert or even a programmer and I may be way off the mark.

If you want a look at the kind of documents we have in mind, have a
look at

http://finance.kerala.gov.in/
index.php?option=com_docman&task=doc_download&gid=3047&Itemid=34

(watch out for the broken line - to avoid problems in mailers)

That is a pdf file, with both English and Malayalam script. We use
plenty of documents like that. The pdf itself is unlikely to use
UTF-8, so do not use it as an example for anything except visual
representation of the text.

--
Mahesh T. Pai ||
DICTIONARY, n. A malevolent literary device for cramping the
growth of a language and making it hard and inelastic.

Kevin Atkinson

2011-07-09 04:25:09 UTC

Post by Mahesh T. Pai

Post by Carlo Traverso
aspell list -l lang1 | aspell list -l lang2

That would take the words out of their context, no?

AFAICT, no. Especially if you are putting that in the linguistic
sense.

The linguistic sense is not relevant here; what is relevant is whether
Aspell treats Latin characters as part of the word. I just looked up how
Hindi is configured in Aspell and this is not the case. Thus, the problem
is that gedit is trying to check Latin words even though they are in a
completely different script. Naturally these words are not in the Hindi
dictionary, so Aspell marks them as incorrect. The best solution is not
to try to check those words at all (which is what Aspell will do if you
check the file from the command line); the second best solution is for
Aspell to mark words without any word characters as correct, which I
discussed in an earlier email.
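
On the application side, "not trying to check those words at all"
could look roughly like the Python sketch below. This is an
illustration only, not gedit's or Aspell's code: the function names
are hypothetical, words are split on whitespace for simplicity, and
"Hindi character" is approximated as any code point in the Unicode
Devanagari block (U+0900 to U+097F).

```python
def is_devanagari(ch):
    """True if ch lies in the Unicode Devanagari block (U+0900..U+097F)."""
    return "\u0900" <= ch <= "\u097f"


def words_to_check(text):
    """Return only the whitespace-separated words worth sending to a
    Hindi spell checker, i.e. those with at least one Devanagari character."""
    return [word for word in text.split()
            if any(is_devanagari(ch) for ch in word)]


# English words never reach the (hypothetical) Hindi checker.
print(words_to_check("यह एक test है"))
```

A real application would also need to strip punctuation and decide
what to do with words that mix scripts; the sketch ignores both.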

Post by Mahesh T. Pai
Hindi (and most Indic languages) use the 16 bit mapping in UTF-8
encoding schema.
I suspect that the difficulties mentioned by Kevin have more to do
with aspell being "internally 8 bit", as Kevin put it some months back.

That has absolutely nothing to do with it.

Kevin Atkinson

2011-07-08 20:33:30 UTC

Post by Swarup
-----------

How to configure aspell to check multiple languages in same document?

I'm afraid you can't. It is a long standing feature request that might
get implemented someday. Unfortunately, it is not a straightforward
task.
-----------
Still, even if checking multiple languages in the same document remains
too formidable a task at this time, how about getting the spellchecker
to ignore words in a different script? If it would stop underlining all
the English words when I am spellchecking Hindi, that would give
tremendous relief.

A lot of it depends on how the application uses Aspell. If it tokenizes
the words itself, then Aspell will always return false when it calls
the "check" method on a word that contains foreign characters.

If Aspell does the tokenizing, then it depends on how the language is
configured. If it's a Latin-based language then all non-Latin words will
be skipped. If it's a non-Latin-based language then Latin words may or may
not be skipped, depending on whether Latin letters are configured to be
accepted as word characters, which depends on the charset used. Changing
the language configuration is doable but will take a bit of work.
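
The effect of the word-character configuration can be shown with a
small Python sketch. It is a simplified model and not Aspell's
tokenizer: tokenize() and the two predicates are hypothetical names,
and each predicate stands in for one possible language configuration.

```python
def tokenize(text, is_word_char):
    """Collect maximal runs of characters accepted by is_word_char."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def devanagari_only(ch):
    """Configuration A (assumed): only Devanagari code points are word characters."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_or_latin(ch):
    """Configuration B (assumed): ASCII letters are word characters as well."""
    return devanagari_only(ch) or (ch.isascii() and ch.isalpha())


text = "यह test है"
print(tokenize(text, devanagari_only))      # Latin word is skipped
print(tokenize(text, devanagari_or_latin))  # Latin word becomes a token
```

Under configuration A the Latin word never becomes a token, so it is
never looked up. Under configuration B it is tokenized, looked up in
the Hindi dictionary, and reported as misspelled.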

Post by Swarup
So would the algorithm for getting aspell to ignore latin script be
different from getting it to spellcheck multiple languages
simultaneously?

It is a fairly simple task to get the "check" method to simply accept words
that don't contain any letter characters. However, the language needs to
be correctly configured (see above). You're welcome to file a feature
request (if there is not one already).
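
As a rough picture of what such a change would mean, the Python
sketch below wraps an existing "check" function so that words with
no letter characters of the configured language are accepted. This
is not Aspell code and not a proposed patch: make_lenient_check(),
the toy dictionary and the Devanagari test are all assumptions made
for the example.

```python
def is_hindi_letter(ch):
    """Assumed definition of a word character: the Devanagari block."""
    return "\u0900" <= ch <= "\u097f"


def make_lenient_check(check, is_word_char):
    """Wrap check() so words without any word characters count as correct."""
    def lenient_check(word):
        if not any(is_word_char(ch) for ch in word):
            return True  # nothing in this word belongs to the language
        return check(word)
    return lenient_check


# Toy dictionary standing in for the real speller.
HINDI_WORDS = {"यह", "है"}


def strict_check(word):
    return word in HINDI_WORDS


lenient_check = make_lenient_check(strict_check, is_hindi_letter)

print(strict_check("test"))    # False: not in the Hindi dictionary
print(lenient_check("test"))   # True: contains no Hindi letters
print(lenient_check("यहह"))    # False: Hindi letters, but not a known word
```

An application that tokenizes on its own would then stop flagging
English words, while misspelled Hindi words are still caught.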