[Aspell-user] What letters belongs to a word? (issue with Turkish "ı")

Discussion:

Daniel

2011-09-07 09:47:01 UTC

When I run my aspell (through emacs), the Turkish letter "ı" is not
considered to be part of words; I'm given suggestions "Bostanc",
when I actually wrote "Bastancı". I am given no suggestion for
"İstanbulı", since "İ" actually _is_ recognized as part of the
word... How can I tell aspell to include "ı" in words?

Agustin Martin

2011-09-08 18:22:17 UTC

When I run my aspell (through emacs), the Turkish letter "ı" is not
considered to be part of words; I'm given suggestions "Bostanc",
when I actually wrote "Bastancı". I am given no suggestion for
"İstanbulı", since "İ" actually _is_ recognized as part of the
word... How can I tell aspell to include "ı" in words?

Which aspell and emacs version are you using?

--
Agustin

Daniel

2011-09-09 08:35:17 UTC

Post by Agustin Martin

Which aspell and emacs version are you using?

This is emacs 23.3.1 and aspell 0.60.6.1.

To clarify, it is the Turkish "lowercase dotless i" that Aspell doesn't
recognize as part of a word. This is when I run aspell on my text using
an English dictionary (from aspell-en package).

Kevin Atkinson

2011-09-10 01:46:58 UTC

Post by Daniel

Post by Agustin Martin

When I run my aspell (through emacs), the Turkish letter "ı" is not
considered to be part of words; I'm given suggestions "Bostanc",
when I actually wrote "Bastancı". I am given no suggestion for
"İstanbulı", since "İ" actually _is_ recognized as part of the
word... How can I tell aspell to include "ı" in words?

Which aspell and emacs version are you using?

This is emacs 23.3.1 and aspell 0.60.6.1.
To clarify, it is the Turkish "lowercase dotless i" that Aspell doesn't
recognize as part of a word. This is when I run aspell on my text using
an English dictionary (from aspell-en package).

Daniel

2011-09-10 15:58:31 UTC

Post by Kevin Atkinson

More or less. Please see
http://aspell.net/man-html/Notes-on-8_002dbit-Characters.html. That being
said, the problem you are facing is not just because the dictionary is
8-bit, but also because I convert the document to the 8-bit encoding
before I tokenize it. The latter is something I plan to eventually fix.
If you really want to be able to recognize Turkish words when using the
English dictionary, then you can try the attached special character set.
Unzip the contents into the directory given by `aspell config data-dir`,
then change "charset iso8859-1" to "charset iso8859-1-u" in en.dat.
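
The effect described above can be imitated outside Aspell. The sketch below is only an illustration in Python, not Aspell's actual code: the function name is hypothetical, and replacing unrepresentable characters with "?" is Python's behavior, assumed here as a stand-in for whatever Aspell does internally. It shows how converting to an 8-bit charset before tokenizing cuts a word at "ı".

```python
import re


def tokenize_after_8bit_conversion(text, charset="latin-1"):
    """Hypothetical helper: convert to an 8-bit charset, then split into words.

    "latin-1" is Python's name for ISO-8859-1. Characters the charset cannot
    represent are replaced with "?" (Python's choice, an assumption here).
    """
    eight_bit = text.encode(charset, errors="replace").decode(charset)
    # A word is a run of letters; "?" is not a letter, so it ends the word.
    return re.findall(r"[^\W\d_]+", eight_bit)


print(tokenize_after_8bit_conversion("Bostancı"))
```

Because "ı" has no code point in ISO-8859-1, it is lost before the tokenizer ever sees it, and the remaining letters are checked as if they were the whole word.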

Yes, since I know how to spell non-English place names etc. correctly,
I really do want that. Thanks Kevin, that text explained quite well why
Aspell functions the way it does, and it now comes across as less
silly... The alternative charset worked fine!

Post by Kevin Atkinson

However, even if Aspell did recognize the word correctly it would be
unlikely to do what you want when using the English dictionary because
special rules are needed to handle the Turkish ı when changing case.
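
The case problem mentioned above can be seen with any language-neutral case mapping. The following is a minimal sketch in Python, not anything Aspell provides: `turkish_upper` and `turkish_lower` are hypothetical helper names, and they cover only the dotted/dotless i pairs.

```python
def turkish_upper(word):
    """Hypothetical helper: uppercase with Turkish rules (i -> İ, ı -> I)."""
    return word.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(word):
    """Hypothetical helper: lowercase with Turkish rules (İ -> i, I -> ı)."""
    return word.replace("İ", "i").replace("I", "ı").lower()


word = "Bostancı"

# Default (English-style) case mapping: ı uppercases to plain I,
# and I lowercases back to dotted i, so the round trip changes the word.
print(word.upper().lower())

# Turkish-aware mapping keeps the dotless i through the round trip.
print(turkish_lower(turkish_upper(word)))
```

With the default mapping the upper-case and lower-case forms of a word no longer correspond to each other, which is why each case variant has to be added separately when an English dictionary is in use.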

Certainly. But now I can at least easily add whole words to the private
dictionary, even if I have to add them once per case.

Kevin Atkinson

2011-09-10 23:14:26 UTC

Post by Daniel

Certainly. But now I can at least easily add whole words to the private
dictionary, even if I have to add them once per case.

Note that the personal dictionary is likely to be saved in this internal
encoding. This is unlikely to be what you want, since only Aspell can read
the encoding. Once the personal dictionary is created you can convert it to
UTF-8 using the following command:

aspell conv iso8859-1-u utf-8 < OLD > NEW

then edit NEW and add the string " utf-8" to the first line. See
http://aspell.net/man-html/Format-of-the-Personal-and-Replacement-Dictionaries.html.
You can have Aspell use utf-8 by default by adding the line:

data-encoding utf-8

to "en.dat", see (http://aspell.net/man-html/The-Language-Data-File.html).
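
The manual edit of the first line can also be scripted. This is a small sketch under stated assumptions: the function name is hypothetical, and the sample header `personal_ws-1.1 en 2` is only an illustrative first line, so check it against your own file and the format page linked above.

```python
def tag_header_as_utf8(dictionary_text):
    """Hypothetical helper: append " utf-8" to the first line if it is missing."""
    lines = dictionary_text.split("\n")
    if not lines[0].endswith(" utf-8"):
        lines[0] += " utf-8"
    return "\n".join(lines)


# Illustrative content of NEW after the conversion step (assumed sample data).
converted = "personal_ws-1.1 en 2\nBostancı\nİstanbul\n"
print(tag_header_as_utf8(converted))
```

The check for an existing " utf-8" suffix makes the helper safe to run twice on the same file.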

Having the dictionary saved in the same encoding that the language uses is
due to historical reasons. I hope to eventually have the encoding Aspell
uses for everything but its internal use default to utf-8.