
My comments might be boring - you might want to skip to the "Why does it work" part. Robert Kehl wrote:
On Thursday, June 10, 2004 4:43 AM Boris Ratner
wrote: The problem occurred when I tried to send 'iso-8859-8' encoded mail to otrs 1.2.3 *with* DefaultCharset->utf-8 set. I used Mozilla-mail to do this. Mozilla mail sends 'iso-8859-8' encoded mail with the encoding set to 'iso-8859-8-i'.
This discovery puzzled me with regard to the Python script I wrote to deal with this problem. I went to the Python equivalent of Alias.pm and found that I had already made this change in Python (and totally forgot about it).
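For illustration, the kind of alias change described above could look roughly like this in Python. This is only a minimal sketch, not the actual script: the names CHARSET_ALIASES, resolve_charset and to_utf8 are made up for the example, and the sample bytes are the Hebrew word "shalom" in iso-8859-8.

```python
# Hypothetical alias table: charset labels as they appear in mail headers,
# mapped to a codec name the decoder knows (the job Alias.pm does on the Perl side).
CHARSET_ALIASES = {
    "iso-8859-8-i": "iso-8859-8",  # "-i" only signals logical order; same character map
}


def resolve_charset(label):
    """Normalize a charset label and apply the alias table."""
    key = label.strip().lower()
    return CHARSET_ALIASES.get(key, key)


def to_utf8(raw, label):
    """Recode raw bytes from the declared charset to UTF-8."""
    return raw.decode(resolve_charset(label)).encode("utf-8")


# "shalom" in iso-8859-8, logical order: shin, lamed, vav, final mem
shalom_8859_8 = b"\xf9\xec\xe5\xed"
shalom_utf8 = to_utf8(shalom_8859_8, "ISO-8859-8-I")
```

Keeping the alias in a small table of its own means the lookup works the same whether or not the underlying codec library already knows the "-i" label.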
One could think of Mozilla handling 8859-8-I, which is a LTR language, the same way as 8859-8, which is to be written RTL. So Mozilla changed the language direction to its wish? Um...
Mozilla doesn't have "iso-8859-8"; it has only "iso-8859-8-i". The difference is supposed to be that the "normal" one is visual and the "-i" one is logical with regard to the arrangement of Hebrew chars in the file.
Example: let's say that this is a Hebrew string, "MOLAHS" ("S" is actually the first char of the RTL Hebrew word). If you save it as "visual", then in the file it looks the way it is seen on the screen: "MOLAHS" ("M" is the first char encountered in the file). In "logical" it is written to the file in the "correct" char order, "SHALOM" ("S" comes first in the file), and the displaying software (text editor, browser) *should* turn it around and display it as "MOLAHS" on the screen.
Proof -> I'm using konsole with Hebrew fonts set up; it doesn't have the BIDI algorithm "smarts". If I "#cat iso-8859-8-i.eml" I see the Hebrew inside reversed, because the "cat" utility displays the chars in the order they appear in the file.
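The "SHALOM"/"MOLAHS" example can be written down as a toy sketch. It assumes a single run of pure RTL text, so plain string reversal stands in for the BIDI algorithm (the real algorithm also has to handle mixed LTR/RTL text, digits and punctuation). All function names here are made up for the illustration.

```python
def naive_bidi(logical):
    """Stand-in for BIDI on one all-RTL run: just reverse the char order."""
    return logical[::-1]


def dumb_display(stored):
    """A display without BIDI 'smarts' (like cat in konsole): file order as-is."""
    return stored


def bidi_display(stored_logical):
    """A BIDI-aware display (browser, MUA) fed logical-order text."""
    return naive_bidi(stored_logical)


stored_logical = "SHALOM"                    # "-i": first char of the word comes first
stored_visual = naive_bidi(stored_logical)   # visual: stored the way it looks on screen

on_screen_visual = dumb_display(stored_visual)     # "MOLAHS", looks right
on_screen_logical = bidi_display(stored_logical)   # "MOLAHS", looks right
on_screen_cat = dumb_display(stored_logical)       # "SHALOM", looks reversed
```

The last line is the konsole case: logical-order text shown by a program that does not reorder anything.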
This is because you yourself composed an iso-8859-8 message, which was flagged incorrectly by your mailer as iso-8859-8-i, which in the end you mapped to iso-8859-8 again.
Actually it was marked correctly as "-i": it is logical (i.e. the software that displays the message has to reverse the appropriate chars, and the order of words, according to the BIDI algorithm). Again, the difference is in the order of the characters (and words) in the document (e-mail, file, web page).
If I understand it correctly, "-i" is a new thing in the ISO. In the past there was only "iso-8859-8", and if you got a mail (or web page) with reversed Hebrew you would go and change your display encoding manually to "iso-8859-8 (logical)" or "iso-8859-8 (visual)", whichever gave you the right display. So "-i" was added recently so that programs can tell by themselves whether they should apply the BIDI algorithm before display (logical) or not (visual).
Regarding the character map itself -> all characters are *exactly* the same in "iso-8859-8" and "iso-8859-8-i".
Why does it work:
All of the most used Mail User Agents that support Hebrew send mail in "logical" char order. Mozilla-mail doesn't have "visual" char order at all. UTF-8 support for RTL languages *IS* "logical". So when you recode "SHALOM" from iso-8859-8 to utf-8 with the Perl Encode module, the module doesn't change the char (or word) order: it is converted to "SHALOM" in utf-8, and when my browser (or MUA) displays the message in UTF-8 it automatically invokes the BIDI algorithm and displays it correctly as "MOLAHS".
That is what happens when a mail arrives in the "article" table of my otrs database: otrs does the appropriate HTML generation and it doesn't (and shouldn't) care what the order of the Hebrew chars is, but my browser does know that if it's a UTF-8 RTL language it should be displayed in reverse char (and word) order.
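A quick way to convince yourself of the "recoding doesn't touch the order" point is the check below. It is a sketch in Python rather than Perl Encode, so it only shows the principle, not what otrs actually runs. The sample bytes are again "shalom" in iso-8859-8, logical order.

```python
import unicodedata

raw = b"\xf9\xec\xe5\xed"           # shin, lamed, vav, final mem in iso-8859-8

text = raw.decode("iso-8859-8")     # bytes -> Unicode, one char per byte
utf8 = text.encode("utf-8")         # Unicode -> UTF-8, still the same sequence

# Each Hebrew byte maps to U+05D0 + (byte - 0xE0); positions are untouched.
expected = [0x05D0 + (b - 0xE0) for b in raw]
code_points = [ord(ch) for ch in utf8.decode("utf-8")]
order_preserved = code_points == expected

# The characters carry bidi class "R"; that is what a BIDI-aware
# browser or MUA acts on when it decides to display them right to left.
bidi_classes = {unicodedata.bidirectional(ch) for ch in text}
```

So the recoding step only changes how each character is spelled in bytes; turning the text around is left entirely to whatever displays it.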
In the article_plain table (or "plain" view) this mail looks wrong. The message text (plain text message, no attachments) was encoded wrongly to UTF-8 and displays totally wrong characters. I've verified that the browser encoding is UTF-8 and used a unicode terminal to view the content of article_plain to double check that.
I have no idea about that. What does happen to a mail originally sent as iso-8859-8 instead of its counterpart, -i?
It's not Hebrew specific at all. Something is definitely wrong in the conversion of the ArticlePlain = {email}->as_string; to utf-8 when DefaultCharset -> utf-8 is set. I'm having trouble finding the exact location where the recoding to utf-8 is performed. My poor Perl abilities do not help. If someone would explain which modules are involved in the recoding, I'll probably find the problem.
It might also be a PostgreSQL problem. In table "article" the mail components are all "varchar"; in table "article_plain" the "body" field is of type "text". PostgreSQL might have problems handling encodings in different field types.
This problem has nothing to do with RTL or LTR or whatever. The same thing happens when I pipe test-email-8-bulgarian-cp1251.box (from the 1.2.3 tar.gz distro) into Postmaster.pl. It looks nice and Cyrillic (Bulgarian sounds funny to a Russian reader like me :-) in QueueView and ArticleZoom (data taken from the "article" table). When you click on the "plain" link you see that the message text itself is gibberish (data taken from the "article_plain" table; it looks like "European" letters). A simple "SELECT" query in uxterm (unicode-able xterm) confirms that the "message body" data for that article_id differs between those tables.
As the subject states, the main issue is solved. But the thing is that the "plain" feature loses its relevance: it doesn't keep the *original* mime source.
regards, Boris Ratner.