Re: [otrs] Postgresql UTF and other encodings - SOLVED (partially)

10 Jun 2004

      My comments might be boring - you might want to skip to the "Why does it 
work" part.

Robert Kehl wrote:
...
On Thursday, June 10, 2004 4:43 AM
Boris Ratner  wrote:
...
The problem occurred when I tried to send 'iso-8859-8' encoded
mail to otrs 1.2.3 *with* DefaultCharset->utf-8 set.
I used Mozilla-mail to do this. Mozilla mail sends 'iso-8859-8'
encoded mail with the charset set to 'iso-8859-8-i'.
This discovery puzzled me with regard to the python script I wrote 
to deal with this problem.
I went to the python equivalent of Alias.pm and found out that I had made 
this change before in python
(and totally forgot about it)
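
For illustration, an alias change of this kind could look like the following in python. This is a minimal sketch, not the actual script and not OTRS code: the names CHARSET_ALIASES, normalize_charset and decode_body are made up for the example. It relies only on "iso-8859-8-i" using the same character map as "iso-8859-8".

```python
# Hypothetical alias table: charset labels that share the character map
# of a codec python already knows under another name.
CHARSET_ALIASES = {
    "iso-8859-8-i": "iso-8859-8",  # logical-order hebrew, same bytes
}


def normalize_charset(label):
    """Return a codec name for a MIME charset label (case-insensitive)."""
    name = label.strip().lower()
    return CHARSET_ALIASES.get(name, name)


def decode_body(raw, label):
    """Decode raw mail bytes using the normalized charset label."""
    return raw.decode(normalize_charset(label))


if __name__ == "__main__":
    raw = b"\xf9\xec\xe5\xed"  # the hebrew word "shalom" in iso-8859-8
    text = decode_body(raw, "ISO-8859-8-I")
    print(text.encode("utf-8"))
```

Labels that are not in the table are passed through unchanged, so ordinary charsets keep working as before.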
...
One could think of Mozilla handling 8859-8-I, which is a LTR language,
the same way as 8859-8, which is to be written RTL. So Mozilla changed
the language direction to its wish? Um...
Mozilla doesn't have "iso-8859-8", it has only "iso-8859-8-i".
The difference is supposed to be that the "normal" one is visual and the 
"-i" one is logical
with regard to the arrangement of hebrew chars in the file.
Example: let's say that this is a hebrew string "MOLAHS" ("S" is 
actually the first char of the RTL hebrew word).
If you save it in "visual" then in the file it looks as it is seen on 
the screen: "MOLAHS" ("M" is
the first char encountered in the file). In "logical" it is written 
to the file in the "correct" char order,
"SHALOM" ("S" comes first in the file), and the displaying 
software (text editor, browser) *should* turn it around
and display it as "MOLAHS" on the screen.
Proof -> I'm using konsole with hebrew fonts set up;
it doesn't have the BIDI algorithm "smarts". If I "#cat 
iso-8859-8-i.eml" I see the hebrew inside reversed,
because the "cat" utility displays the chars in the order they appear 
in the file.
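
The same thing as a small python sketch, using the latin stand-in "SHALOM" from the example. It assumes a single, purely right-to-left run of text, where applying the BIDI algorithm amounts to reversing the characters; real text with mixed directions, digits or punctuation needs the full algorithm. The function names are made up for the example.

```python
def show_without_bidi(stored):
    """A terminal without BIDI (like cat in konsole) prints file order."""
    return stored


def show_with_bidi(stored):
    """Simplified BIDI for one pure RTL run: reverse for display."""
    return stored[::-1]


logical_file = "SHALOM"  # "-i": chars stored in reading order
visual_file = "MOLAHS"   # visual: chars stored as seen on screen

# Logical text needs BIDI at display time, visual text must not get it.
print(show_with_bidi(logical_file))     # MOLAHS
print(show_without_bidi(visual_file))   # MOLAHS
print(show_without_bidi(logical_file))  # SHALOM (looks reversed on screen)
```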
...
This is because you yourself composed a iso-8859-8 message, which was
flagged incorrectly by your mailer as iso-8859-8-i, which in the end you
mapped to iso-8859-8 again.
Actually it was marked correctly as "-i": it is logical (i.e. the 
software that displays the message
has to reverse the appropriate chars (and the order of words) 
according to the BIDI algorithm).
Again, the difference is in the order of the characters (and words) in 
the document (e-mail, file, web page).
If I understand it correctly, "-i" is not part of the ISO standard itself 
but a charset label added later for MIME (RFC 1556; the "i" stands for 
implicit directionality). In the past there was only "iso-8859-8",
and if you got a mail (or web page) with reversed hebrew you would go and 
change your display encoding manually
to "iso-8859-8 (logical)" or "iso-8859-8 (visual)", whichever gave 
you the right display.
So "-i" was added so that programs can determine 
by themselves whether they should apply the BIDI algorithm
before display (logical) or not (visual). Regarding the character map 
itself -> all characters are *exactly* the same
in "iso-8859-8" and "iso-8859-8-i".

Why does it work:

All of the most used Mail User Agents that support hebrew send mail in 
"logical" char order.
Mozilla-mail doesn't have "visual" char order at all.
UTF-8 text in RTL languages *IS* stored in "logical" order.
So when you recode with the perl Encode module from "SHALOM" in 
iso-8859-8 to utf-8,
the Encode module doesn't change the char (or word) order, so it is 
converted to "SHALOM" in utf-8,
and when my browser (or MUA) displays the message in UTF-8 it 
automatically invokes the BIDI algorithm
and displays it correctly as "MOLAHS".
That is what happens when a mail arrives in the "article" table of my otrs 
database: when
otrs does the appropriate HTML generation it doesn't (and shouldn't) 
care about the order of the hebrew chars,
but my browser does know that UTF-8 text in an RTL language should be 
displayed in reverse char (and word) order.
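
A python sketch of the same recoding step (python instead of the perl Encode module only to keep the example short; it is not the OTRS code, and the function name recode_to_utf8 is made up). It shows that converting from iso-8859-8 to utf-8 keeps the characters in the order they had in the mail, and that each hebrew letter carries the Unicode bidi class "R", which is what lets the displaying software decide by itself to render right-to-left.

```python
import unicodedata


def recode_to_utf8(raw, source_charset):
    """Recode bytes to utf-8 without touching the character order."""
    return raw.decode(source_charset).encode("utf-8")


raw = b"\xf9\xec\xe5\xed"  # shin, lamed, vav, final mem in iso-8859-8
utf8 = recode_to_utf8(raw, "iso-8859-8")
text = utf8.decode("utf-8")

# Same letters, same order: only the byte representation changed.
print([unicodedata.name(ch) for ch in text])

# Bidi class "R" marks right-to-left letters for the display software.
print([unicodedata.bidirectional(ch) for ch in text])
```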
...
...
In article_plain table (or "plain" view) this mail looks wrong. The
message text (plain text message, no attachments)
was encoded wrongly to UTF-8 and displays totally wrong characters.
I've verified that the browser encoding is UTF-8 and used a unicode terminal
to view the content of article_plain to
double check that.
I have no idea about that. What does happen to a mail originally sent as
iso-8859-8 instead of its counterpart, -i?
It's not hebrew specific at all. Something is definitely wrong in the 
conversion of
the ArticlePlain = {email}->as_string; to utf-8 when DefaultCharset -> 
utf-8 is set.
I'm having trouble finding the exact location where the recoding to 
utf-8 is performed.
My poor Perl abilities do not help. If someone would explain which 
modules are involved in the recoding
I'll probably find the problem. It also might be a Postgresql problem: 
in table "article" the mail components
are all "varchar", while in table "article_plain" the "body" field is of 
type "text" - Postgresql might have problems
handling encodings in different field types.
This problem has nothing to do with RTL or LTR or whatever.
The same thing happens when I pipe test-email-8-bulgarian-cp1251.box (from 
the 1.2.3 tar.gz distro) into Postmaster.pl.
It looks nice and cyrillic (bulgarian sounds funny to a russian reader 
like me :-) ) on QueueView and ArticleZoom (data taken from the "article" 
table). When you click on the "plain" link
you see that the message text itself is gibberish (data taken from the 
"article_plain" table; it looks like "european" letters).
A simple "SELECT" query in uxterm (unicode-capable xterm) confirms that the 
"message body" data
for that article_id differs between those tables.
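
One way such "european" looking gibberish can arise, as a python sketch: cp1251 bytes are decoded as if they were iso-8859-1 and then stored as utf-8. This is only an assumption about the kind of error involved, not a diagnosis of what OTRS or Postgresql actually does; the sample word and the variable names are made up.

```python
import unicodedata

original = "\u0417\u0434\u0440\u0430\u0432\u0435\u0439"  # a cyrillic word
raw = original.encode("cp1251")  # bytes as they would arrive in the mail

# Correct: decode with the charset named in the mail, then store as utf-8.
good = raw.decode("cp1251").encode("utf-8")

# Assumed mistake: treat the bytes as iso-8859-1 before storing as utf-8.
bad = raw.decode("iso-8859-1").encode("utf-8")

# The first stored character shows which script each version ends up in.
print(unicodedata.name(good.decode("utf-8")[0]))  # a CYRILLIC letter
print(unicodedata.name(bad.decode("utf-8")[0]))   # a LATIN letter

# The damage is reversible as long as no bytes were dropped.
repaired = bad.decode("utf-8").encode("iso-8859-1").decode("cp1251")
print(repaired == original)
```

If the stored "plain" body can be repaired this way, the bytes were misread on the way in rather than lost.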

As the subject states - the main issue is solved. But the thing is that 
the "plain" feature loses its relevance:
it doesn't keep the *original* mime source.

regards,
Boris Ratner.