Postgresql UTF and other encodings

Hi, List! I'm sorry in advance for asking the same question a second time.
My OTRS 1.2.3 (configured for UTF-8 by default) runs with PgSQL configured for UNICODE in the background and works well. BUT when a mail arrives in one of the encodings ISO-8859-8, ISO-8859-8-I or windows-1255, the database (not OTRS) screams "THAT IS NOT UNICODE WHAT YOU ARE TRYING TO WRITE". The mail is never registered in the system.
Therefore I'm searching for a tool to re-encode ANY incoming mail encoding into UTF-8 before passing it to the postmaster.
To replicate this problem it is not enough to send a mail that is merely labelled with some other encoding; you have to actually put some characters of that language in it (try a Chinese non-Unicode encoding).
This functionality would be useful because:
1. You will be able to have multiple languages on one page.
2. You don't have to deal with locale encoding in the system.
3. Perl already thinks in Unicode, so why shouldn't we use that?
Has anyone done something like this? Ideas are more than welcome.
Thanks for your help.
Kind Regards, Boris Ratner.

On Monday, May 31, 2004 10:05 PM
Boris Ratner
My OTRS 1.2.3 (configured for UTF-8 by default) runs with PgSQL configured for UNICODE in the background and works well. BUT when a mail arrives in one of the encodings ISO-8859-8, ISO-8859-8-I or windows-1255, the database (not OTRS) screams "THAT IS NOT UNICODE WHAT YOU ARE TRYING TO WRITE". The mail is never registered in the system.
What is your Perl version? If you have at least Perl 5.8.0, you should be able to convert from any encoding to UTF-8. This is not possible below Perl 5.8.0.
hth, Robert Kehl
-- ((otrs.de)) :: OTRS GmbH :: Norsk-Data-Str. 1 :: 61352 Bad Homburg http://www.otrs.de/ :: Tel. +49 (0)6172 4832388

On 5-06-2004 at 10:24, Robert Kehl wrote:
What is your Perl version? If you have at least Perl 5.8.0, you should be able to convert from any encoding to UTF-8. This is not possible below Perl 5.8.0.
Also, I strongly suggest upgrading to 5.8.4 rather than running 5.8.0 (which contains some major bugs in its UTF-8 support).
- Alessandro

Hi! My Perl version is 5.8.4.
To work around the issue of Perl pushing non-UTF-8 mail to the Unicode-based PostgreSQL, I did something more global: I wrote a simple script in (plz don't kill me) Python which runs before postmaster.pl. This script receives e-mail in *any* encoding and recodes it to UTF-8. In particular, "To", "From" and "Subject" are converted from "=?some_encoding?Q?=E2=E3....?=" to "=?utf-8?....?=". Of course the body of the message is converted too (that's the easy part).
It works great on plain text with no attachments. I still need to figure out what to do with multipart mail (HTML) and encoded message bodies.
Boris.
Alessandro Ranellucci wrote:
On 5-06-2004 at 10:24, Robert Kehl wrote:
What is your Perl version? If you have at least Perl 5.8.0, you should be able to convert from any encoding to UTF-8. This is not possible below Perl 5.8.0.
Also, I strongly suggest upgrading to 5.8.4 rather than running 5.8.0 (which contains some major bugs in its UTF-8 support).
- Alessandro
_______________________________________________
OTRS mailing list: otrs - Webpage: http://otrs.org/
Archive: http://lists.otrs.org/pipermail/otrs
To unsubscribe: http://lists.otrs.org/cgi-bin/listinfo/otrs
Support or consulting for your OTRS system? => http://www.otrs.de/
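
For readers who want to try the pre-processing approach described in this message, here is a minimal sketch in Python of recoding one header value and one body to UTF-8. It is not Boris's script (which was not posted to the list). The function names recode_header_to_utf8 and recode_body_to_utf8 are made up for illustration, and the sketch assumes single-part plain-text mail whose charset is already known and is one Python recognises.

```python
from email.header import Header, decode_header


def recode_header_to_utf8(raw_value: str) -> str:
    """Decode an RFC 2047 header value and re-encode it as UTF-8 encoded words."""
    parts = []
    for chunk, charset in decode_header(raw_value):
        if isinstance(chunk, bytes):
            # Unlabelled bytes are assumed to be ASCII in this sketch.
            # An unknown charset label raises LookupError here.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return Header("".join(parts), "utf-8").encode()


def recode_body_to_utf8(raw_body: bytes, source_charset: str) -> bytes:
    """Recode a plain-text body from source_charset to UTF-8."""
    return raw_body.decode(source_charset).encode("utf-8")


# Hypothetical sample: a four-letter Hebrew word labelled as iso-8859-8.
subject = recode_header_to_utf8("=?iso-8859-8?Q?=F9=EC=E5=ED?=")
body = recode_body_to_utf8(b"\xf9\xec\xe5\xed", "iso-8859-8")
```

Multipart mail and bodies with a Content-Transfer-Encoding (base64, quoted-printable) would need to be unpacked first; the sketch deliberately leaves that out, which matches the limitation Boris mentions.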

On Saturday, June 05, 2004 11:31 PM
Boris Ratner
To work around the issue of Perl pushing non-UTF-8 mail to the Unicode-based PostgreSQL, I did something more global: I wrote a simple script in (plz don't kill me) Python which runs before postmaster.pl. This script receives e-mail in *any* encoding and recodes it to UTF-8. In particular, "To", "From" and "Subject" are converted from "=?some_encoding?Q?=E2=E3....?=" to "=?utf-8?....?=". Of course the body of the message is converted too (that's the easy part). It works great on plain text with no attachments. I still need to figure out what to do with multipart mail (HTML) and encoded message bodies.
The above is one of the jobs OTRS performs, so there is no need to reinvent the wheel. It would be better to find the error and fix it. What is the setting of $Self->{DefaultLanguage} in your Config.pm?
Regards, Robert Kehl

OK, here it goes: I took Robert's advice and went as deep as possible with it. Here are the details:
The problem occurred when I tried to send 'iso-8859-8' encoded mail to OTRS 1.2.3 *with* DefaultCharset->utf-8 set. I used Mozilla Mail to do this. Mozilla Mail sends 'iso-8859-8' encoded mail with the encoding set to 'iso-8859-8-i'. Example:
" From: =?ISO-8859-8-I?Q?=E1=E5=F8=E9=F1_=F8=E0=E8=F0=F8?= "
The Perl Encode module doesn't have this encoding or an alias for it, so I added it as an alias to the "normal" form in /usr/lib/perl/5.8.4/Encode/Alias.pm. Now mail arrives successfully and it looks great in QueueView and ArticleZoom.
In the article_plain table (or "plain" view) this mail looks wrong. The message text (a plain-text message, no attachments) was encoded wrongly to UTF-8 and displays totally wrong characters. I've verified that the browser encoding is UTF-8 and used a Unicode terminal to view the content of article_plain to double-check that.
Thanks for your help.
Boris Ratner.
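
The alias idea can be illustrated outside Perl too. The sketch below is in Python and does not reproduce the change made to Alias.pm; CHARSET_ALIASES, resolve_charset and decode_encoded_word are hypothetical names. The alias table is kept local to the script so the example behaves the same whether or not the installed Python already recognises the "-i" label. It decodes the From header quoted in this message.

```python
import codecs
from email.header import decode_header

# Local alias table (an assumption of this sketch): the "-i" label uses the
# same character map as plain iso-8859-8, so it is mapped onto that codec.
CHARSET_ALIASES = {
    "iso-8859-8-i": "iso-8859-8",
}


def resolve_charset(label: str) -> str:
    """Return the canonical codec name for a MIME charset label."""
    key = label.strip().lower()
    return codecs.lookup(CHARSET_ALIASES.get(key, key)).name


def decode_encoded_word(raw_value: str) -> str:
    """Decode an RFC 2047 header value, applying the alias table first."""
    parts = []
    for chunk, charset in decode_header(raw_value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(resolve_charset(charset or "ascii"))
        parts.append(chunk)
    return "".join(parts)


# The header from the message above; the result is a Unicode string in
# logical order (two Hebrew words separated by a space).
sender = decode_encoded_word("=?ISO-8859-8-I?Q?=E1=E5=F8=E9=F1_=F8=E0=E8=F0=F8?=")
```

Only the label is remapped; the bytes are decoded with the ordinary iso-8859-8 table and their order is left alone.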

On Thursday, June 10, 2004 4:43 AM
Boris Ratner
The problem occurred when I tried to send 'iso-8859-8' encoded mail to OTRS 1.2.3 *with* DefaultCharset->utf-8 set. I used Mozilla Mail to do this. Mozilla Mail sends 'iso-8859-8' encoded mail with the encoding set to 'iso-8859-8-i'.
One could think of Mozilla handling 8859-8-I, which is an LTR language, the same way as 8859-8, which is to be written RTL. So Mozilla changed the language direction to its wish? Um...
example:" From: =?ISO-8859-8-I?Q?=E1=E5=F8=E9=F1_=F8=E0=E8=F0=F8?= " Perl Encode module doesn't have this encoding or alias - so i have
From http://search.cpan.org/~dankogai/Encode-2.01/lib/Encode/Supported.pod: "ISO-8859-8-1 [Hebrew] None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and MacHebrew are supported because and just because there were mappings available at http://www.unicode.org/). Contributions welcome."
added it as an alias to the "normal" form in /usr/lib/perl/5.8.4/Encode/Alias.pm. Now mail arrives successfully and it looks great on QueueView and ArticleZoom.
This is because you yourself composed an iso-8859-8 message, which was flagged incorrectly by your mailer as iso-8859-8-i, which in the end you mapped to iso-8859-8 again.
In the article_plain table (or "plain" view) this mail looks wrong. The message text (a plain-text message, no attachments) was encoded wrongly to UTF-8 and displays totally wrong characters. I've verified that the browser encoding is UTF-8 and used a Unicode terminal to view the content of article_plain to double-check that.
I have no idea about that. What happens to a mail originally sent as iso-8859-8 instead of its counterpart, -i?
Regards, Robert Kehl

My comments might be boring - you might want to skip to the "Why does it work" part.
Robert Kehl wrote:
On Thursday, June 10, 2004 4:43 AM Boris Ratner wrote:
The problem occurred when I tried to send 'iso-8859-8' encoded mail to OTRS 1.2.3 *with* DefaultCharset->utf-8 set. I used Mozilla Mail to do this. Mozilla Mail sends 'iso-8859-8' encoded mail with the encoding set to 'iso-8859-8-i'.
This discovery puzzled me with regard to the Python script I wrote to deal with this problem. I went to Python's equivalent of Alias.pm and found out that I had made this change before in Python (and totally forgot about it).
One could think of Mozilla handling 8859-8-I, which is an LTR language, the same way as 8859-8, which is to be written RTL. So Mozilla changed the language direction to its wish? Um...
Mozilla doesn't have "iso-8859-8"; it has only "iso-8859-8-i". The difference is supposed to be that the "normal" form is visual and the "-i" form is logical with regard to the arrangement of Hebrew chars in the file.
Example: let's say that this is a Hebrew string "MOLAHS" ("S" is actually the first char of the RTL Hebrew word). If you save it in "visual" order, then in the file it looks as it is seen on the screen: "MOLAHS" ("M" would be the first char encountered in the file). In "logical" order it would be written to the file with the "correct" char order "SHALOM" ("S" comes first in the file), and the displaying software (text editor, browser) *should* turn it around and display it as "MOLAHS" on the screen.
Proof -> I'm using konsole with Hebrew fonts set up; it doesn't have the BIDI algorithm "smarts". If I "#cat iso-8859-8-i.eml" I see the Hebrew inside reversed, because the "cat" utility displays the chars in the order they appear in the file.
This is because you yourself composed an iso-8859-8 message, which was flagged incorrectly by your mailer as iso-8859-8-i, which in the end you mapped to iso-8859-8 again.
Actually it was marked correctly as "-i": it is logical, i.e. the software that displays the message has to reverse the appropriate chars (and the order of words) according to the BIDI algorithm. Again, the difference is in the order of the characters (and words) in the document (e-mail, file, web page). If I understand it correctly, "-i" is a new thing in the ISO. In the past there was only "iso-8859-8", and if you got a mail (or web page) with reversed Hebrew you would go and change your display encoding manually to "iso-8859-8 (logical)" or "iso-8859-8 (visual)", whichever gave you the right display. So "-i" was added recently so that programs can tell for themselves whether they should apply the BIDI algorithm before display (logical) or not (visual). Regarding the character map itself -> all characters are *exactly* the same in "iso-8859-8" and "iso-8859-8-i".
Why does it work:
All of the most used Mail User Agents that support Hebrew send mail in "logical" char order. Mozilla Mail doesn't have "visual" char order at all. UTF-8 support for RTL languages *IS* "logical". So when you recode "SHALOM" from iso-8859-8 to utf-8 with the Perl Encode module, the module doesn't change the char (or word) order, so it is converted to "SHALOM" in utf-8, and when my browser (or MUA) displays the message in UTF-8 it automatically invokes the BIDI algorithm and displays it correctly as "MOLAHS". That is what happens when a mail arrives in the "article" table of my OTRS database: OTRS does the appropriate HTML generation and doesn't (and shouldn't) care about the order of the Hebrew chars, but my browser does know that a UTF-8 RTL language should be displayed in reverse char (and word) order.
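
The point that recoding leaves the character order untouched can be checked with a few lines of Python. This is only an illustrative sketch: the four sample bytes are a made-up logical-order Hebrew word, and the variable names are not taken from anything in the thread.

```python
import unicodedata

# Four iso-8859-8 bytes in logical order: shin, lamed, vav, final mem.
logical_bytes = b"\xf9\xec\xe5\xed"

# Recode to UTF-8 and read it back: only the byte representation changes.
text = logical_bytes.decode("iso-8859-8")
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The code points come back in the same (logical) order as in the input.
names = [unicodedata.name(ch) for ch in round_trip]

# Each letter carries the bidirectional class "R"; a BIDI-aware display
# (browser, MUA) uses that property to draw the run right-to-left.
bidi_classes = {unicodedata.bidirectional(ch) for ch in round_trip}

# Simplified picture of what a "visual" file would hold for one word:
# the same characters stored in reversed order.
visual = text[::-1]
```

A terminal without BIDI support, like the konsole setup mentioned above, simply prints the code points in stored order, which is why logical text looks reversed there.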
In the article_plain table (or "plain" view) this mail looks wrong. The message text (a plain-text message, no attachments) was encoded wrongly to UTF-8 and displays totally wrong characters. I've verified that the browser encoding is UTF-8 and used a Unicode terminal to view the content of article_plain to double-check that.
I have no idea about that. What happens to a mail originally sent as iso-8859-8 instead of its counterpart, -i?
It's not Hebrew-specific at all. Something is definitely wrong in the conversion of the ArticlePlain = {email}->as_string; to utf-8 when DefaultCharset -> utf-8 is set. I'm having trouble finding the exact location where the recoding to utf-8 is performed. My poor Perl abilities do not help. If someone would explain which modules are involved in recoding, I'll probably find the problem.
It might also be a PostgreSQL problem. In table "article" the mail components are all "varchar"; in table "article_plain" the "body" field is of type "text". Postgres might have problems handling encodings in different field types.
This problem has nothing to do with RTL or LTR or whatever. The same thing happens when I pipe test-email-8-bulgarian-cp1251.box (from the 1.2.3 tar.gz distro) into Postmaster.pl. It looks nice and Cyrillic (Bulgarian sounds funny to a Russian reader like me :-) in QueueView and ArticleZoom (data taken from the "article" table). When you click on the "plain" link you see that the message text itself is gibberish (data taken from the "article_plain" table; it looks like "European" letters). A simple "SELECT" query in uxterm (a Unicode-capable xterm) confirms that the "message body" data for that article_id differs between those tables.
As the subject states, the main issue is solved. But the thing is that the "plain" feature loses its relevance: it doesn't keep the *original* MIME source.
Regards, Boris Ratner.
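
The "European letters" symptom is what one would expect if cp1251 bytes were read as a Western single-byte charset before being stored as UTF-8. Whether that is what actually happens inside OTRS is an assumption here, not something the thread confirms. A small Python sketch with a made-up six-letter Cyrillic sample shows the effect:

```python
# Six cp1251 bytes spelling a Russian greeting (sample data for this sketch).
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"

# Decoded with the charset the mail declares: Cyrillic letters.
right = raw.decode("cp1251")

# Decoded as Latin-1 instead (the assumed mistake): accented Western letters.
wrong = raw.decode("latin-1")

# Both results are valid Unicode and both can be written to a UTF-8
# database without complaint, so the error only shows up on display.
stored_right = right.encode("utf-8")
stored_wrong = wrong.encode("utf-8")
```

Because both variants are valid UTF-8, PostgreSQL would accept either one, which fits the observation that the mail is stored but the two tables hold different text.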

The issue of wrong characters in the "plain" view of the article is solved by adding some charset recoding code to Kernel::System::EmailParser->GetPlainEmail();
I have found the following code to work (on my system):
    sub GetPlainEmail {
        my $Self = shift;
        # Decode the raw message text from the charset returned by GetCharset()
        return $Self->{EncodeObject}->Decode(
            Text => $Self->{Email}->as_string(),
            From => $Self->GetCharset()
        );
    }
Tell me what you think.
Kind Regards, Boris Ratner.
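
For comparison, the same idea (decode the raw message text using the charset the mail declares) can be sketched in Python. This is an analogue for illustration only, not OTRS code; plain_email_as_text is a hypothetical name, and the sketch assumes a single-part message whose Content-Type header carries the charset.

```python
from email import message_from_bytes


def plain_email_as_text(raw_message: bytes) -> str:
    """Return the raw message as Unicode text, decoded with its declared charset."""
    msg = message_from_bytes(raw_message)
    # Falls back to ASCII when no charset parameter is present (an assumption).
    charset = msg.get_content_charset() or "ascii"
    return raw_message.decode(charset, errors="replace")


# Hypothetical sample: a minimal cp1251 message with a six-letter body.
sample = (
    b"Content-Type: text/plain; charset=windows-1251\n"
    b"\n"
    b"\xcf\xf0\xe8\xe2\xe5\xf2\n"
)
plain_view = plain_email_as_text(sample)
```

Like the Perl version, this applies one charset to the whole message, so a multipart mail whose parts use different charsets would still need per-part handling.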
participants (4)
- Alessandro Ranellucci
- Boris Ratner
- boris@goldenmyth.co.il
- Robert Kehl