
When trying to migrate our OTRS database from MySQL to PostgreSQL, I noticed that the MySQL database uses LONGBLOBs in several places, which are binary objects, while OTRS creates TEXT columns for these columns when using PostgreSQL, which is a character type, not a binary one. Similarly, for Oracle and SQL Server, OTRS also uses character types for these columns. This makes it difficult to migrate between the databases, because you need to transform from binary to character using the proper encoding.

From looking at the source code, most of the LONGBLOBs are assumed to be UTF-8 encoded strings, which can be converted to PostgreSQL TEXT with convert_from(column, 'UTF8'), while other LONGBLOBs, e.g. those representing attachments, are assumed to be binary, i.e. kept in the original encoding of the attachment. These are handled differently by OTRS: for databases other than MySQL, OTRS stores them in BASE64 encoding. The conversion can be done in PostgreSQL with encode(column, 'base64').

My question is: Why do the other databases use character types for all the LONGBLOBs, including those that really contain binary data, so that for these databases a BASE64 encoding is necessary, which costs performance when encoding and decoding, plus 33% more storage? Wouldn't it be much better and easier if OTRS used BLOBs consistently across all databases for the binary columns, and TEXT for the other large columns? All the databases support this; the BLOB type is just called BYTEA in PostgreSQL. And MySQL also has LONGTEXT, so there is no need to use LONGBLOBs if the content is actually text.

Also, would it be possible to use BYTEA in PostgreSQL instead of TEXT for the binary data columns, changing the database schema accordingly and setting the DB::DirectBlob flag to 1 so that OTRS stops BASE64 encoding these columns? -- Christoph
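For illustration, the two conversions described above might look like this in PostgreSQL. This is only a sketch: the table and column names are hypothetical examples, not the actual OTRS schema, and it assumes the MySQL LONGBLOB data has first been imported into BYTEA staging columns.

```sql
-- LONGBLOBs that actually hold UTF-8 text become PostgreSQL TEXT:
UPDATE some_table
   SET text_col = convert_from(raw_col, 'UTF8');

-- LONGBLOBs that hold real binary data (e.g. attachments) have to be
-- base64-encoded, because OTRS expects that on non-MySQL backends:
UPDATE attachment_table
   SET content = encode(raw_content, 'base64');
```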

These are handled differently by OTRS - for databases other than MySQL, OTRS stores them in BASE64 encoding. The conversion can be done with Postgres using encode(column, 'base64').
I'd rather see MySQL use base64 too. All the world isn't Intel-based, and at least with base64 there is no ambiguity about what the byte values really are.
My question is: Why do the other databases use character objects for all the LONGBLOBs including those that really contain binary data, so that for these databases, a BASE64 encoding is necessary, which costs performance when encoding and decoding and 33% more storage? Wouldn't it be much better and easier if OTRS used BLOBs consistently across all databases for the binary columns, and TEXT for the other large columns?
It would make it much more difficult to move data cross-platform. Try setting up an OTRS install on a big-endian Linux on POWER machine and moving the database over from your Intel system. Those binary blobs are a royal pain. The base64 columns move perfectly and without any action on the admin's part.
Also, would it be possible to use BYTEA in PostgreSQL instead of TEXT for the binary data columns, changing the database scheme accordingly and setting the DB::DirectBlob flag to 1 so that OTRS stops BASE64 encoding these columns?
Are you storing attachments in the database? If so, you might consider changing that. Most of the big blobs are related to that, and the non-existence of those big opaque blobs in the database makes this issue a lot less relevant (to the point of being almost not measurable). (in fact, I think that should be the default...)

On 07.05.2015 at 15:02, David Boyes wrote:
These are handled differently by OTRS - for databases other than MySQL, OTRS stores them in BASE64 encoding. The conversion can be done with Postgres using encode(column, 'base64').
I'd rather see MySQL use base64 too. All the world isn't Intel-based, and at least with base64 there is no ambiguity about what the byte values really are.
LONGBLOBs are just streams of bytes. Intel or not (endianness) should not matter when you fetch and store the data using database methods. BLOBs are good for storing binary objects such as attachments or emails that you want to keep in their original encoding. Everything else should be stored as (unicode) text. Storing binary data BASE64 encoded is not optimal when the database supports BLOBs.
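The size overhead of base64 is easy to demonstrate: it emits 4 output characters for every 3 input bytes. A quick illustration in PostgreSQL (the 3000-byte test value is arbitrary):

```sql
-- 3000 raw bytes grow to roughly 4000 base64 characters (~33% more),
-- plus the line breaks that encode() inserts every 76 characters.
SELECT octet_length(decode(repeat('00', 3000), 'hex'))                   AS raw_bytes,
       octet_length(encode(decode(repeat('00', 3000), 'hex'), 'base64')) AS base64_bytes;
```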
It would make it much more difficult to move data cross-platform. Try setting up an OTRS install on a big-endian Linux on POWER machine and moving the database over from your Intel system. Those binary blobs are a royal pain. The base64 columns move perfectly and without any action on the admin's part.
I had no difficulties moving the data from the MySQL LONGBLOB columns to PostgreSQL BYTEA columns. They were exported as byte streams and imported as byte streams. The difficulties only appeared because OTRS treats PostgreSQL and the other databases differently from MySQL, as if they did not support binary data, for no apparent reason.
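If one did switch the PostgreSQL schema to BYTEA as suggested in the original question, the existing base64-encoded TEXT columns could be converted in place. An untested sketch (the table and column names are placeholders, and whether setting DB::DirectBlob to 1 is sufficient on the application side is an assumption taken from the question above):

```sql
-- Convert a base64-encoded TEXT column to native BYTEA in place:
ALTER TABLE attachment_table
  ALTER COLUMN content TYPE BYTEA
  USING decode(content, 'base64');
```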
Are you storing attachments in the database? If so, you might consider changing that.
Thanks for mentioning that, I'll consider it. I found the script ArticleStorageSwitch.pl, which should make this possible. However, it may also have disadvantages: you need to back up the database and filesystem separately, and consistency between the two is not automatically guaranteed through foreign keys. Also, there are some other BASE64-encoded columns besides those storing the attachments. -- Christoph

LONGBLOBs are just streams of bytes. Intel or not (endianness) should not matter when you fetch and store the data using database methods. BLOBs are good for storing binary objects such as attachments or emails that you want to keep in their original encoding. Everything else should be stored as (unicode) text. Storing binary data BASE64 encoded is not optimal when the database supports BLOBs.
No argument that it's not optimal, but at least it's the *same* representation regardless of database engine or platform. And base64 encoding guarantees that the same input string is interpreted identically on all platforms and all databases. I have a few sites that deal in full 32-bit ISO 10646 character sets, which add a whole other dimension to the annoyance of dealing with binary blobs. Use of base64 also allows databases on non-ASCII platforms to deliver and store the data in a portable form (for example, z/OS DB/2 makes a really nice OTRS back end if you have it, and DFSMShsm knows what to do with USS files, so automatic recall from tape works like a champ. I don't have to worry about attachment file sizes at all. 8-)).
The difficulties only appeared because OTRS treats PostgreSQL and the other databases differently from MySQL, as if they did not support binary data, for no apparent reason.
I think we're arguing for the same point, but from different starting positions. I agree that there shouldn't be a difference between how MySQL and the other databases store the data. Given that the parsing is (and should be) done in the application logic, the overhead should be fairly minimal, and distributed.
Are you storing attachments in the database? If so, you might consider changing that.
Thanks for mentioning that, I'll consider it. I found the script ArticleStorageSwitch.pl, which should make this possible. However, it may also have disadvantages: you need to back up the database and filesystem separately, and consistency between the two is not automatically guaranteed through foreign keys. Also, there are some other BASE64-encoded columns besides those storing the attachments.
Yes, you're correct, but the vast majority of the performance impact you're commenting on is going to be dealing with the attachment blobs. The other columns are tiny in comparison.

I also think you'll find that your backup load is much smaller with attachments outside the database -- every time you change the database, the whole file has to get dumped (unless you have a fairly smart backup system and are doing table-level backups from inside the database). Having the attachments outside the database will significantly reduce the amount of data backed up, in that change detection is much more effective with external files. Most databases are really not optimized for use as opaque object stores.

Keep in mind that the attachments are referenced only when opened. It's just opening an opaque object reference, so I don't see what consistency guarantees (other than the correct file getting opened when clicked on) you need. The object exists, or it doesn't. If it doesn't, you know what to do, and it's operationally a lot easier to manage whether a file exists or not. It also allows you to use much more intelligent object stores (such as Ceph or DFSMShsm on z/OS) with corresponding improvements in storage performance and compliance auditing capabilities.

On 07.05.2015 at 18:17, David Boyes wrote:
I also think you'll find that your backup load is much smaller with attachments outside the database -- every time you change the database, the whole file has to get dumped (unless you have a fairly smart backup system and are doing table-level backups from inside the database).
That's a good point, sure. You just have to make sure that the backup intervals of the database and the filesystem are similar, so that a recovery yields a consistent system.
Keep in mind that the attachments are referenced only when opened. It's just opening an opaque object reference, so I don't see what consistency guarantees (other than the correct file gets opened when clicked on) you need. The object exists, or it doesn't.
I mainly want to make sure that no stale files (files which have no corresponding tickets in the database) accumulate over time. But you're right: as tickets and articles are usually not deleted in OTRS (or cannot even be deleted by users), that should not be a problem. -- Christoph
participants (2)
- Christoph Zwerschke
- David Boyes