
Hi,

today I was thinking about improvements for the fulltext search (specifically the article body), which I want to discuss here. :)

Current situation / Use case:

o OTRS 2.x
o Intel Core2 2.40GHz

A fulltext search on body with "smith" takes about 15 seconds on large installations (done by a normal LIKE search).

Does somebody have ideas how to improve this? I have 3 ideas.

a) Striping/Unique Content
--------------------------
The idea is quite simple: create a mirror article table (e. g. article_search) and store the body of each article, reduced to its unique words, in the article_search table. OTRS can then use this table 1:1 for fulltext searches.

b) Word-Table/Ref-Word-Article-Table
------------------------------------
The idea is to put all words used in article bodies into a search_word (id, word) table and put the relations describing which words are used in which article body into a search_word_article (word_id, article_id) table. OTRS can then use these tables for fulltext searches.

c) Create Index of frequent search words
----------------------------------------
The idea is to put all frequently searched words into an extra index, and to search infrequently searched words in real time.
Result with about 12,000 tickets and 50,000 articles:

             | article      | a) article    | b) search_words | c) dedicated
             | (default)    | mirror uniq   | and ref table   | search_word
             |              | words         |                 | index
-------------+--------------+---------------+-----------------+-----------------
Storage Size | 60M          | 16M           | 23M             | 8M / 2000 words
-------------+--------------+---------------+-----------------+-----------------
Search Time  | 4s           | 2s            | 1s              | Index 1s
             |              |               |                 | No Index 4s
-------------+--------------+---------------+-----------------+-----------------
Time to      | 0 min        | 30 min        | 2.5 h           | 1 h
create Index |              |               |                 |
-------------+--------------+---------------+-----------------+-----------------

Result with about 70,000 tickets and 240,000 articles:

             | article      | a) article    | b) search_words | c) dedicated
             | (default)    | mirror uniq   | and ref table   | search_words
             |              | words         |                 |
-------------+--------------+---------------+-----------------+-----------------
Storage Size | 440M         | 121M          | 394M            | 50M / 2000 words
-------------+--------------+---------------+-----------------+-----------------
Search Time  | 17s          | 5.8s          | 48s             | Index 3.0s
             |              |               |                 | No Index 17s
-------------+--------------+---------------+-----------------+-----------------
Time to      | 0 min        | 1 h           | 9 h             | 4 h
create Index |              |               |                 |
-------------+--------------+---------------+-----------------+-----------------

So in my opinion, for small setups b) would work fine, but it would not scale to larger installations. c) looks good for large installations until non-indexed words are searched (e. g. searches for CVE IDs or other numbers would always fall back to a non-indexed search). a) looks robust for both large and small installations.

Any other ideas or questions?

-Martin

--
((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing
Fon: +49 (0) 9421 56818-0 :: Fax: +49 (0) 9421 56818-18
http://www.otrs.com/ :: Communication with success!

Address of record: Bad Homburg
Local Court: Bad Homburg, HRB 10751
Tax number: 003/240/97505
Chairman: Burchard Steinbild
Chief Executive Officer: André Mindermann
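The word/ref-table idea b) can be sketched like this. This is a minimal illustration in Python with SQLite, using the table names from the proposal (search_word, search_word_article); OTRS itself is Perl and DBMS-independent, so this only shows the mechanics, not an implementation.

```python
# Sketch of strategy b): a word table plus a word->article reference table.
# SQLite is used purely for illustration; the sample articles are invented.
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, a_body TEXT);
    CREATE TABLE search_word (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE search_word_article (word_id INTEGER, article_id INTEGER);
    CREATE INDEX idx_swa_word_id ON search_word_article(word_id);
""")

def index_article(article_id, body):
    """Split the body into its unique lower-case words and store the relations."""
    con.execute("INSERT INTO article (id, a_body) VALUES (?, ?)",
                (article_id, body))
    for word in set(re.findall(r"\w+", body.lower())):
        con.execute("INSERT OR IGNORE INTO search_word (word) VALUES (?)",
                    (word,))
        (word_id,) = con.execute(
            "SELECT id FROM search_word WHERE word = ?", (word,)).fetchone()
        con.execute(
            "INSERT INTO search_word_article (word_id, article_id) VALUES (?, ?)",
            (word_id, article_id))

def fulltext_search(word):
    """Find article ids containing the word via the index, without LIKE."""
    rows = con.execute("""
        SELECT swa.article_id
          FROM search_word sw, search_word_article swa
         WHERE sw.word = ? AND swa.word_id = sw.id
    """, (word.lower(),)).fetchall()
    return sorted(r[0] for r in rows)

index_article(1, "Mr. Smith reported a printer problem")
index_article(2, "Follow-up from smith about the printer")
index_article(3, "Unrelated network outage")
print(fulltext_search("smith"))   # -> [1, 2]
```

The point of the extra tables is that the lookup on search_word.word is an exact (or prefix) match, so the UNIQUE index on it can be used instead of scanning every article body.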

Hi,
On 2008-04-19 at 13:29:04 [+0200], Martin Edenhofer wrote:
today I was thinking about improvements for the fulltext search (specifically the article body), which I want to discuss here. :)
Current situation / Use case:
o OTRS 2.x
o Intel Core2 2.40GHz
A fulltext search on body with "smith" takes about 15 seconds on large installations (done by a normal LIKE search).
Does somebody have ideas how to improve this?
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea if all supported DBMSes provide such a feature, but I do know that mysql, oracle and mssql have it.
I do have 3 ideas.
a) Striping/Unique Content
--------------------------
The idea is quite simple: create a mirror article table (e. g. article_search) and store the body of each article, reduced to its unique words, in the article_search table. OTRS can then use this table 1:1 for fulltext searches.
This setup reduces the size of the table being searched, but it will not scale well with large datasets, since searching will still involve a full table scan. And while being a bit faster than searching the real article table, this strategy makes it impossible to search for word phrases like "%ticket search%" (though I do not know if OTRS currently supports that at all).
b) Word-Table/Ref-Word-Article-Table
------------------------------------
The idea is to put all words used in article bodies into a search_word (id, word) table and put the relations describing which words are used in which article body into a search_word_article (word_id, article_id) table. OTRS can then use these tables for fulltext searches.
This should speed up searching through large datasets considerably, especially if the search phrases are anchored at the beginning ('content' or 'conte%', but not '%ntent'), since the search can then use the index and skip the full table scan altogether. AFAICS, the DBMS implementations of fulltext indices more or less use this approach and optimize it by maintaining a list of stop words (words considered irrelevant for search results) that are left out of the indices and dropped from the given list of search keys. So if we added stop words to that strategy, I think it should improve the searching speed a lot.
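The point about anchored search phrases can be demonstrated with SQLite's query planner (any DBMS behaves similarly; the table and index names here are made up for the demonstration): a prefix pattern can be answered with an index seek, while a leading wildcard forces a scan of every row.

```python
# Demonstration: 'conte%' can use an index on the word column,
# '%ntent' cannot. Table/index names are invented for this example.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA case_sensitive_like = ON")  # lets LIKE 'x%' use the index
con.execute("CREATE TABLE search_word (id INTEGER PRIMARY KEY, word TEXT)")
con.execute("CREATE INDEX idx_word ON search_word(word)")

def plan(sql):
    """Return the query plan SQLite chooses for the given statement."""
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT id FROM search_word WHERE word LIKE 'conte%'"))
# a SEARCH: index seek on the 'conte' prefix
print(plan("SELECT id FROM search_word WHERE word LIKE '%ntent'"))
# a SCAN: every row must be examined
```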
c) Create Index of frequent search words
----------------------------------------
The idea is to put all frequently searched words into an extra index, and to search infrequently searched words in real time.
I suppose the quality and speed of this strategy depends a lot on the meaning of "frequently searched words". How did you determine these?
Result with about 12,000 tickets and 50,000 articles:
             | article      | a) article    | b) search_words | c) dedicated
             | (default)    | mirror uniq   | and ref table   | search_word
             |              | words         |                 | index
-------------+--------------+---------------+-----------------+-----------------
Storage Size | 60M          | 16M           | 23M             | 8M / 2000 words
-------------+--------------+---------------+-----------------+-----------------
Search Time  | 4s           | 2s            | 1s              | Index 1s
             |              |               |                 | No Index 4s
-------------+--------------+---------------+-----------------+-----------------
Time to      | 0 min        | 30 min        | 2.5 h           | 1 h
create Index |              |               |                 |
-------------+--------------+---------------+-----------------+-----------------
Result with about 70,000 tickets and 240,000 articles:
             | article      | a) article    | b) search_words | c) dedicated
             | (default)    | mirror uniq   | and ref table   | search_words
             |              | words         |                 |
-------------+--------------+---------------+-----------------+-----------------
Storage Size | 440M         | 121M          | 394M            | 50M / 2000 words
-------------+--------------+---------------+-----------------+-----------------
Search Time  | 17s          | 5.8s          | 48s             | Index 3.0s
             |              |               |                 | No Index 17s
-------------+--------------+---------------+-----------------+-----------------
Time to      | 0 min        | 1 h           | 9 h             | 4 h
create Index |              |               |                 |
-------------+--------------+---------------+-----------------+-----------------
So in my opinion, for small setups b) would work fine, but it would not scale to larger installations.
Hm, this surprises me, since being able to use an index should really speed up the search on larger datasets, not slow it down - as is indicated by the numbers given above.
c) looks good for large installations until non-indexed words are searched (e. g. searches for CVE IDs or other numbers would always fall back to a non-indexed search). a) looks robust for both large and small installations.
From an algorithmic perspective, a) should be no considerable improvement over the original (current) searching strategy. Neither involves an index, so both will correlate linearly with the size of the dataset - O(n) [and get much worse if the dataset fails to fit into memory]. Making use of an index as in b) and c) should improve that correlation to O(log n).
Don't get me wrong, I'm not saying your numbers are wrong; I am merely struggling to find out why these numbers are the way they are.
Any other ideas or questions?
Yes, I do have a couple of questions:

- What were the exact SQL statements you used to yield those numbers?
- In strategy b), there was an index on search_words.word, right? ;-)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?

So far for now.

cheers,
Oliver

--
((otrs)) :: OTRS AG :: Europaring 4 :: D - 94315 Straubing
Fon: +49 (0) 9421 56818-0 :: Fax: +49 (0) 9421 56818-18
http://www.otrs.com/ :: Communication with success!

Hi,

On Apr 19, 2008, at 15:51, Oliver Tappe wrote:
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea if all supported DBMS provide such a feature but I do know that mysql, oracle and mssql have it.
I tried this for MySQL and DB2. The problem with both is that they only create indexes on words (as far as I figured out). So I ran into not finding ticket#... so I tried other ideas... but maybe I'm wrong. Of course, they would also support other cool features like "sounds like"... :)
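The ticket# problem comes from tokenization: word-based fulltext indexers split on non-word characters, so the '#' never reaches the index. A minimal illustration (the real MySQL/DB2 tokenizers differ in detail, and the ticket number and body are invented):

```python
# Why a word-based fulltext index can miss "ticket#1234": a typical
# tokenizer splits on non-word characters, so only 'ticket' and '1234'
# end up as separate index entries, never the combined search key.
import re

def tokenize(text):
    """Split text into lower-case word tokens, as a naive word index would."""
    return re.findall(r"\w+", text.lower())

body = "Re: Ticket#1234 printer still broken"
print(tokenize(body))                        # 'ticket' and '1234' are separate
print("ticket#1234" in tokenize(body))       # the exact key is not indexed
```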
c) Create Index of frequent search words
----------------------------------------
The idea is to put all frequently searched words into an extra index, and to search infrequently searched words in real time.
I suppose the quality and speed of this strategy depends a lot on the meaning of "frequently searched words". How did you determine these?
At this point I had not determined this; I only took some words to find out what performance I get. ;)
So in my opinion, for small setups b) would work fine, but it would not scale to larger installations.
Hm, this surprises me, since being able to use an index should really speed up the search on larger datasets, not slow it down - as is indicated by the numbers given above.
c) looks good for large installations until non-indexed words are searched (e. g. searches for CVE IDs or other numbers would always fall back to a non-indexed search). a) looks robust for both large and small installations.
From an algorithmic perspective, a) should be no considerable improvement over the original (current) searching strategy. Neither involves an index, so both will correlate linearly with the size of the dataset - O(n) [and get much worse if the dataset fails to fit into memory]. Making use of an index as in b) and c) should improve that correlation to O(log n).
Don't get me wrong, I'm not saying your numbers are wrong; I am merely struggling to find out why these numbers are the way they are.
Any other ideas or questions?
Yes, I do have a couple of questions:
- What were the exact SQL-statements you used to yield those numbers?
Maybe this is the main problem. A ticket search means I need to search over different tables and attributes. So in this case I need to involve the ticket, article and queue tables. I also need to keep the permissions on tickets/queues in mind.

A query looks like:

SELECT DISTINCT st.id, st.tn, st.create_time_unix
  FROM ticket st, queue sq, article at
 WHERE sq.id = st.queue_id
   AND st.id = at.ticket_id
   AND sq.group_id IN (1, 2, 240, 3, 31, 312, 32)
   AND (LOWER(at.a_body) LIKE LOWER('%smith%'))
 ORDER BY st.create_time_unix DESC

So the problem is that I only need to search articles where I have ro access.
- In strategy b), there was an index on search_words.word, right? ;-)
Yes. :)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?
Yes. Data from live databases.

-Martin

On 2008-04-19 at 16:15:55 [+0200], Martin Edenhofer wrote:
On Apr 19, 2008, at 15:51 , Oliver Tappe wrote:
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea if all supported DBMS provide such a feature but I do know that mysql, oracle and mssql have it.
I tried this for MySQL and DB2. The problem with both is that they only create indexes on words (as far as I figured out). So I ran into not finding ticket#... so I tried other ideas... but maybe I'm wrong.
I think with mysql searching for "ticket#1234" should work and you can even apply globs, too (conten*).
Of course, they also would support other cool features like "sounds like"... :)
Yep, and they provide a relevance ranking, too :-) [ ... ]
- What were the exact SQL-statements you used to yield those numbers?
Maybe this is the main problem. A ticket search means I need to search over different tables and attributes. So in this case I need to involve the ticket, article and queue tables. I also need to keep the permissions on tickets/queues in mind.
A query looks like:
SELECT DISTINCT st.id, st.tn, st.create_time_unix
  FROM ticket st, queue sq, article at
 WHERE sq.id = st.queue_id
   AND st.id = at.ticket_id
   AND sq.group_id IN (1, 2, 240, 3, 31, 312, 32)
   AND (LOWER(at.a_body) LIKE LOWER('%smith%'))
 ORDER BY st.create_time_unix DESC
Yes, I know what the standard OTRS search query looks like, but for strategy b) you'd have to use a different statement, as otherwise the index would never be used. Did the SQL statement for strategy b) still include "LOWER(at.a_body) LIKE LOWER('%smith%')"?

In general, I'd suggest evaluating searching through articles in an isolated manner (leaving out all the other search keys and the access rights). This way it is much easier to compare the performance of the different searching strategies (as there won't be any side effects from all those other search parameters). It would always be possible to do the fulltext search first and then reduce its results by applying all the other checks in a second SQL statement. If no keywords have been entered, the first (fulltext) part could be skipped, of course.
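The split approach could look roughly like this; a sketch with SQLite in Python, using heavily simplified versions of the ticket/queue/article tables from the thread (the sample data and helper names are invented):

```python
# Two-phase search sketch: phase 1 does only the keyword lookup,
# phase 2 applies the permission check to the resulting article ids.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE queue  (id INTEGER PRIMARY KEY, group_id INTEGER);
    CREATE TABLE ticket (id INTEGER PRIMARY KEY, queue_id INTEGER);
    CREATE TABLE article (id INTEGER PRIMARY KEY, ticket_id INTEGER, a_body TEXT);
    INSERT INTO queue  VALUES (1, 1), (2, 99);
    INSERT INTO ticket VALUES (10, 1), (11, 2);
    INSERT INTO article VALUES (100, 10, 'mail from smith'),
                               (101, 11, 'smith again, restricted queue');
""")

def fulltext_phase(word):
    """Phase 1: plain keyword lookup, no joins, no permission logic."""
    rows = con.execute(
        "SELECT id FROM article WHERE a_body LIKE ?", ("%" + word + "%",))
    return sorted(r[0] for r in rows)

def permission_phase(article_ids, allowed_groups):
    """Phase 2: reduce the hits to tickets in queues the user may read."""
    ids = ",".join("?" * len(article_ids))
    groups = ",".join("?" * len(allowed_groups))
    rows = con.execute(f"""
        SELECT DISTINCT st.id
          FROM ticket st, queue sq, article at
         WHERE sq.id = st.queue_id AND st.id = at.ticket_id
           AND at.id IN ({ids}) AND sq.group_id IN ({groups})
    """, article_ids + allowed_groups)
    return sorted(r[0] for r in rows)

hits = fulltext_phase("smith")        # both articles match the keyword
print(permission_phase(hits, [1]))    # only ticket 10 is readable -> [10]
```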
So the problem is that I only need to search articles where I have ro access.
- In strategy b), there was an index on search_words.word, right? ;-)
Yes. :)
- The two datasets you mentioned - did you create them artificially or are they copies of real setups (i.e. is the content "real" or computer generated)?
Yes. Data from live databases.
Ah, good.

cheers,
Oliver

Hi,

On Apr 19, 2008, at 17:06, Oliver Tappe wrote:
One idea would be to make use of any fulltext indexing capability that the DBMS provides. I have no idea if all supported DBMS provide such a feature but I do know that mysql, oracle and mssql have it.
I tried this for MySQL and DB2. The problem with both is that they only create indexes on words (as far as I figured out). So I ran into not finding ticket#... so I tried other ideas... but maybe I'm wrong.
I think with mysql searching for "ticket#1234" should work and you can even apply globs, too (conten*).
Ok, I'll have a second look at it...
- What were the exact SQL-statements you used to yield those numbers?
Maybe this is the main problem. A ticket search means I need to search over different tables and attributes. So in this case I need to involve the ticket, article and queue tables. I also need to keep the permissions on tickets/queues in mind.
A query looks like:
SELECT DISTINCT st.id, st.tn, st.create_time_unix
  FROM ticket st, queue sq, article at
 WHERE sq.id = st.queue_id
   AND st.id = at.ticket_id
   AND sq.group_id IN (1, 2, 240, 3, 31, 312, 32)
   AND (LOWER(at.a_body) LIKE LOWER('%smith%'))
 ORDER BY st.create_time_unix DESC
Yes, I know what the standard OTRS search query looks like, but for strategy b) you'd have to use a different statement, as otherwise the index would never be used. Did the SQL statement for strategy b) still include "LOWER(at.a_body) LIKE LOWER('%smith%')"?
In general, I'd suggest evaluating searching through articles in an isolated manner (leaving out all the other search keys and the access rights). This way it is much easier to compare the performance of the different searching strategies (as there won't be any side effects from all those other search parameters). It would always be possible to do the fulltext search first and then reduce its results by applying all the other checks in a second SQL statement. If no keywords have been entered, the first (fulltext) part could be skipped, of course.
Yes, for a), b) and c) I only created lower-case indexes and lower-case queries, so SQL LOWER was not needed anymore. :)

I also like the idea to split/isolate fulltext searches from permissions and other ticket attributes. But I do see one problem. Maybe you have an idea.

In case I'm searching for, let's say, "bmw" and I get 8,000 hits: I take these 8,000 out of the database, then I need to do the permission check (and, where applicable, maybe other ticket attribute checks like state, created, ...). But the permission check alone will eat up the time won by the fast fulltext search (because of the DB lookups needed for it). Do you have an idea how to handle this?

-Martin

On 2008-04-19 at 17:18:43 [+0200], Martin Edenhofer wrote:
On Apr 19, 2008, at 17:06 , Oliver Tappe wrote:
[ ... ]
Yes, for a), b) and c) I only created lower-case indexes and lower-case queries, so SQL LOWER was not needed anymore. :)
Getting rid of LOWER is good. It would be great if we'd be able to get rid of LIKE, too - especially the version starting with a %. But I am a bit puzzled: what index did you create for strategy a)?
I also like the idea to split/isolate fulltext searches from permissions and other ticket attributes.
But I do see one problem. Maybe you have an idea.
In case I'm searching for, let's say, "bmw" and I get 8,000 hits: I take these 8,000 out of the database, then I need to do the permission check (and, where applicable, maybe other ticket attribute checks like state, created, ...). But the permission check alone will eat up the time won by the fast fulltext search (because of the DB lookups needed for it). Do you have an idea how to handle this?
Well, we could pick up the 8,000 matching article IDs from the DB and then prepare the second SQL statement (the one that does the permission-related and other checks) with a large IN clause for the IDs (containing maybe 500 or 1000 ?-placeholders). That statement would then be executed in a loop (passing in the next 500 or 1000 IDs via bind params in each round).

But all this would have to be tested and tried for every DBMS, I'm afraid, as some are probably going to have limits on the number of bind params that can be used in a single statement. I think if we do some testing across DBMSes, we will find out that some are faster with one big statement (doing keyword search and everything else in one go) and others are faster with the split approach.

cheers,
Oliver
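The batched IN-clause loop could be sketched like this; an illustration with SQLite in Python, where the batch size, table layout and helper name are invented for the example:

```python
# Sketch of the chunked IN-clause idea: feed the matching article ids to
# the second statement in batches of bind parameters, so no single
# statement exceeds the DBMS's placeholder limit.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, ticket_id INTEGER)")
# Invented data: 8000 articles spread over 800 tickets (10 articles each).
con.executemany("INSERT INTO article VALUES (?, ?)",
                [(i, i // 10) for i in range(8000)])

def tickets_for_articles(article_ids, batch_size=500):
    """Resolve article ids to ticket ids, batch_size bind params at a time."""
    tickets = set()
    for start in range(0, len(article_ids), batch_size):
        batch = article_ids[start:start + batch_size]
        marks = ",".join("?" * len(batch))
        rows = con.execute(
            f"SELECT DISTINCT ticket_id FROM article WHERE id IN ({marks})",
            batch)
        tickets.update(r[0] for r in rows)
    return sorted(tickets)

matches = list(range(8000))                 # pretend the fulltext phase found 8000 hits
print(len(tickets_for_articles(matches)))   # 800 distinct tickets
```

Whether this beats one big combined statement will indeed depend on the DBMS, as noted above; the batch size would have to stay below each DBMS's bind-parameter limit.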
participants (2)

- Martin Edenhofer
- Oliver Tappe